diweshojha.com

Intro to Pandas

October 10, 2018 | by ojhadiwesh@gmail.com


One of the most useful and important libraries that we use in data analysis is the Pandas library for Python. It is a fast tool for quickly performing manipulations, transformations and modelling of data. This article introduces some of the basic functions of pandas by using them on a simple data set.

Let’s first introduce the data set. It is the churn data of a telecom company that wants to analyze why customers choose to change operators, so meaningful analysis of this data would help the company make decisions that improve customer retention.

  • The main data structures of Pandas are implemented by the Series and DataFrame classes
  • A Series is a one-dimensional indexed array of a single data type
  • A DataFrame is a two-dimensional data structure, that is, a table
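The two structures above can be sketched as follows. This is a minimal illustration, not the original notebook code; the column names and values are made up for the example.

```python
import pandas as pd

# A Series: one-dimensional, indexed, one dtype (values are made up)
tenure = pd.Series([1, 34, 2], index=["a", "b", "c"], name="tenure")

# A DataFrame: a table whose columns are Series sharing one index
df = pd.DataFrame(
    {"tenure": [1, 34, 2], "MonthlyCharges": [29.85, 56.95, 53.85]},
    index=["a", "b", "c"],
)

print(tenure.loc["b"])     # look up a value by its index label
print(type(df["tenure"]))  # selecting one column gives back a Series
```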

We can import many file formats such as csv, excel, txt and hdf5 into the workspace using the pd.read_… functions. Since our data set is in csv format, we use pd.read_csv('path') to import it. This creates a DataFrame object that can be manipulated as we wish.
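A small sketch of the import step. Because the real churn file is not available here, an in-memory string stands in for the file on disk; the column names and rows are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; columns and values are hypothetical
csv_text = """CustomerID,tenure,MonthlyCharges,Churn
C001,1,29.85,No
C002,34,56.95,No
C003,2,53.85,Yes
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the csv.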

The head() and tail() methods show the top and bottom rows of the data. shape, info() and columns give a first check that the data has the dimensions we were expecting. info() also reports the datatype and the number of non-null values of each column.
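These first-look calls might be used like this. The small frame below is a made-up stand-in for the churn data.

```python
import pandas as pd

# Hypothetical stand-in for the churn data
df = pd.DataFrame({
    "CustomerID": ["C001", "C002", "C003", "C004"],
    "tenure": [1, 34, 2, 45],
    "Churn": ["No", "No", "Yes", "No"],
})

first_two = df.head(2)  # top rows (default is 5)
last_two = df.tail(2)   # bottom rows

print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # column labels
df.info()          # dtypes and non-null counts per column
```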

describe() provides summary statistics for the numerical columns: the count, mean, standard deviation, min, quartiles (the 50% quartile is the median) and max. This helps a lot if you have to make some quick calculations.

By default, describe() is limited to numerical columns. For non-numerical columns we need to specify the datatype as 'object': df.describe(include=['object']).
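A sketch of both calls on a made-up frame. The text column is given an explicit object dtype here so that include=['object'] selects it regardless of how a given pandas version infers string columns; that choice is an assumption of this example, not something from the original notebook.

```python
import pandas as pd

# Hypothetical data; Churn is forced to object dtype for this example
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "Churn": pd.Series(["No", "No", "Yes", "No"], dtype=object),
})

# Numerical columns: count, mean, std, min, quartiles, max
numeric_summary = df.describe()

# Object columns: count, unique, top (most frequent value), freq
object_summary = df.describe(include=["object"])

print(numeric_summary)
print(object_summary)
```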

Distribution of Churn variable: .value_counts() gives the number of rows for each value of Churn across the whole data set. Setting the normalize argument to True returns the proportion of each value instead of the raw count.
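For illustration, on a made-up Churn column the counts and proportions look like this.

```python
import pandas as pd

# Hypothetical Churn values
df = pd.DataFrame({"Churn": ["No", "No", "Yes", "No"]})

counts = df["Churn"].value_counts()                # rows per value
shares = df["Churn"].value_counts(normalize=True)  # fraction per value

print(counts)
print(shares)
```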

Unique payment methods: To see which payment methods customers use, and how many distinct ones there are, we can use the .unique() and .nunique() methods.
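A short sketch, assuming the payment method is stored in a column called PaymentMethod (the name and the values are hypothetical).

```python
import pandas as pd

# Hypothetical payment methods
df = pd.DataFrame({
    "PaymentMethod": [
        "Electronic check", "Mailed check", "Electronic check", "Credit card",
    ]
})

methods = df["PaymentMethod"].unique()     # distinct values, in order of appearance
n_methods = df["PaymentMethod"].nunique()  # how many distinct values

print(methods)
print(n_methods)
```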

Data Transformations: We can easily perform arithmetic such as addition, multiplication and division between columns. We have to be careful, though: the result is a new Series, and the data set itself does not change until we assign that result to a column. For example, multiplying the tenure of each customer by the monthly charges gives the total charge that they have paid.

Retrieving and Indexing data: pandas offers several ways to get values out of the data set, by column name, by row label or by position.
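The tenure times monthly charges calculation could look like the sketch below. The column names (tenure, MonthlyCharges, TotalPaid) and the numbers are assumptions made for the example.

```python
import pandas as pd

# Hypothetical tenure (months) and monthly charges
df = pd.DataFrame({
    "tenure": [2, 10, 4],
    "MonthlyCharges": [10.0, 20.5, 30.0],
})

# Element-wise product; this returns a new Series and leaves df untouched
total = df["tenure"] * df["MonthlyCharges"]
print("TotalPaid" in df.columns)  # False: nothing has been stored yet

# Assigning the result is what actually adds the column
df["TotalPaid"] = total
print(df)
```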

df.set_index('CustomerID', inplace=True) sets the customer ID as the index, which makes more sense than an auto-generated id. We can use this index to retrieve other values related to that customer.
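A minimal sketch of index-based retrieval, using made-up customer IDs. It shows label lookup with .loc, positional lookup with .iloc and a boolean filter; these are common choices rather than necessarily the ones used in the original notebook.

```python
import pandas as pd

# Hypothetical customers
df = pd.DataFrame({
    "CustomerID": ["C001", "C002", "C003"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})
df.set_index("CustomerID", inplace=True)

row = df.loc["C002"]                       # whole row, by index label
charge = df.loc["C002", "MonthlyCharges"]  # single value, by label and column
first = df.iloc[0]                         # whole row, by position
long_tenure = df[df["tenure"] > 12]        # rows matching a condition

print(charge)
print(long_tenure)
```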

Sorting
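Sorting is usually done with sort_values(). The sketch below, on made-up data, sorts by one column and then by two columns with different directions.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "Churn": ["No", "Yes", "No", "Yes"],
    "MonthlyCharges": [29.85, 70.70, 56.95, 53.85],
})

# Highest monthly charges first
by_charge = df.sort_values(by="MonthlyCharges", ascending=False)

# Churn ascending, then monthly charges descending within each Churn value
by_both = df.sort_values(by=["Churn", "MonthlyCharges"], ascending=[True, False])

print(by_charge)
print(by_both)
```

Like the arithmetic above, sort_values() returns a new DataFrame and leaves df in its original order.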

Handling Missing Values

Finding missing values:
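One common way to find missing values is isna(), sketched here on a made-up frame with a few gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, np.nan, np.nan],
    "Churn": ["No", None, "Yes"],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```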

Filling missing values:
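A sketch of fillna() on made-up data. Filling a numeric column with its mean and a text column with a placeholder are illustrative choices, not a recommendation for the churn data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
df = pd.DataFrame({
    "MonthlyCharges": [20.0, np.nan, 40.0],
    "Churn": ["No", None, "Yes"],
})

# A dict maps each column to its own fill value
filled = df.fillna({
    "MonthlyCharges": df["MonthlyCharges"].mean(),  # mean ignores missing values
    "Churn": "Unknown",
})

print(filled)
```

fillna() returns a new DataFrame here; df itself still contains the missing values.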

Dropping missing values: dropna() drops rows by default; by specifying axis=1 we can drop columns instead. The how argument specifies when a row or column is dropped. how='any' drops a row (or column) if it contains any missing value. how='all' drops it only if all of its values are missing.
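The axis and how arguments can be combined as in this sketch. The frame is made up so that one row and one column are entirely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 and the Notes column are entirely missing
df = pd.DataFrame({
    "tenure": [1, 34, np.nan, 5],
    "MonthlyCharges": [29.85, np.nan, np.nan, 50.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan],
})

rows_all = df.dropna(how="all")          # drop rows where every value is missing
cols_all = df.dropna(axis=1, how="all")  # drop columns where every value is missing

# Drop the empty column first, then any row that still has a gap
clean = df.dropna(axis=1, how="all").dropna(how="any")

print(clean)
```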

Grouping: Grouping is one of the most important steps in data analysis and provides a lot of important insights. In pandas, grouping is done with the .groupby() method, where we specify the column by which we want to group the data. We then use .agg() to specify the operations (min, max, mean, etc.) that we want to perform on the grouped data. We can also group by multiple columns.
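A minimal grouping sketch on made-up data, assuming columns named Churn, Partner, tenure and MonthlyCharges.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "Churn": ["No", "Yes", "No", "Yes"],
    "Partner": ["Yes", "No", "Yes", "No"],
    "tenure": [10, 2, 30, 4],
    "MonthlyCharges": [20.0, 80.0, 40.0, 60.0],
})

# One grouping column, several aggregations
by_churn = df.groupby("Churn")["MonthlyCharges"].agg(["min", "max", "mean"])

# Two grouping columns, one aggregation
by_two = df.groupby(["Churn", "Partner"])["tenure"].mean()

print(by_churn)
print(by_two)
```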

Pivot tables are another tool that can be used extensively for analysis. They support so many operations that it is difficult to cover all of them here. A simple use is a pivot table built from the partner and customer ID columns.

I hope this small tutorial helps you get started with pandas. This notebook was part of a meetup presentation that Phaneendra and I did over the summer.
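The original pivot table is not reproduced here, so the following is only a guess at a comparable one: it counts customer IDs per Partner value, split by Churn. The column names, the data and the choice of count as the aggregation are all assumptions.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "CustomerID": ["C001", "C002", "C003", "C004", "C005"],
    "Partner": ["Yes", "No", "Yes", "No", "Yes"],
    "Churn": ["No", "Yes", "No", "No", "Yes"],
})

# Rows: Partner, columns: Churn, cells: number of customers
table = pd.pivot_table(
    df,
    index="Partner",
    columns="Churn",
    values="CustomerID",
    aggfunc="count",
    fill_value=0,
)

print(table)
```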
