Issue #5: Pandas in Depth - Building Flexible DataFrames
When you work on machine learning projects, you will handle large amounts of data and perform many complicated operations on it, such as merging, sorting, and normalizing.
There are a few tools that can help you accomplish this, but Pandas stands out as the best. When you are building machine learning models, it is highly probable that you will be using Python. Pandas is a Python library that helps you build flexible data frames.
In this article, we will go through Pandas in detail and look at some of the ways you can manipulate data using Pandas.
If you want to code along with this article, here is a Google Colab notebook with all the code.
What is Pandas?
The name Pandas is derived from “panel data”, an econometrics term for data sets that span multiple time periods. Simply put, Pandas lets you work with large data sets. If you don't know what a data frame is, here is what it looks like.
It looks similar to data in an Excel spreadsheet, but as a machine learning engineer, you would rather use a Python library to work with data on the fly than an Excel sheet.
Here are a few things you can do with Pandas:
Feature Engineering: Build new columns based on existing columns.
Conditional Selection: Select data based on conditions.
Summarize Data: Calculate the mean, median, mode, and other statistics.
Manage Missing Data: Fill or remove missing data from data frames.
Group Data: Perform “groupby” operations on data frames.
In addition to these, Pandas lets you do many other useful operations on data like reading data from files, applying a function to each value in a row, and so on.
Another great feature of Pandas is that it works seamlessly with NumPy. NumPy is a Python library for performing large-scale numerical computations. I wrote a detailed article on NumPy as well, so you can check it out if you are interested.
Now that you know what Pandas is and what it is for, let's write some code.
Working with Pandas
Let's first import Pandas and NumPy.
import numpy as np
import pandas as pd
Now let’s declare some labels and values.
labels = ['a','b','c']
mylist = [10,20,30]
Pandas offers two main data structures: Series and DataFrames. A Series is similar to a one-dimensional array with a labeled index.
pd.Series(mylist,labels)
Output:
a 10
b 20
c 30
dtype: int64
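Because every value in a Series carries a label, you can look values up by label, and arithmetic between two Series lines values up by label rather than by position. Here is a minimal sketch of that behavior; the names s1 and s2 and their values are made up for illustration.

```python
import pandas as pd

# Two small Series whose labels only partly overlap (illustrative values)
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

print(s1["b"])   # look up a single value by its label

# Addition matches values by label; labels present in only one
# Series ("a" and "d") produce NaN in the result
total = s1 + s2
print(total)
```

Labels that exist on only one side end up as NaN, which is why the result is stored as floats.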
A data frame is a two-dimensional table with labeled rows (the index) and labeled columns. Let's define the index and columns and then create a data frame. I'll use NumPy to generate some random values to fill our data frame.
columns = ['W','X','Y','Z']
index = ['A','B','C','D','E']
np.random.seed(42)
data = np.random.randint(-100,100,(5,4))
df = pd.DataFrame(data,index,columns)
print(df)
Output:
W X Y Z
A 2 79 -8 -86
B 6 -29 88 -80
C 2 21 -26 -13
D 16 -1 3 51
E 30 49 -48 -99
Great. Now that you know how to build a data frame and add values to it, let's do some operations on that data.
Slicing data
Let's get the first column as a Series.
print(df['W'])
Output:
A 2
B 6
C 2
D 16
E 30
Let's now get a smaller data frame.
print(df[['W','Z']])
Output:
W Z
A 2 -86
B 6 -80
C 2 -13
D 16 51
E 30 -99
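Square brackets select columns. To select rows, or rows and columns together, Pandas provides the loc (label-based) and iloc (position-based) indexers. The sketch below rebuilds the same data frame from literal values so it runs on its own; the variable names row_b, first_two, cell, and subset are just for illustration.

```python
import pandas as pd

# Same values as the data frame above, written out literally
df = pd.DataFrame(
    {
        "W": [2, 6, 2, 16, 30],
        "X": [79, -29, 21, -1, 49],
        "Y": [-8, 88, -26, 3, -48],
        "Z": [-86, -80, -13, 51, -99],
    },
    index=["A", "B", "C", "D", "E"],
)

row_b = df.loc["B"]                       # one row by label, returned as a Series
first_two = df.iloc[0:2]                  # rows by integer position (A and B)
cell = df.loc["D", "Z"]                   # a single value: row D, column Z
subset = df.loc[["A", "E"], ["W", "Z"]]   # chosen rows and columns together

print(row_b)
print(first_two)
print(cell)
print(subset)
```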
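To delete a column, use the drop function. The 'new' column removed below is the one created in the Feature Engineering section that follows, so run this line only after that column exists; otherwise Pandas raises a KeyError.
df = df.drop('new',axis=1)
By default, axis is 0, which refers to rows. Setting axis=1 tells Pandas to look for the label among the columns instead.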
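To make the axis behavior concrete, here is a small self-contained sketch. It builds its own frame with a 'new' column, so it does not depend on the order in which you run the rest of the article; the names without_col, without_row, and same_thing are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [2, 6, 2], "Y": [-8, 88, -26]},
    index=["A", "B", "C"],
)
df["new"] = df["W"] + df["Y"]

# drop returns a new data frame; df itself is left unchanged
without_col = df.drop("new", axis=1)   # axis=1: look for the label in the columns
without_row = df.drop("A")             # axis=0 (default): look for the label in the rows

# columns= is an equivalent, more explicit spelling of axis=1
same_thing = df.drop(columns="new")

print(without_col)
print(without_row)
```

Since drop does not modify the original by default, you need to assign the result back (as the article does with df = df.drop(...)) if you want the change to stick.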
Feature Engineering
Feature engineering is the concept of building new columns (or features) from existing columns. This is very useful when you are training a machine learning model.
df['new'] = df['W'] + df['Y']
print(df)
Output:
W X Y Z new
A 2 79 -8 -86 -6
B 6 -29 88 -80 94
C 2 21 -26 -13 -24
D 16 -1 3 51 19
E 30 49 -48 -99 -18
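New features are not limited to sums. As a rough sketch, here are a few other common patterns: a ratio of two columns, a label derived from a condition using NumPy's where function, and an absolute value. The column names ratio, x_sign, and y_abs are made up for this example.

```python
import numpy as np
import pandas as pd

features = pd.DataFrame(
    {
        "W": [2, 6, 2, 16, 30],
        "X": [79, -29, 21, -1, 49],
        "Y": [-8, 88, -26, 3, -48],
    },
    index=["A", "B", "C", "D", "E"],
)

# Ratio of two existing columns (W has no zeros here, so division is safe)
features["ratio"] = features["X"] / features["W"]

# Categorical feature derived from a condition
features["x_sign"] = np.where(features["X"] > 0, "positive", "non-positive")

# Element-wise transformation of a single column
features["y_abs"] = features["Y"].abs()

print(features)
```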
Conditional Selection
Now let's look at conditional selection. Pandas lets you specify conditions and filter the data. For example, the line below prints the rows where the value in column X is greater than 0.
print(df[df['X'] > 0])
Output:
W X Y Z
A 2 79 -8 -86
C 2 21 -26 -13
E 30 49 -48 -99
You can also use multiple/complex conditions with Pandas.
print(df[(df['W'] > 0) & (df['Y'] > 1)])
Output:
W X Y Z
B 6 -29 88 -80
D 16 -1 3 51
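Conditions are combined with & (and), | (or), and ~ (not) rather than Python's and, or, and not keywords, and each condition needs its own parentheses because of operator precedence. A short sketch using the same values as above (the result names either, not_positive_x, and picked are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "W": [2, 6, 2, 16, 30],
        "X": [79, -29, 21, -1, 49],
        "Y": [-8, 88, -26, 3, -48],
        "Z": [-86, -80, -13, 51, -99],
    },
    index=["A", "B", "C", "D", "E"],
)

# OR: rows where W is above 10 or Z is positive
either = df[(df["W"] > 10) | (df["Z"] > 0)]

# NOT: rows where X is not greater than 0
not_positive_x = df[~(df["X"] > 0)]

# Filter rows and pick columns in one step with loc
picked = df.loc[df["X"] > 0, ["W", "X"]]

print(either)
print(not_positive_x)
print(picked)
```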
Summarizing Data
Pandas has two useful functions that display additional information about your data frames. The first is the describe function that prints useful statistical information like mean, standard deviation, and other descriptive statistics. You can learn more about the describe function here.
print(df.describe())
Output:
W X Y Z
count 5.00000 5.000000 5.000000 5.000000
mean 11.20000 23.800000 1.800000 -45.400000
std 11.96662 42.109381 51.915316 63.366395
min 2.00000 -29.000000 -48.000000 -99.000000
25% 2.00000 -1.000000 -26.000000 -86.000000
50% 6.00000 21.000000 -8.000000 -80.000000
75% 16.00000 49.000000 3.000000 -13.000000
max 30.00000 79.000000 88.000000 51.000000
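If you only need one or two of these numbers, you can call the individual methods instead of describe. A minimal sketch, again with the data written out literally so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "W": [2, 6, 2, 16, 30],
        "X": [79, -29, 21, -1, 49],
        "Y": [-8, 88, -26, 3, -48],
        "Z": [-86, -80, -13, 51, -99],
    },
    index=["A", "B", "C", "D", "E"],
)

means = df.mean()               # mean of every column
medians = df.median()           # median of every column
w_mode = df["W"].mode()         # most frequent value(s) in column W
extremes = df.agg(["min", "max"])  # several statistics at once

print(means)
print(medians)
print(w_mode)
print(extremes)
```

Note that mode returns a Series rather than a single number, because a column can have more than one most frequent value.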
Another useful function is the info function, which prints information about the data types and memory usage of the data frame. It prints its report directly, so there is no need to wrap it in print().
df.info()
Output:
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 W 5 non-null int64
1 X 5 non-null int64
2 Y 5 non-null int64
3 Z 5 non-null int64
4 indexes 5 non-null object
dtypes: int64(4), object(1)
memory usage: 400.0+ bytes
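The same details are also available as regular Python objects, which is handy when you want to use them in code rather than just read them. A small sketch (the name bytes_per_column is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "W": [2, 6, 2, 16, 30],
        "X": [79, -29, 21, -1, 49],
        "Y": [-8, 88, -26, 3, -48],
        "Z": [-86, -80, -13, 51, -99],
    },
    index=["A", "B", "C", "D", "E"],
)

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column

# Memory used by each column, in bytes, plus an entry for the index
bytes_per_column = df.memory_usage(deep=True)
print(bytes_per_column)
```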
Handling Missing Data
When you are collecting real-world data for building machine learning models, it's likely that there will be a lot of missing data points. With Pandas, you can easily filter out or substitute values for those missing data points.
Let's define a new data frame with some missing data.
new_df = pd.DataFrame({'A':[1,2,np.nan,4],'B':[5,np.nan,np.nan,8],'C':[10,20,30,40]})
print(new_df)
Output:
A B C
0 1.0 5.0 10
1 2.0 NaN 20
2 NaN NaN 30
3 4.0 8.0 40
There are two main ways to handle missing data with Pandas. You can either fill the missing values with new data or drop the rows that contain null values.
To drop the rows with null values, use the dropna() function.
print(new_df.dropna())
Output:
A B C
0 1.0 5.0 10
3 4.0 8.0 40
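Dropping every row that has any null value can throw away a lot of data. The dropna function takes a few parameters that give you finer control; the sketch below shows axis, thresh, and subset on the same new_df (the result names are illustrative).

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [5, np.nan, np.nan, 8], "C": [10, 20, 30, 40]}
)

# Drop columns (instead of rows) that contain any null value
complete_cols = new_df.dropna(axis=1)

# Keep rows that have at least 2 non-null values
at_least_two = new_df.dropna(thresh=2)

# Only consider column B when deciding which rows to drop
b_present = new_df.dropna(subset=["B"])

print(complete_cols)
print(at_least_two)
print(b_present)
```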
Let's try to fill the missing values with data. You can use the fillna() function to do that.
# Fill with zeros
print(new_df.fillna(0))
Output:
A B C
0 1.0 5.0 10
1 2.0 0.0 20
2 0.0 0.0 30
3 4.0 8.0 40
# Fill with the mean of each column
print(new_df.fillna(new_df.mean()))
Output:
A B C
0 1.000000 5.0 10
1 2.000000 6.5 20
2 2.333333 6.5 30
3 4.000000 8.0 40
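You are not limited to one fill value for the whole data frame. As a sketch, fillna also accepts a dictionary that maps column names to fill values, and ffill carries the last known value forward. The choice of median for column A and 0 for column B below is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [5, np.nan, np.nan, 8], "C": [10, 20, 30, 40]}
)

# A different fill value per column
per_column = new_df.fillna({"A": new_df["A"].median(), "B": 0})

# Forward fill: replace each null with the previous non-null value in its column
forward = new_df.ffill()

print(per_column)
print(forward)
```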
Reading from Files
Pandas offers support for reading data from files including CSV, JSON, and others. Let's read some data from a CSV file.
df = pd.read_csv('Universities.csv')
The read_csv function reads a CSV file directly into a Pandas data frame. To quickly view the first 5 rows, you can use the head() function.
df.head()
Output:
Sector ... Geography
0 Private for-profit, 2-year ... Nevada
1 Private for-profit, less-than 2-year ... Nevada
2 Private for-profit, less-than 2-year ... Nevada
3 Private for-profit, less-than 2-year ... Nevada
4 Public, 4-year or above ... Nevada
You can learn more about working with files using Pandas here.
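If you don't have Universities.csv at hand, you can still try read_csv by reading from an in-memory string. The rows below are made-up sample data that only borrow the column names seen in this article; they are not taken from the real file. The sketch also shows the usecols parameter, which loads only the columns you ask for.

```python
import io

import pandas as pd

# Hypothetical sample data, NOT the contents of Universities.csv
csv_text = """Sector,Year,Completions,Geography
"Public, 4-year or above",2015,120,Nevada
"Private for-profit, 2-year",2015,30,Nevada
"Public, 4-year or above",2016,150,Nevada
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

# Load only the columns you need
small = pd.read_csv(io.StringIO(csv_text), usecols=["Year", "Completions"])

print(demo.head())
print(small)
```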
Grouping Data
Let’s group some data now. Grouping is an operation that allows you to split your data into separate groups to perform specific computations. For example, if you have a list of students who went to different universities, you can group the students by those universities.
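Let's use the file that we loaded and group it by year to get the total number of completions. Passing numeric_only=True restricts the sum to numeric columns, so text columns such as Sector are left out of the result.
print(df.groupby('Year').sum(numeric_only=True))
Output: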
Completions
Year
2012 20333
2013 21046
2014 24730
2015 26279
2016 26224
You can group by multiple columns as well.
print(df.groupby(['Year','Sector']).sum(numeric_only=True))
And here is what it looks like when you group by both Year and Sector.
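Here is a small self-contained sketch of the same grouping ideas on made-up data (the values are hypothetical and much smaller than the real file). It selects the Completions column before aggregating and uses agg to compute several statistics per group in one call.

```python
import pandas as pd

# Hypothetical data that mimics the columns used above
demo = pd.DataFrame(
    {
        "Year": [2015, 2015, 2016, 2016],
        "Sector": ["Public", "Private", "Public", "Private"],
        "Completions": [120, 30, 150, 45],
    }
)

# Total completions per year
per_year = demo.groupby("Year")["Completions"].sum()

# Several statistics per year at once
stats = demo.groupby("Year")["Completions"].agg(["sum", "mean", "max"])

# Grouping by two columns gives a result indexed by (Year, Sector) pairs
by_both = demo.groupby(["Year", "Sector"])["Completions"].sum()

print(per_year)
print(stats)
print(by_both)
```

Selecting the column first keeps the result focused on the numbers you care about and avoids aggregating text columns by accident.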
Summary
Pandas is a very useful and powerful library that helps machine learning engineers efficiently work with data. In addition to providing functions that let you manipulate the data, it also offers great tools to handle use cases like missing values and aggregations. Pandas is an invaluable tool in every machine learning engineer’s toolkit.
Hope you enjoyed this article. If you have any questions, let me know in the comments. See you soon with a new topic.