Turing Talks
Issue #5: PanDas in Depth— Building Flexible DataFrames

January 05, 2022

When you are working on machine learning projects, you'll be handling large amounts of data. You will have to perform many complicated operations on that data, such as merging, sorting, and normalizing.

There are a few tools that can help you accomplish this, but Pandas is one of the most widely used. When you are building machine learning models, it is highly probable that you will be using Python. Pandas is a Python library that helps you build flexible data frames.

In this article, we will go through Pandas in detail and look at some of the ways you can manipulate data using Pandas.

If you want to code along with this article, here is a Google Colab notebook with all the code.

What is Pandas?

The name Pandas is derived from “panel data”, a term for multidimensional structured data sets. Simply put, Pandas lets you work with large data sets in Python. If you don't know what a data frame is, here is what one looks like.

It looks similar to data in an Excel spreadsheet, but as a machine learning engineer, you would rather use a Python library to work with data on the fly than an Excel sheet.

Here are a few things you can do with Pandas:

Feature Engineering: Build new columns based on existing columns.
Conditional Selection: Select data based on conditions.
Summarize Data: Calculate mean, median, mode, and other statistical computations.
Manage Missing Data: Fill or remove missing data from data frames.
Group Data: Perform “groupby” operations on data frames.

In addition to these, Pandas lets you do many other useful operations on data like reading data from files, applying a function to each value in a row, and so on.

Another great feature of Pandas is that it works seamlessly with NumPy. NumPy is a Python library for performing large-scale numerical computations. I wrote a detailed article on NumPy as well, so you can check it out if you are interested.
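To make the NumPy connection concrete, here is a minimal sketch. The array values and the column names p and q are made up for illustration; the point is only that a data frame can be built from a NumPy array, passed to a NumPy function, and converted back.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
arr = np.array([[1.0, 4.0], [9.0, 16.0]])
frame = pd.DataFrame(arr, columns=['p', 'q'])

# NumPy functions such as np.sqrt apply element-wise and keep the labels
roots = np.sqrt(frame)
print(roots)

# to_numpy() converts the data frame back into a plain NumPy array
back = roots.to_numpy()
print(type(back))
```

The result of np.sqrt is still a data frame with the same column names, which is what makes mixing the two libraries convenient.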

Now that you know what Pandas is and what it is for, let's write some code.

Working with Pandas

Let's first import Pandas and Numpy.

import numpy as np
import pandas as pd

Now let’s declare some labels and values.

labels = ['a','b','c']
mylist = [10,20,30]

Pandas offers two main data structures: Series and DataFrames. A Series is similar to a one-dimensional array with a labeled index.

pd.Series(mylist, index=labels)
Output:
a    10
b    20
c    30
dtype: int64
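The labeled index is what separates a Series from a plain list. As a small sketch (the second Series, other, is a hypothetical addition for illustration), you can look up values by label, and arithmetic between two Series matches values by label rather than by position.

```python
import pandas as pd

labels = ['a', 'b', 'c']
mylist = [10, 20, 30]
s = pd.Series(mylist, index=labels)

# Look up a single value by its label
print(s['b'])

# A hypothetical second Series whose labels are in a different order
other = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# Addition aligns on labels: 'a' is added to 'a', 'b' to 'b', and so on
total = s + other
print(total)
```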

A data frame is a matrix with named indexes (rows) and columns. Let's define the indexes and columns and then create a data frame. I'll use NumPy to generate some random values to fill our data frame.

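columns = ['W', 'X', 'Y', 'Z']
index = ['A', 'B', 'C', 'D', 'E']
# Seed the random generator so the values are reproducible
np.random.seed(42)
# 5 rows x 4 columns of integers from -100 (inclusive) to 100 (exclusive)
data = np.random.randint(-100, 100, (5, 4))
df = pd.DataFrame(data, index, columns)
print(df)
Output:
    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99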

Great. Now that you know how to build a data frame and add values to it, let's do some operations on that data.

Slicing data

Let's get the first column as a Series.

print(df['W'])
Output:
A     2
B     6
C     2
D    16
E    30
Name: W, dtype: int64

Let's now get a smaller data frame.

print(df[['W', 'Z']])
Output:
    W   Z
A   2 -86
B   6 -80
C   2 -13
D  16  51
E  30 -99
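The examples above select columns. Rows can be selected in a similar way with loc (by label) and iloc (by position). The sketch below rebuilds the data frame from the values printed earlier so that it runs on its own; the variable names are only illustrative.

```python
import pandas as pd

# Rebuild the article's data frame from the values printed earlier
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

row_b = df.loc['B']                       # one row, by label, as a Series
first_two = df.iloc[:2]                   # first two rows, by position
cell = df.loc['D', 'Z']                   # a single value: row D, column Z
block = df.loc[['A', 'E'], ['W', 'X']]    # a sub-frame of rows and columns

print(row_b)
print(first_two)
print(cell)
print(block)
```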

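To delete a column, use the drop() function. The line below removes a column named 'new', which we create in the Feature Engineering section that follows; running it before that column exists raises a KeyError.

df = df.drop('new', axis=1)

By default, axis is 0, which refers to rows. Setting axis=1 makes Pandas look for the label among the columns instead.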
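Here is a small sketch of the difference between the two axes. It rebuilds the data frame from the values printed earlier and adds the 'new' column first so that there is something to drop; the names without_col and without_row are only illustrative.

```python
import pandas as pd

# Rebuild the article's data frame from the values printed earlier
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)
df['new'] = df['W'] + df['Y']

# axis=1: look for the label 'new' among the columns
without_col = df.drop('new', axis=1)

# axis=0 (the default): look for the label 'A' among the rows
without_row = df.drop('A')

# drop() returns a new data frame; df itself still has the column
print(without_col.columns.tolist())
print(without_row.index.tolist())
print('new' in df.columns)
```

Because drop() does not modify the original, the result has to be assigned back (as in df = df.drop(...)) if you want the change to stick.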

Feature Engineering

Feature engineering is the concept of building new columns (or features) from existing columns. This is very useful when you are training a machine learning model.

df['new'] = df['W'] + df['Y']
print(df)
Output:
    W   X   Y   Z  new
A   2  79  -8 -86   -6
B   6 -29  88 -80   94
C   2  21 -26 -13  -24
D  16  -1   3  51   19
E  30  49 -48 -99  -18
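New features are not limited to sums. As a rough sketch, the columns ratio, x_positive, and label below are hypothetical features invented for illustration, built from the same values printed above.

```python
import numpy as np
import pandas as pd

# Rebuild the article's data frame from the values printed earlier
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# Element-wise division of two columns (W contains no zeros here)
df['ratio'] = df['X'] / df['W']

# A boolean feature from a comparison
df['x_positive'] = df['X'] > 0

# A categorical feature: np.where picks 'high' or 'low' per row
df['label'] = np.where(df['Z'] > 0, 'high', 'low')

print(df)
```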

Conditional Selection

Now let's look at conditional selection. Pandas lets you specify conditions and filter the data. For example, the line below prints the rows where the value in column X is greater than 0.

print(df[df['X'] > 0])
Output:
    W   X   Y   Z
A   2  79  -8 -86
C   2  21 -26 -13
E  30  49 -48 -99

You can also use multiple/complex conditions with Pandas.

print(df[(df['W'] > 0) & (df['Y'] > 1)])
Output:
    W   X   Y   Z
B   6 -29  88 -80
D  16  -1   3  51
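Besides & (and), conditions can be combined with | (or) and negated with ~ (not). Each condition needs its own parentheses, and the Python keywords and / or do not work on whole columns. A short sketch on the same values, with illustrative variable names:

```python
import pandas as pd

# Rebuild the article's data frame from the values printed earlier
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# OR: rows where X is negative or Z is positive
either = df[(df['X'] < 0) | (df['Z'] > 0)]

# NOT: rows where X is not greater than 0
not_x_positive = df[~(df['X'] > 0)]

# isin: rows where W matches any value in a list
in_set = df[df['W'].isin([2, 30])]

print(either)
print(not_x_positive)
print(in_set)
```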

Summarizing Data

Pandas has two useful functions that display additional information about your data frames. The first is the describe function that prints useful statistical information like mean, standard deviation, and other descriptive statistics. You can learn more about the describe function here.

print(df.describe())
Output:
              W          X          Y          Z
count   5.00000   5.000000   5.000000   5.000000
mean   11.20000  23.800000   1.800000 -45.400000
std    11.96662  42.109381  51.915316  63.366395
min     2.00000 -29.000000 -48.000000 -99.000000
25%     2.00000  -1.000000 -26.000000 -86.000000
50%     6.00000  21.000000  -8.000000 -80.000000
75%    16.00000  49.000000   3.000000 -13.000000
max    30.00000  79.000000  88.000000  51.000000
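If you only need one statistic, each one is also available as its own method. A minimal sketch on the same values (variable names are illustrative):

```python
import pandas as pd

# Rebuild the article's data frame from the values printed earlier
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

means = df.mean()              # mean of every column, as a Series
median_w = df['W'].median()    # median of a single column
mode_w = df['W'].mode()        # mode is a Series, since there can be ties

print(means)
print(median_w)
print(mode_w)
```

The values agree with the mean and 50% rows of the describe() output above.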

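Another useful function is the info function, which prints information about the data types and memory usage of the data frame. It prints its report directly, so there is no need to wrap it in print(). Note that the output below lists a fifth column, indexes, of type object, in addition to W, X, Y, and Z.

df.info()
Output: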
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   W        5 non-null      int64
 1   X        5 non-null      int64
 2   Y        5 non-null      int64
 3   Z        5 non-null      int64
 4   indexes  5 non-null      object
dtypes: int64(4), object(1)
memory usage: 400.0+ bytes

Handling Missing Data

When you are collecting real-world data for building machine learning models, it's likely that there will be a lot of missing data points. With Pandas, you can easily filter out or substitute values for those missing data points.

Let's define a new data frame with some missing data.

new_df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [10, 20, 30, 40]})
print(new_df)
Output:
     A    B   C
0  1.0  5.0  10
1  2.0  NaN  20
2  NaN  NaN  30
3  4.0  8.0  40

There are two main ways to handle missing data with Pandas. You can either fill the missing values with new data or drop the null values.

To drop the rows with null values, use the dropna() function.

print(new_df.dropna())
Output:
     A    B   C
0  1.0  5.0  10
3  4.0  8.0  40

Let's try to fill the missing values with data. You can use the fillna() function to do that.

# Fill with zeros
print(new_df.fillna(0))
Output:
     A    B   C
0  1.0  5.0  10
1  2.0  0.0  20
2  0.0  0.0  30
3  4.0  8.0  40

# Fill each column with that column's mean value
print(new_df.fillna(new_df.mean()))
Output:
          A    B   C
0  1.000000  5.0  10
1  2.000000  6.5  20
2  2.333333  6.5  30
3  4.000000  8.0  40
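A few variations are worth knowing. The sketch below uses the same new_df; the choice of fill values (0 for A, the median for B) is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                       'B': [5, np.nan, np.nan, 8],
                       'C': [10, 20, 30, 40]})

# Count the missing values in each column
missing_per_col = new_df.isna().sum()

# axis=1 drops columns (instead of rows) that contain missing values
complete_cols = new_df.dropna(axis=1)

# A dict lets you choose a different fill value for each column
filled = new_df.fillna({'A': 0, 'B': new_df['B'].median()})

print(missing_per_col)
print(complete_cols)
print(filled)
```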

Reading from Files

Pandas offers support for reading data from files including CSV, JSON, and others. Let's read some data from a CSV file.

df = pd.read_csv('Universities.csv')

The read_csv function reads a CSV file directly into a Pandas data frame. To quickly view the first 5 rows, you can use the head() function.

print(df.head())
Output:
                                 Sector  ... Geography
0            Private for-profit, 2-year  ...    Nevada
1  Private for-profit, less-than 2-year  ...    Nevada
2  Private for-profit, less-than 2-year  ...    Nevada
3  Private for-profit, less-than 2-year  ...    Nevada
4               Public, 4-year or above  ...    Nevada

You can learn more about working with files using Pandas here.
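If you don't have Universities.csv at hand, you can try read_csv on an in-memory string. In this sketch the column names follow the ones that appear in this article, but the rows are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical rows; only the column names come from the article
csv_text = """Sector,Year,Completions,Geography
"Public, 4-year or above",2012,100,Nevada
"Private for-profit, 2-year",2012,40,Nevada
"Public, 4-year or above",2013,120,Nevada
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # head(n) returns the first n rows
print(sample.dtypes)    # column types are inferred from the data
```

The quoted Sector values show that commas inside double quotes are kept as part of the field rather than treated as separators.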

Grouping Data

Let’s group some data now. Grouping is an operation that allows you to split your data into separate groups to perform specific computations. For example, if you have a list of students who went to different universities, you can group the students by those universities.

Let's use the file that we loaded and group it by the year to get the sum of completions.

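# numeric_only=True restricts the sum to numeric columns such as Completions;
# recent Pandas versions would otherwise also try to sum the text columns
print(df.groupby('Year').sum(numeric_only=True))
Output: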
      Completions
Year
2012        20333
2013        21046
2014        24730
2015        26279
2016        26224

You can group by multiple columns as well.

print(df.groupby(['Year', 'Sector']).sum(numeric_only=True))

And here is how it looks when you group by both year and Sector.
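To see the shape of a grouped result without the CSV file, here is a sketch on a tiny hand-made data frame. The rows are hypothetical; only the column names match the ones used above.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the article
sample = pd.DataFrame({
    'Year': [2012, 2012, 2013, 2013],
    'Sector': ['Public', 'Private', 'Public', 'Public'],
    'Completions': [100, 40, 120, 30],
})

# Select the column to aggregate, then sum it within each year
per_year = sample.groupby('Year')['Completions'].sum()

# Group by two columns and compute several statistics at once
per_year_sector = sample.groupby(['Year', 'Sector'])['Completions'].agg(['sum', 'mean'])

print(per_year)
print(per_year_sector)
```

Grouping by two columns produces one row per (Year, Sector) combination that actually occurs in the data, so this example yields three rows rather than four.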

Summary

Pandas is a very useful and powerful library that helps machine learning engineers efficiently work with data. In addition to providing functions that let you manipulate the data, it also offers great tools to handle use cases like missing values and aggregations. Pandas is an invaluable tool in every machine learning engineer’s toolkit.

Hope you enjoyed this article. If you have any questions, let me know in the comments. See you soon with a new topic.
