4. Reshape with pandas

This lesson is going to cover some basic ways that pandas can help you reshape your DataFrame. If you work frequently with Excel and VBA, these functions might look pretty familiar.

Exploding!

One helpful way to reshape your DataFrame is through the pandas function explode(). This function “explodes” or expands each element of a list to a row, replicating the index values.

New function

df.explode(): Transform each element of a list-like to a row, replicating index values.

Let’s look at the example below. In this example, we have a DataFrame with two columns, item_identifier and committee_member. The committee_member column contains lists of committee members for each thesis.

If we wanted to evaluate the names individually, but don’t want to lose their relationship to the item identifiers, explode() is a great option.

sampleDataCommitteeMembers (not exploded)

item_identifier

committee_member

thesis_001

Leung, Anthony K. L.|Matunis, Michael J.|Dinglasan, Rhoel R.|Goodlett, David

thesis_002

Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Alexis|Jaffe, Andrew

thesis_003

West, Keith P., Jr.|Talegawkar, Sameera|Fanzo, Jessica C.

thesis_004

Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joann N.|Taylor, Holly A.

sampleDataCommitteeMembers (exploded)

item_identifier

committee_member

thesis_001

Leung, Anthony K. L.

thesis_001

Matunis, Michael J.

thesis_001

Dinglasan, Rhoel R.

thesis_001

Goodlett, David

thesis_002

Leek, Jeffrey T.

thesis_002

Hansen, Kasper D.

thesis_002

Battle, Alexis

thesis_002

Jaffe, Andrew

thesis_003

West, Keith P., Jr.

thesis_003

Talegawkar, Sameera

thesis_003

Fanzo, Jessica C.

thesis_004

Kass, Nancy E.

thesis_004

Saloner, Brendan

thesis_004

Bodurtha, Joann N.

thesis_004

Taylor, Holly A.

Let’s explore the code that makes this possible.

import pandas as pd

filename = 'sampleDataCommitteeMembers.csv'
df = pd.read_csv(filename, header=0)

df['committee_member'] = df['committee_member'].str.split('|')
df = df.explode('committee_member')

print(df.head())
df.to_csv('explodedSampleDataCommitteeMembers.csv', index=False)
  item_identifier      committee_member
0      thesis_001  Leung, Anthony K. L.
0      thesis_001   Matunis, Michael J.
0      thesis_001   Dinglasan, Rhoel R.
0      thesis_001       Goodlett, David
1      thesis_002      Leek, Jeffrey T.

First, we read our DataFrame as the variable df.

Next, we read converted the strings in the committee_member Series into Python lists using the Series.str.split() function.

New function

Series.str.split(): Splits string into a list around given separator/delimiter.

Once Python understands our committee_member Series values as lists, we use the pandas function explode() to expand each element of a list to a row while keeping the element associated with the list’s original identifier.

Finally, we printed a sample of the exploded DataFrame and saved it as a CSV called explodedSampleDataCommitteeMembers.csv.

Pivot!

Before pandas, I had no conceptualization of what pivoting meant. How could you pivot a table? Are you turning it? What is going on?

In this section, we are going to cover two different types of pivot functions offered by pandas and how they can help you reshape your DataFrame.

pivot()

First, let’s talk about the pandas pivot() function, which provides general purpose pivoting for your DataFrame.

New function

pivot(): Reshape data (produce a “pivot” table) based on column values.

Because pivot() has so much flexbility in how it can reshape your DataFrame, I find it helpful to think about what pivot can do rather than focus on a definition.

  • pivot() can transform Series valuesSeries headers/labels

  • pivot() can change what values acts as the DataFrame index

  • pivot() can select what Series values will populate the pivoted DataFrame

Let’s use the example below. Let’s say you want to restructure your spreadsheet so you can easily view the theses by year. You can use the three pivot() parameters of columns, index and values to:

  1. Organize the DataFrame by year

  2. Use item_identifier as the index

  3. Fill the DataFrame with title values

Written as code, this means we want our pivot() function to look like this:

pivot = df.pivot(index='item_identifier', columns='year', values='title')

sampleDataYear (not pivoted)

item_identifier

year

title

degree_grantor

thesis_001

2015

Characterization of the ADP-ribosylated proteome by mass spectrometry

Johns Hopkins University. Bloomberg School of Public Health

thesis_002

2016

Annotation-Agnostic Differential Expression and Binding Analyses

Johns Hopkins University. Bloomberg School of Public Health

thesis_003

2016

Child diet over three seasons in rural Zambia: Assessments of usual nutrient intake adequacy, components of intake variation and dietary diversity score performance

Johns Hopkins University. Bloomberg School of Public Health

thesis_004

2017

Exploring Parental Involvement in Rare Disease Research and Advocacy

Johns Hopkins University. Bloomberg School of Public Health

Here’s what our new DataFrame will look like. As you can see, the years are now acting as the Series labels, the item_identifiers are serving as the index, and values of title have been filled in where applicable.

sampleDataYear (pivoted)

item_identifier

2013

2014

2015

2016

2017

2018

2019

2020

2021

thesis_001

Characterization of the ADP-ribosylated proteome by mass spectrometry

thesis_002

Annotation-Agnostic Differential Expression and Binding Analyses

thesis_003

Child diet over three seasons in rural Zambia: Assessments of usual nutrient intake adequacy, components of intake variation and dietary diversity score performance

thesis_004

Exploring Parental Involvement in Rare Disease Research and Advocacy

You’ll notice that we lost some information doing this pivot – we no longer have the degree_grantor information. That’s because we only selected the title Series to fill in our DataFrame. We could select more columns, but then we will get hierarchical columns within our DataFrame. We won’t get into hierarchical columns in this lesson (or ever, honestly, they are horrible), but it’s another fun layer of complexity to explore if you want.

Let’s explore the code that makes this possible.

import pandas as pd

filename = 'sampleDataYear.csv'
df = pd.read_csv(filename, header=0)

df_p = df.pivot(index='item_identifier', columns='year', values='title')
print(df_p.head())

df_p.to_csv('sampleDataYearPivoted.csv')
year            2013 2014                                               2015  \
item_identifier                                                                
thesis_001       NaN  NaN  Characterization of the ADP-ribosylated proteo...   
thesis_002       NaN  NaN                                                NaN   
thesis_003       NaN  NaN                                                NaN   
thesis_004       NaN  NaN                                                NaN   
thesis_005       NaN  NaN                                                NaN   

year                                                          2016  \
item_identifier                                                      
thesis_001                                                     NaN   
thesis_002       Annotation-Agnostic Differential Expression an...   
thesis_003       Child diet over three seasons in rural Zambia:...   
thesis_004                                                     NaN   
thesis_005                                                     NaN   

year                                                          2017  \
item_identifier                                                      
thesis_001                                                     NaN   
thesis_002                                                     NaN   
thesis_003                                                     NaN   
thesis_004       Exploring Parental Involvement in Rare Disease...   
thesis_005                                                     NaN   

year                                                          2018 2019 2020  \
item_identifier                                                                
thesis_001                                                     NaN  NaN  NaN   
thesis_002                                                     NaN  NaN  NaN   
thesis_003                                                     NaN  NaN  NaN   
thesis_004                                                     NaN  NaN  NaN   
thesis_005       Gut Microbes, Enteropathy and Child Growth: Th...  NaN  NaN   

year            2021  
item_identifier       
thesis_001       NaN  
thesis_002       NaN  
thesis_003       NaN  
thesis_004       NaN  
thesis_005       NaN  

First, we read our DataFrame as the variable df. Next, we pivoted df and created a variable df_p to represent our pivoted DataFrame. We printed out a sample of df_p and then created a CSV of df_p called sampleDataYearPivoted.csv.

pivot_table()

To make this even more fun, pandas has a related function called pivot_table() which is basically a specific-type of pivot() that can aggregate values together according to specified functions.

New function

pd.pivot_table(): Create a spreadsheet-style pivot table as a DataFrame.

Because that’s quite a mouthful, here’s a more casual way to think about the differences:

pivot()

  • reshapes a DataFrame

pivot_table()

  • reshapes a DataFrame

  • squishes or aggregates values together if needed

The pivot_table() function has 10 parameters, 4 of which we are going to focus on here. Some will look familiar from pivot().

  • data: the DataFrame to pivot

  • values: column to aggregate

  • index : column that will act as the index

  • aggfunc: the function applied to values

So let’s take a look at explodedSampleDataCommitteeMembers.csv, which we created through the melt function earlier. What if we wanted a list of all the unique committee_members with all of their associated item_identifiers? With pivot_table(), we can easily reshape the spreadsheet to create a new view of the data.

explodedSampleDataCommitteeMembers (not pivoted)

item_identifier

committee_member

thesis_001

Leung, Anthony K. L.

thesis_001

Matunis, Michael J.

thesis_001

Dinglasan, Rhoel R.

thesis_001

Goodlett, David

thesis_002

Leek, Jeffrey T.

thesis_002

Hansen, Kasper D.

thesis_002

Battle, Alexis

thesis_002

Jaffe, Andrew

thesis_003

West, Keith P., Jr.

thesis_003

Talegawkar, Sameera

thesis_003

Fanzo, Jessica C.

thesis_004

Kass, Nancy E.

thesis_004

Saloner, Brendan

thesis_004

Bodurtha, Joann N.

thesis_004

Taylor, Holly A.

explodedSampleDataCommitteeMembers (pivoted)

committee_member

item_identifier

Achinstein, Sharon

thesis_036

Allan, Bentley B.

thesis_019

Angelini, Alessandro

thesis_049

Arora, Raman

thesis_087|thesis_087

Athreya, Avanti

thesis_072

Bah, Ibrahima

thesis_038

Basu, Amitabh

thesis_038|thesis_050|thesis_083|thesis_086

Here’s a type of pivot_table() I use all the time, which aggregates string values together into a single column using lambda and join().

pivot = pd.pivot_table(df, index='committee_member',
                       values='item_identifier',
                       aggfunc=lambda x: '|'.join(str(v) for v in x))

Let’s explore the code that makes this possible.

import pandas as pd

filename = 'explodedSampleDataCommitteeMembers.csv'
df_1 = pd.read_csv(filename, header=0)

pivot = pd.pivot_table(df_1, index='committee_member',
                       values='item_identifier',
                       aggfunc=lambda x: '|'.join(str(v) for v in x))

df_p = pd.DataFrame(pivot)
df_p = df_p.reset_index()
print(df_p.head())

df_p.to_csv('pivotedByCommitteeMembers.csv', index=False)
       committee_member        item_identifier
0    Achinstein, Sharon             thesis_036
1     Allan, Bentley B.             thesis_019
2  Angelini, Alessandro             thesis_049
3          Arora, Raman  thesis_087|thesis_087
4       Athreya, Avanti             thesis_072

What is lambda?

New function

lambda: An anonymous (unnamed) function that applies arguments to various parameters and returns an expression (outcome).

Let’s take a closer look at what lambda is doing in the example above. In particular, let’s look at the line below.

lambda x: '|'.join(str(v) for v in x)

This line of code is essentially a small function to transform a list into a string. In plain language, this function says:

    For a list named x:
        Loop through x and convert individual list components (named v) into strings
        Convert list x into one string with components separated by '|' (pipes)

If we wrote it out it as a regular function, it would look like this:

def lambda(x):  
   new_x = []
   for v in x:
    v = str(v)
    new_x.append(v)
  '|'.join(new_x)
  x = new_x
  return x

We can also imagine it as a list comprehension.

x = [str(v) for v in x]

Melting!

Another great way to reshape a DataFrame is through the melt() function. This function (which serves as a type of un-pivoting) changes your DataFrame from a “wide” to “long” format, reducing your DataFrame to only two non-identifier columns: a variable column and a value column.

New function

melt(): Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

(For Linked Data geeks: Maybe this is nonsense, but I find it helpful to visualize a melted DataFrame as “What if each row in my spreadsheet was a triple?” )

It’s easiest to understand this through an example.

marylandPlaces (not melted)

local_identifier

name

fast

geonames

viaf

geo_001

Maryland

http://id.worldcat.org/fast/1204739

http://sws.geonames.org/4361885/

http://viaf.org/viaf/151255387

geo_002

Maryland–Annapolis

http://id.worldcat.org/fast/1206088

http://sws.geonames.org/4347242/

http://viaf.org/viaf/123791303

geo_003

Maryland–Baltimore

http://id.worldcat.org/fast/1204292

http://sws.geonames.org/4347778/

http://viaf.org/viaf/235452840

marylandPlaces (melted)

local_identifier

variable

value

geo_001

name

Maryland

geo_001

fast

http://id.worldcat.org/fast/1204739

geo_001

geonames

http://sws.geonames.org/4361885/

geo_001

viaf

http://viaf.org/viaf/151255387

geo_002

name

Maryland–Annapolis

geo_002

fast

http://id.worldcat.org/fast/1206088

geo_002

geonames

http://sws.geonames.org/4347242/

geo_002

viaf

http://viaf.org/viaf/123791303

geo_003

name

Maryland–Baltimore

geo_003

fast

http://id.worldcat.org/fast/1204292

geo_003

geonames

http://sws.geonames.org/4347778/

geo_003

viaf

http://viaf.org/viaf/235452840

Let’s explore the code that makes this possible.

import pandas as pd

filename = 'marylandPlaces.csv'
df_1 = pd.read_csv(filename, header=0)

df_1 = df_1.melt(id_vars=['local_identifier'])
print(df_1.head())

df_1.to_csv('meltedMarylandPlaces.csv')
  local_identifier                     variable  \
0          geo_001                name            
1          geo_002                name            
2          geo_003                name            
3          geo_001  fast                          
4          geo_002  fast                          

                                 value  
0                 Maryland              
1                  Maryland--Annapolis  
2                  Maryland--Baltimore  
3  http://id.worldcat.org/fast/1204739  
4  http://id.worldcat.org/fast/1206088