4. Reshape with pandas
Contents
4. Reshape with pandas¶
This lesson is going to cover some basic ways that pandas can help you reshape your DataFrame
. If you work frequently with Excel and VBA, these functions might look pretty familiar.
Exploding!¶
One helpful way to reshape your DataFrame
is through the pandas function explode()
. This function “explodes” or expands each element of a list to a row, replicating the index values.
New function
df.explode()
: Transform each element of a list-like to a row, replicating index values.
Let’s look at the example below. In this example, we have a DataFrame
with two columns, item_identifier
and committee_member
. The committee_member
column contains lists of committee members for each thesis.
If we wanted to evaluate the names individually, but don’t want to lose their relationship to the item identifiers, explode()
is a great option.
Files
sampleDataCommitteeMembers (not exploded)
item_identifier |
committee_member |
---|---|
thesis_001 |
Leung, Anthony K. L.|Matunis, Michael J.|Dinglasan, Rhoel R.|Goodlett, David |
thesis_002 |
Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Alexis|Jaffe, Andrew |
thesis_003 |
West, Keith P., Jr.|Talegawkar, Sameera|Fanzo, Jessica C. |
thesis_004 |
Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joann N.|Taylor, Holly A. |
sampleDataCommitteeMembers (exploded)
item_identifier |
committee_member |
---|---|
thesis_001 |
Leung, Anthony K. L. |
thesis_001 |
Matunis, Michael J. |
thesis_001 |
Dinglasan, Rhoel R. |
thesis_001 |
Goodlett, David |
thesis_002 |
Leek, Jeffrey T. |
thesis_002 |
Hansen, Kasper D. |
thesis_002 |
Battle, Alexis |
thesis_002 |
Jaffe, Andrew |
thesis_003 |
West, Keith P., Jr. |
thesis_003 |
Talegawkar, Sameera |
thesis_003 |
Fanzo, Jessica C. |
thesis_004 |
Kass, Nancy E. |
thesis_004 |
Saloner, Brendan |
thesis_004 |
Bodurtha, Joann N. |
thesis_004 |
Taylor, Holly A. |
Let’s explore the code that makes this possible.
import pandas as pd
filename = 'sampleDataCommitteeMembers.csv'
df = pd.read_csv(filename, header=0)
df['committee_member'] = df['committee_member'].str.split('|')
df = df.explode('committee_member')
print(df.head())
df.to_csv('explodedSampleDataCommitteeMembers.csv', index=False)
item_identifier committee_member
0 thesis_001 Leung, Anthony K. L.
0 thesis_001 Matunis, Michael J.
0 thesis_001 Dinglasan, Rhoel R.
0 thesis_001 Goodlett, David
1 thesis_002 Leek, Jeffrey T.
First, we read our DataFrame
as the variable df
.
Next, we read converted the strings in the committee_member
Series
into Python lists using the Series.str.split()
function.
New function
Series.str.split()
: Splits string into a list around given separator/delimiter.
Once Python understands our committee_member
Series
values as lists, we use the pandas function explode()
to expand each element of a list to a row while keeping the element associated with the list’s original identifier.
Finally, we printed a sample of the exploded DataFrame and saved it as a CSV called explodedSampleDataCommitteeMembers.csv.
Pivot!¶
Before pandas, I had no conceptualization of what pivoting meant. How could you pivot a table? Are you turning it? What is going on?
In this section, we are going to cover two different types of pivot functions offered by pandas and how they can help you reshape your DataFrame
.
pivot()
¶
First, let’s talk about the pandas pivot()
function, which provides general purpose pivoting for your DataFrame
.
New function
pivot()
: Reshape data (produce a “pivot” table) based on column values.
Because pivot()
has so much flexbility in how it can reshape your DataFrame
, I find it helpful to think about what pivot can do rather than focus on a definition.
pivot()
can transformSeries
values →Series
headers/labelspivot()
can change what values acts as theDataFrame
indexpivot()
can select whatSeries
values will populate the pivotedDataFrame
Let’s use the example below. Let’s say you want to restructure your spreadsheet so you can easily view the theses by year. You can use the three pivot()
parameters of columns
, index
and values
to:
Organize the
DataFrame
by yearUse item_identifier as the index
Fill the
DataFrame
with title values
Files
You can download sampleDataYear.csv here
.
Written as code, this means we want our pivot()
function to look like this:
pivot = df.pivot(index='item_identifier', columns='year', values='title')
sampleDataYear (not pivoted)
item_identifier |
year |
title |
degree_grantor |
---|---|---|---|
thesis_001 |
2015 |
Characterization of the ADP-ribosylated proteome by mass spectrometry |
Johns Hopkins University. Bloomberg School of Public Health |
thesis_002 |
2016 |
Annotation-Agnostic Differential Expression and Binding Analyses |
Johns Hopkins University. Bloomberg School of Public Health |
thesis_003 |
2016 |
Child diet over three seasons in rural Zambia: Assessments of usual nutrient intake adequacy, components of intake variation and dietary diversity score performance |
Johns Hopkins University. Bloomberg School of Public Health |
thesis_004 |
2017 |
Exploring Parental Involvement in Rare Disease Research and Advocacy |
Johns Hopkins University. Bloomberg School of Public Health |
Here’s what our new DataFrame
will look like. As you can see, the years are now acting as the Series labels, the item_identifiers are serving as the index, and values of title have been filled in where applicable.
sampleDataYear (pivoted)
item_identifier |
2013 |
2014 |
2015 |
2016 |
2017 |
2018 |
2019 |
2020 |
2021 |
---|---|---|---|---|---|---|---|---|---|
thesis_001 |
Characterization of the ADP-ribosylated proteome by mass spectrometry |
||||||||
thesis_002 |
Annotation-Agnostic Differential Expression and Binding Analyses |
||||||||
thesis_003 |
Child diet over three seasons in rural Zambia: Assessments of usual nutrient intake adequacy, components of intake variation and dietary diversity score performance |
||||||||
thesis_004 |
Exploring Parental Involvement in Rare Disease Research and Advocacy |
You’ll notice that we lost some information doing this pivot – we no longer have the degree_grantor information. That’s because we only selected the title Series
to fill in our DataFrame
. We could select more columns, but then we will get hierarchical columns within our DataFrame
. We won’t get into hierarchical columns in this lesson (or ever, honestly, they are horrible), but it’s another fun layer of complexity to explore if you want.
Let’s explore the code that makes this possible.
import pandas as pd
filename = 'sampleDataYear.csv'
df = pd.read_csv(filename, header=0)
df_p = df.pivot(index='item_identifier', columns='year', values='title')
print(df_p.head())
df_p.to_csv('sampleDataYearPivoted.csv')
year 2013 2014 2015 \
item_identifier
thesis_001 NaN NaN Characterization of the ADP-ribosylated proteo...
thesis_002 NaN NaN NaN
thesis_003 NaN NaN NaN
thesis_004 NaN NaN NaN
thesis_005 NaN NaN NaN
year 2016 \
item_identifier
thesis_001 NaN
thesis_002 Annotation-Agnostic Differential Expression an...
thesis_003 Child diet over three seasons in rural Zambia:...
thesis_004 NaN
thesis_005 NaN
year 2017 \
item_identifier
thesis_001 NaN
thesis_002 NaN
thesis_003 NaN
thesis_004 Exploring Parental Involvement in Rare Disease...
thesis_005 NaN
year 2018 2019 2020 \
item_identifier
thesis_001 NaN NaN NaN
thesis_002 NaN NaN NaN
thesis_003 NaN NaN NaN
thesis_004 NaN NaN NaN
thesis_005 Gut Microbes, Enteropathy and Child Growth: Th... NaN NaN
year 2021
item_identifier
thesis_001 NaN
thesis_002 NaN
thesis_003 NaN
thesis_004 NaN
thesis_005 NaN
First, we read our DataFrame
as the variable df
. Next, we pivoted df
and created a variable df_p
to represent our pivoted DataFrame
. We printed out a sample of df_p
and then created a CSV of df_p
called sampleDataYearPivoted.csv.
pivot_table()
¶
To make this even more fun, pandas has a related function called pivot_table()
which is basically a specific-type of pivot()
that can aggregate values together according to specified functions.
New function
pd.pivot_table()
: Create a spreadsheet-style pivot table as a DataFrame.
Because that’s quite a mouthful, here’s a more casual way to think about the differences:
pivot()
reshapes a
DataFrame
pivot_table()
reshapes a
DataFrame
squishes or aggregates values together if needed
The pivot_table()
function has 10 parameters, 4 of which we are going to focus on here. Some will look familiar from pivot()
.
data
: theDataFrame
to pivotvalues
: column to aggregateindex
: column that will act as the indexaggfunc
: the function applied to values
So let’s take a look at explodedSampleDataCommitteeMembers.csv, which we created through the melt function earlier. What if we wanted a list of all the unique committee_members
with all of their associated item_identifiers
? With pivot_table()
, we can easily reshape the spreadsheet to create a new view of the data.
Files
You can download explodedSampleDataCommitteeMembers.csv here
.
explodedSampleDataCommitteeMembers (not pivoted)
item_identifier |
committee_member |
---|---|
thesis_001 |
Leung, Anthony K. L. |
thesis_001 |
Matunis, Michael J. |
thesis_001 |
Dinglasan, Rhoel R. |
thesis_001 |
Goodlett, David |
thesis_002 |
Leek, Jeffrey T. |
thesis_002 |
Hansen, Kasper D. |
thesis_002 |
Battle, Alexis |
thesis_002 |
Jaffe, Andrew |
thesis_003 |
West, Keith P., Jr. |
thesis_003 |
Talegawkar, Sameera |
thesis_003 |
Fanzo, Jessica C. |
thesis_004 |
Kass, Nancy E. |
thesis_004 |
Saloner, Brendan |
thesis_004 |
Bodurtha, Joann N. |
thesis_004 |
Taylor, Holly A. |
explodedSampleDataCommitteeMembers (pivoted)
committee_member |
item_identifier |
---|---|
Achinstein, Sharon |
thesis_036 |
Allan, Bentley B. |
thesis_019 |
Angelini, Alessandro |
thesis_049 |
Arora, Raman |
thesis_087|thesis_087 |
Athreya, Avanti |
thesis_072 |
Bah, Ibrahima |
thesis_038 |
Basu, Amitabh |
thesis_038|thesis_050|thesis_083|thesis_086 |
Here’s a type of pivot_table()
I use all the time, which aggregates string values together into a single column using lambda
and join()
.
pivot = pd.pivot_table(df, index='committee_member',
values='item_identifier',
aggfunc=lambda x: '|'.join(str(v) for v in x))
Let’s explore the code that makes this possible.
import pandas as pd
filename = 'explodedSampleDataCommitteeMembers.csv'
df_1 = pd.read_csv(filename, header=0)
pivot = pd.pivot_table(df_1, index='committee_member',
values='item_identifier',
aggfunc=lambda x: '|'.join(str(v) for v in x))
df_p = pd.DataFrame(pivot)
df_p = df_p.reset_index()
print(df_p.head())
df_p.to_csv('pivotedByCommitteeMembers.csv', index=False)
committee_member item_identifier
0 Achinstein, Sharon thesis_036
1 Allan, Bentley B. thesis_019
2 Angelini, Alessandro thesis_049
3 Arora, Raman thesis_087|thesis_087
4 Athreya, Avanti thesis_072
What is lambda?¶
New function
lambda
: An anonymous (unnamed) function that applies arguments to various parameters and returns an expression (outcome).
Let’s take a closer look at what lambda is doing in the example above. In particular, let’s look at the line below.
lambda x: '|'.join(str(v) for v in x)
This line of code is essentially a small function to transform a list into a string. In plain language, this function says:
For a list named x:
Loop through x and convert individual list components (named v) into strings
Convert list x into one string with components separated by '|' (pipes)
If we wrote it out it as a regular function, it would look like this:
def lambda(x):
new_x = []
for v in x:
v = str(v)
new_x.append(v)
'|'.join(new_x)
x = new_x
return x
We can also imagine it as a list comprehension.
x = [str(v) for v in x]
Melting!¶
Another great way to reshape a DataFrame
is through the melt()
function. This function (which serves as a type of un-pivoting) changes your DataFrame
from a “wide” to “long” format, reducing your DataFrame
to only two non-identifier columns: a variable column and a value column.
New function
melt()
: Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
(For Linked Data geeks: Maybe this is nonsense, but I find it helpful to visualize a melted DataFrame
as “What if each row in my spreadsheet was a triple?” )
It’s easiest to understand this through an example.
Files
You can download marylandPlaces.csv here
.
marylandPlaces (not melted)
local_identifier |
name |
fast |
geonames |
viaf |
---|---|---|---|---|
geo_001 |
Maryland |
|||
geo_002 |
Maryland–Annapolis |
|||
geo_003 |
Maryland–Baltimore |
marylandPlaces (melted)
local_identifier |
variable |
value |
---|---|---|
geo_001 |
name |
Maryland |
geo_001 |
fast |
|
geo_001 |
geonames |
|
geo_001 |
viaf |
|
geo_002 |
name |
Maryland–Annapolis |
geo_002 |
fast |
|
geo_002 |
geonames |
|
geo_002 |
viaf |
|
geo_003 |
name |
Maryland–Baltimore |
geo_003 |
fast |
|
geo_003 |
geonames |
|
geo_003 |
viaf |
Let’s explore the code that makes this possible.
import pandas as pd
filename = 'marylandPlaces.csv'
df_1 = pd.read_csv(filename, header=0)
df_1 = df_1.melt(id_vars=['local_identifier'])
print(df_1.head())
df_1.to_csv('meltedMarylandPlaces.csv')
local_identifier variable \
0 geo_001 name
1 geo_002 name
2 geo_003 name
3 geo_001 fast
4 geo_002 fast
value
0 Maryland
1 Maryland--Annapolis
2 Maryland--Baltimore
3 http://id.worldcat.org/fast/1204739
4 http://id.worldcat.org/fast/1206088