3. Merge with pandas
Contents
3. Merge with pandas¶
Pandas has a lot of helpful ways to merge spreadsheets based on identifiers. In this lesson, we are going to learn how to use the pandas function merge()
and the four basic merge types:
left
right
outer
inner
New function
pd.merge()
: Merge DataFrame or named Series objects with a database-style join.
Types of merges¶
First, let’s talk about the type of merges. I find it really helpful to visualize the different types of merges, so we are going to discuss the four different types through examples. In these examples, we will merge two DataFrames
that we are calling frame_1
(the left frame) and frame_2
(the right frame) using the four main types of merges: left, right, outer, and inner.
Files
You can download frame_1.csv here
.
You can download frame_2.csv here
.
Attention
For now, we are just going to focus on the concepts and leave the coding for later in the lesson.
In pandas merges, which mimic traditional database merges in SQL, there will be a spreadsheet (or DataFrame
) we consider the “left” frame and a spreadsheet that we consider the “right” frame.
Selecting a frame to be “right” or “left” is arbitrary, we just have to be consistent as we consider our merges.
frame_1
(the left frame)
title |
subject_id |
---|---|
Overcoming Data Challenges in Machine Translation |
subject_001 |
Experimental Methods Towards Controlling the Glycoform |
subject_002 |
Interactions of DNA Polymerase Theta and Ku70/80 with Oxidative DNA Damage |
subject_003 |
Control and Learning of Dynamics in Human Movement |
subject_004 |
Local Public Health Performance and its Impact on Population Health |
subject_005 |
THz and microwave spectroscopy on strongly correlated conductors |
subject_007 |
Subjectivity in Flux: Youth in Latin American and Latino Literature |
subject_009 |
frame_2
(the right frame)
subject_id |
subject |
---|---|
subject_001 |
Machine learning |
subject_002 |
Biochemistry |
subject_003 |
DNA polymerases |
subject_004 |
Dynamics |
subject_006 |
Surgical emergencies |
subject_007 |
Superconductivity |
subject_008 |
Social conditions |
Left merge¶
A left merge only uses keys (i.e. identifiers) from the left frame. Similar to a SQL left outer join. What identifiers will be in the merged DataFrame
?
In other words, the left merge only matches on the identifiers found in the left frame:
subject_001
subject_002
subject_003
subject_004
subject_005
subject_007
subject_009
Here’s the frames merged together using a left merge.
title |
subject_id |
subject |
---|---|---|
Overcoming Data Challenges in Machine Translation |
subject_001 |
Machine learning |
Experimental Methods Towards Controlling the Glycoform |
subject_002 |
Biochemistry |
Interactions of DNA Polymerase Theta and Ku70/80 with Oxidative DNA Damage |
subject_003 |
DNA polymerases |
Control and Learning of Dynamics in Human Movement |
subject_004 |
Dynamics |
Local Public Health Performance and its Impact on Population Health |
subject_005 |
|
THz and microwave spectroscopy on strongly correlated conductors |
subject_007 |
Superconductivity |
Subjectivity in Flux: Youth in Latin American and Latino Literature |
subject_009 |
Right merge¶
A right merge only uses keys (i.e. identifiers) from the right frame. Similar to a SQL right outer join. What identifiers will be in the merged DataFrame
?
In other words, the right merge only matches on the identifiers found in the right frame:
subject_001
subject_002
subject_003
subject_004
subject_006
subject_007
subject_008
Here’s the frames merged together using a right merge.
title |
subject_id |
subject |
---|---|---|
Overcoming Data Challenges in Machine Translation |
subject_001 |
Machine learning |
Experimental Methods Towards Controlling the Glycoform |
subject_002 |
Biochemistry |
Interactions of DNA Polymerase Theta and Ku70/80 with Oxidative DNA Damage |
subject_003 |
DNA polymerases |
Control and Learning of Dynamics in Human Movement |
subject_004 |
Dynamics |
subject_006 |
Surgical emergencies |
|
THz and microwave spectroscopy on strongly correlated conductors |
subject_007 |
Superconductivity |
subject_008 |
Social conditions |
Outer merge¶
An outer merge uses union of keys from both frames. Similar to a SQL full outer join. In other words, the outer merge combines information from all identifiers in the frames. What identifiers will be in the merged DataFrame
?
All of them!
subject_001
subject_002
subject_003
subject_004
subject_005
subject_006
subject_007
subject_008
subject_009
Here’s the frames merged together using an outer merge.
title |
subject_id |
subject |
---|---|---|
Overcoming Data Challenges in Machine Translation |
subject_001 |
Machine learning |
Experimental Methods Towards Controlling the Glycoform |
subject_002 |
Biochemistry |
Interactions of DNA Polymerase Theta and Ku70/80 with Oxidative DNA Damage |
subject_003 |
DNA polymerases |
Control and Learning of Dynamics in Human Movement |
subject_004 |
Dynamics |
Local Public Health Performance and its Impact on Population Health |
subject_005 |
|
THz and microwave spectroscopy on strongly correlated conductors |
subject_007 |
Superconductivity |
Subjectivity in Flux: Youth in Latin American and Latino Literature |
subject_009 |
|
subject_006 |
Surgical emergencies |
|
subject_008 |
Social conditions |
Inner merge¶
An inner merge uses intersection of keys from both frames. Similar to a SQL inner join. In other words, the inner merge only matches on the identifiers found in both frames. What identifiers will be in the merged DataFrame
?
subject_001
subject_002
subject_003
subject_004
subject_007
Here’s the frames merged together using an inner merge.
title |
subject_id |
subject |
---|---|---|
Overcoming Data Challenges in Machine Translation |
subject_001 |
Machine learning |
Experimental Methods Towards Controlling the Glycoform |
subject_002 |
Biochemistry |
Interactions of DNA Polymerase Theta and Ku70/80 with Oxidative DNA Damage |
subject_003 |
DNA polymerases |
Control and Learning of Dynamics in Human Movement |
subject_004 |
Dynamics |
THz and microwave spectroscopy on strongly correlated conductors |
subject_007 |
Superconductivity |
Here’s another way to visual these types of merges:
Repeating keys and merges¶
Here’s a fun question. What happens during a merge when a key appears more than once in a DataFrame
?
Let’s find out by merging on department_id
for these two example DataFrames
.
Files
You can download frame_3.csv here
.
You can download frame_4.csv here
.
frame_3
(the left frame)
local_id |
name |
department_id |
---|---|---|
person_001 |
Ahuja, Allie |
department_003 |
person_002 |
Petrov, Stella |
department_003 |
person_003 |
Chen, Rachel |
department_002 |
person_004 |
Smith, Amber |
department_004 |
frame_4
(the right frame)
department_id |
department_name |
school |
---|---|---|
department_001 |
Cognitive Science |
Krieger School of Arts & Sciences |
department_002 |
Near Eastern Studies |
Krieger School of Arts & Sciences |
department_003 |
English |
Krieger School of Arts & Sciences |
department_004 |
Public Health |
Bloomberg School of Public Health |
Left merge¶
In the left merge, the identifiers from frame_3
(department_003, department_003, department_002, department_004) are all matched, in order, with the information from frame_4
.
local_id |
name |
department_id |
department_name |
school |
---|---|---|---|---|
person_001 |
Ahuja, Allie |
department_003 |
English |
Krieger School of Arts & Sciences |
person_002 |
Petrov, Stella |
department_003 |
English |
Krieger School of Arts & Sciences |
person_003 |
Chen, Rachel |
department_002 |
Near Eastern Studies |
Krieger School of Arts & Sciences |
person_004 |
Smith, Amber |
department_004 |
Public Health |
Bloomberg School of Public Health |
Right merge¶
In the right merge, the identifiers from frame_4
(department_001, department_002, department_003, department_004) are all matched, in order, with the information from frame_3
. However, you’ll notice that department_003 repeats here too, to capture both of the matches of department_003 in frame_3
.
local_id |
name |
department_id |
department_name |
school |
---|---|---|---|---|
department_001 |
Cognitive Science |
Krieger School of Arts & Sciences |
||
person_003 |
Chen, Rachel |
department_002 |
Near Eastern Studies |
Krieger School of Arts & Sciences |
person_001 |
Ahuja, Allie |
department_003 |
English |
Krieger School of Arts & Sciences |
person_002 |
Petrov, Stella |
department_003 |
English |
Krieger School of Arts & Sciences |
person_004 |
Smith, Amber |
department_004 |
Public Health |
Bloomberg School of Public Health |
Outer merge¶
In the outer merge, the identifiers from frame_3
(department_003, department_003, department_002, department_004) are all matched, in order, with the information from frame_4
. Then, the remaining identifier from frame_4
(department_001) is added to spreadsheet but doesn’t match with anything.
local_id |
name |
department_id |
department_name |
school |
---|---|---|---|---|
person_001 |
Ahuja, Allie |
department_003 |
English |
Krieger School of Arts & Sciences |
person_002 |
Petrov, Stella |
department_003 |
English |
Krieger School of Arts & Sciences |
person_003 |
Chen, Rachel |
department_002 |
Near Eastern Studies |
Krieger School of Arts & Sciences |
person_004 |
Smith, Amber |
department_004 |
Public Health |
Bloomberg School of Public Health |
department_001 |
Cognitive Science |
Krieger School of Arts & Sciences |
Inner merge¶
In the inner merge, the identifiers found in both frame_3
and frame_4
are matched together. This happens to be the same results as the left merge.
local_id |
name |
department_id |
department_name |
school |
---|---|---|---|---|
person_001 |
Ahuja, Allie |
department_003 |
English |
Krieger School of Arts & Sciences |
person_002 |
Petrov, Stella |
department_003 |
English |
Krieger School of Arts & Sciences |
person_003 |
Chen, Rachel |
department_002 |
Near Eastern Studies |
Krieger School of Arts & Sciences |
person_004 |
Smith, Amber |
department_004 |
Public Health |
Bloomberg School of Public Health |
Trying a merge¶
Let’s try a merge using pandas. For this exercise, we are going to use the sampleData_cleaned.csv
that we cleaned up previously and a spreadsheet with name authority URIs for the different Hopkins schools called schoolURIs.csv
.
Files
You can download sampleData_cleaned.csv here
.
You can download schoolURIs.csv here
.
degree_grantor |
viaf |
lcnaf |
---|---|---|
Johns Hopkins University. School of Medicine |
||
Johns Hopkins University. Krieger School of Arts and Sciences |
||
Johns Hopkins University. Whiting School of Engineering |
||
Johns Hopkins University. Bloomberg School of Public Health |
Deciding on the type of merge¶
First, lets decide on the type of merge. If we want a spreadsheet that has all of the information from sampleData_cleaned
with the additional location details from schoolURIs
, what type of merge should we do? What identifier should we use?
If we treat sampleData_cleaned
as the left frame and schoolURIs
as the right frame, we should do a left merge using the degree_grantor
as the key.
Why is this the best option? Let’s think through the results of the other merging options.
Right merge: If there are identifiers in the right frame that aren’t in the left frame, a right merge would include them and their associated data. However, we don’t need this data to meet our goal. Moreover, if there are identifiers in the left frame that aren’t in the right frame, an right merge would exclude them. Since we want to keep all of the data from sampleData_cleaned
, this isn’t a good option.
Outer merge: If there are identifiers in the right frame that aren’t in the left frame, an outer merge would include them and their associated data. However, we don’t need this data to meet our goal.
Inner merge: If there are identifiers in the left frame that aren’t in the right frame, an inner merge would exclude them. However, we want to keep all of the data from sampleData_cleaned
so this isn’t a good option.
Writing the code¶
Let’s look more closely at the merge()
function. The first two parameters are the left and rights objects to merge; this is where we want to put our DataFrames
. The next parameter is the how
parameter, and this is where we indicate what type of merge we want ("left"
, "right"
, "inner"
, or "outer"
). We want how="left"
for this merge.
Next, we need to use the on
parameter to tell the function what identifier (aka column/Series) we want to merge on. We need to set this to degree_grantor
.
Finally, lets set the suffixes. These are suffixes like _x
and _y
that are added to your column names if you have matching column names in your two DataFrames
. So, if you have a Name
column in df_1
and a Name
column in df_2
, the merged DataFrame
will have columns named Name_x
and Name_y
. I like to set these suffixes as _1
and _2
.
Great! Now let’s create a new CSV with our merged data.
import pandas as pd
df_1 = pd.read_csv('sampleData_cleaned.csv')
df_2 = pd.read_csv('schoolURIs.csv')
merged = pd.merge(df_1, df_2, how="left", on="degree_grantor")
print(merged.head())
merged.to_csv('sampleData_cleaned2.csv', index=False)
item_identifier advisor creator \
0 1 Wolberger, Cynthia Daniels, Casey M.
1 2 Fallin, Margaret Daniele Collado Torres, Leonardo
2 3 Neelon, Sara E. Benjamin Caswell, Bess L. L.
3 4 Roter, Debra L. Jamal, Leila
4 5 Sears, Cynthia L Rouhani, Saba
date_issued title \
0 2015-07-31 Characterization of the ADP-ribosylated proteo...
1 2016-07-25 Annotation-Agnostic Differential Expression an...
2 2016-09-07 Child diet over three seasons in rural Zambia:...
3 2017-02-02 Exploring Parental Involvement in Rare Disease...
4 2018-05-23 Gut Microbes, Enteropathy and Child Growth: Th...
degree_discipline \
0 Biochemistry
1 Biostatistics
2 Human Nutrition
3 Bioethics
4 Global Disease Epidemiology & Control
degree_grantor \
0 Johns Hopkins University. Bloomberg School of ...
1 Johns Hopkins University. Bloomberg School of ...
2 Johns Hopkins University. Bloomberg School of ...
3 Johns Hopkins University. Bloomberg School of ...
4 Johns Hopkins University. Bloomberg School of ...
degree_department \
0 Biochemistry and Molecular Biology
1 Biostatistics
2 International Health
3 Health Policy and Management
4 International Health
committee_member \
0 Leung, Anthony K. L.|Matunis, Michael J.|Dingl...
1 Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...
2 West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...
3 Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...
4 Kosek, Margaret N|Dowdy, David W|Sack, David ...
viaf \
0 http://viaf.org/viaf/142356070
1 http://viaf.org/viaf/142356070
2 http://viaf.org/viaf/142356070
3 http://viaf.org/viaf/142356070
4 http://viaf.org/viaf/142356070
lcnaf
0 http://id.loc.gov/authorities/names/nr2001039005
1 http://id.loc.gov/authorities/names/nr2001039005
2 http://id.loc.gov/authorities/names/nr2001039005
3 http://id.loc.gov/authorities/names/nr2001039005
4 http://id.loc.gov/authorities/names/nr2001039005