2. Clean up with pandas
Contents
2. Clean up with pandas¶
Drop blank columns or rows¶
Someone once asked if I could help clean-up an ArchivesSpace export with approximately 8000 blank rows and 70 blank columns randomly throughout the spreadsheet. Anyways, luckily pandas came to the rescue!
In this example, we are first going to drop any blank column or rows from our sampleData.csv
spreadsheet. First, let’s read our CSV as DataFrame
.
import pandas as pd
filename = 'sampleData.csv'
df = pd.read_csv(filename)
print(df.head())
item_identifier advisor creator \
0 1.0 Wolberger, Cynthia Daniels, Casey M.
1 2.0 Fallin, Margaret Daniele Collado Torres, Leonardo
2 3.0 Neelon, Sara E. Benjamin Caswell, Bess L. L.
3 4.0 Roter, Debra L. Jamal, Leila
4 5.0 Sears, Cynthia L Rouhani, Saba
date_issued title \
0 2015-07-31 Characterization of the ADP-ribosylated proteo...
1 2016-07-25 Annotation-Agnostic Differential Expression an...
2 2016-09-07 Child diet over three seasons in rural Zambia:...
3 2017-02-02 Exploring Parental Involvement in Rare Disease...
4 2018-05-23 Gut Microbes, Enteropathy and Child Growth: Th...
degree_discipline \
0 Biochemistry
1 Biostatistics
2 Human Nutrition
3 Bioethics
4 Global Disease Epidemiology & Control
degree_grantor \
0 Johns Hopkins University. Bloomberg School of ...
1 Johns Hopkins University. Bloomberg School of ...
2 Johns Hopkins University. Bloomberg School of ...
3 Johns Hopkins University. Bloomberg School of ...
4 Johns Hopkins University. Bloomberg School of ...
degree_department \
0 Biochemistry and Molecular Biology
1 Biostatistics
2 International Health
3 Health Policy and Management
4 International Health
committee_member contributor_author
0 Leung, Anthony K. L.|Matunis, Michael J.|Dingl... NaN
1 Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale... NaN
2 West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,... NaN
3 Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan... NaN
4 Kosek, Margaret N|Dowdy, David W|Sack, David ... NaN
Next, we are going to use a pandas function called dropna()
to remove the blank columns and rows.
New function
df.dropna()
: Removes missing values from the DataFrame.
In pandas, ‘na’ refers to missing values in a DataFrame
or Series.
This includes:
blanks (This is the equivalent of a blank cell in a spreadsheet.)
None
NaN
ornan
(This means Not a Number.)
‘na’ does not include:
False
""
(empty strings)0
In this script, we are using the function isna()
to show us which values are ‘na’. If the value is ‘na’, we will see True
printed. If the value is not ‘na’, we will see False
printed. There is also a function called notna()
which returns the inverse values of isna()
.
New function
pd.isna()
: Detects missing values for an array-like object.
New function
pd.notna()
: Detects non-missing values for an array-like object.
print(df.isna())
item_identifier advisor creator date_issued title degree_discipline \
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
.. ... ... ... ... ... ...
85 False False False False False False
86 False False False False False False
87 False False False False False False
88 False False False False False False
89 False False False False False False
degree_grantor degree_department committee_member contributor_author
0 False False False True
1 False False False True
2 False False False True
3 False False False True
4 False False False True
.. ... ... ... ...
85 False False False True
86 False False False True
87 False False False True
88 False False False True
89 False False False True
[90 rows x 10 columns]
While this is helpful, it doesn’t show all of our rows. To see all of our data, let’s loop through the rows in our DataFrame
. To do this, we will use a function called iterrows()
. For each loop, this function returns the index
and the row
as a Series
.
New function
df.iterrows()
: Iterates over DataFrame rows as (index, Series) pairs.
for index, row in df.iterrows():
print(row.notna())
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 0, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 1, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 2, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 3, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 4, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 5, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 6, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 7, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 8, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 9, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 10, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 11, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 12, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 13, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 14, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 15, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 16, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 17, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 18, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 19, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 20, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 21, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 22, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 23, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 24, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 25, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 26, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 27, dtype: bool
item_identifier True
advisor False
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 28, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 29, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 30, dtype: bool
item_identifier True
advisor False
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 31, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 32, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 33, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 34, dtype: bool
item_identifier True
advisor False
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 35, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 36, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 37, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 38, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 39, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 40, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 41, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 42, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 43, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 44, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 45, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 46, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 47, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 48, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 49, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 50, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 51, dtype: bool
item_identifier False
advisor False
creator False
date_issued False
title False
degree_discipline False
degree_grantor False
degree_department False
committee_member False
contributor_author False
Name: 52, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 53, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 54, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 55, dtype: bool
item_identifier True
advisor False
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 56, dtype: bool
item_identifier True
advisor False
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 57, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 58, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 59, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 60, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member False
contributor_author False
Name: 61, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 62, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 63, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 64, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 65, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 66, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 67, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 68, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 69, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 70, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 71, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 72, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 73, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 74, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 75, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 76, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 77, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 78, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 79, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 80, dtype: bool
item_identifier True
advisor False
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 81, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 82, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 83, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 84, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 85, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 86, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 87, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 88, dtype: bool
item_identifier True
advisor True
creator True
date_issued True
title True
degree_discipline True
degree_grantor True
degree_department True
committee_member True
contributor_author False
Name: 89, dtype: bool
This is more helpful! In the snippet above, we see that contributor_author
Series
in rows 0 and 1 are False
or doesn’t have a value. If we look through the rest of our results, it seems like the entire contributor_author
Series
is empty. We’ll want to drop this column. Also, in our terminal results we’ll notice that all of the values in row 51 are False
or empty. We’ll want to drop that row as well.
So in df.dropna()
, we are asking to drop (or remove) all of empty rows and columns from our DataFrame
.
If you look at the function guidelines, you can see there are parameters to customize our use of the function.
First, we can pick what axis we want to df.dropna()
from.
axis=0
refers to the rowsaxis=1
refers to the columns/Series
Second, we can decide when we want to drop a row or a column from a DataFrame
. We can delete a row or column when:
how='any'
(At least one ‘na’ value is in the column or row.)thresh=x
(At least x number of ‘na’ are in the column or row.)how='all'
(All the values in the column or row are ‘na’.)
Since we want to delete only columns and rows that are completely blank or empty, let’s run the function twice with the parameter how='all'
. Once for the blank rows, and once for the blank columns.
df = df.dropna(axis=0, how='all')
df = df.dropna(axis=1, how='all')
print(df.head())
print(df.shape)
item_identifier advisor creator \
0 1.0 Wolberger, Cynthia Daniels, Casey M.
1 2.0 Fallin, Margaret Daniele Collado Torres, Leonardo
2 3.0 Neelon, Sara E. Benjamin Caswell, Bess L. L.
3 4.0 Roter, Debra L. Jamal, Leila
4 5.0 Sears, Cynthia L Rouhani, Saba
date_issued title \
0 2015-07-31 Characterization of the ADP-ribosylated proteo...
1 2016-07-25 Annotation-Agnostic Differential Expression an...
2 2016-09-07 Child diet over three seasons in rural Zambia:...
3 2017-02-02 Exploring Parental Involvement in Rare Disease...
4 2018-05-23 Gut Microbes, Enteropathy and Child Growth: Th...
degree_discipline \
0 Biochemistry
1 Biostatistics
2 Human Nutrition
3 Bioethics
4 Global Disease Epidemiology & Control
degree_grantor \
0 Johns Hopkins University. Bloomberg School of ...
1 Johns Hopkins University. Bloomberg School of ...
2 Johns Hopkins University. Bloomberg School of ...
3 Johns Hopkins University. Bloomberg School of ...
4 Johns Hopkins University. Bloomberg School of ...
degree_department \
0 Biochemistry and Molecular Biology
1 Biostatistics
2 International Health
3 Health Policy and Management
4 International Health
committee_member
0 Leung, Anthony K. L.|Matunis, Michael J.|Dingl...
1 Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...
2 West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...
3 Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...
4 Kosek, Margaret N|Dowdy, David W|Sack, David ...
(89, 9)
Let’s check our terminal results. We should have a DataFrame
with 89 rows and 9 columns.
Drop duplicate rows¶
Another common problem in metadata work is duplicate rows. Sometimes, exports will produce duplicated rows and it can be really laborious to remove these manually.
That’s where drop_duplicates()
comes in. This wonderful little pandas function deletes any rows with the same exact information.
New function
df.drop_duplicates()
: Removes duplicate rows from DataFrame.
But, first let’s remind ourselves what our spreadsheet 'sampleData.csv'
currently looks like.
print(df.head())
print(df.shape)
item_identifier advisor creator \
0 1.0 Wolberger, Cynthia Daniels, Casey M.
1 2.0 Fallin, Margaret Daniele Collado Torres, Leonardo
2 3.0 Neelon, Sara E. Benjamin Caswell, Bess L. L.
3 4.0 Roter, Debra L. Jamal, Leila
4 5.0 Sears, Cynthia L Rouhani, Saba
date_issued title \
0 2015-07-31 Characterization of the ADP-ribosylated proteo...
1 2016-07-25 Annotation-Agnostic Differential Expression an...
2 2016-09-07 Child diet over three seasons in rural Zambia:...
3 2017-02-02 Exploring Parental Involvement in Rare Disease...
4 2018-05-23 Gut Microbes, Enteropathy and Child Growth: Th...
degree_discipline \
0 Biochemistry
1 Biostatistics
2 Human Nutrition
3 Bioethics
4 Global Disease Epidemiology & Control
degree_grantor \
0 Johns Hopkins University. Bloomberg School of ...
1 Johns Hopkins University. Bloomberg School of ...
2 Johns Hopkins University. Bloomberg School of ...
3 Johns Hopkins University. Bloomberg School of ...
4 Johns Hopkins University. Bloomberg School of ...
degree_department \
0 Biochemistry and Molecular Biology
1 Biostatistics
2 International Health
3 Health Policy and Management
4 International Health
committee_member
0 Leung, Anthony K. L.|Matunis, Michael J.|Dingl...
1 Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...
2 West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...
3 Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...
4 Kosek, Margaret N|Dowdy, David W|Sack, David ...
(89, 9)
Awesome! It currently has 89 rows and 9 columns.
Ok, let’s take care of any duplicates! To start, we will check to see if our spreadsheet has duplicate rows by using the pandas function duplicate()
.
New function
df.duplicated()
: Returns boolean Series denoting duplicate rows.
This function returns Series
with the results for each row, identifying any duplicated rows as True
and any unique rows as False
.
Let’s try it out. We are setting the parameter keep
to False
to mark all duplicated rows, but you can also set this to "first"
or "last"
to mark the first or last instance of duplicate row as True
.
Since this function returns a Series
, we can look the results for all of the rows by looping through the Series
(we named it duplicates
) like a simple list using the enumerate()
function.
New function
enumerate()
: Returns a count (starting from zero) and associated value while iterating through a list-like object.
This loop prints the count
(equivalent to our row index) and then the row
value of the Series
.
duplicates = df.duplicated(keep=False)
for count, row in enumerate(duplicates):
print(count, row)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 False
31 False
32 False
33 False
34 False
35 False
36 False
37 False
38 False
39 False
40 False
41 False
42 False
43 False
44 False
45 False
46 False
47 False
48 False
49 False
50 False
51 False
52 False
53 False
54 False
55 False
56 False
57 False
58 False
59 False
60 False
61 False
62 False
63 False
64 False
65 False
66 False
67 False
68 False
69 False
70 False
71 False
72 False
73 False
74 False
75 False
76 False
77 False
78 False
79 False
80 False
81 False
82 False
83 False
84 False
85 True
86 True
87 False
88 False
From this, we can see that row 85 and row 86 are both True
or duplicated. This means we need to drop one of these rows to get rid of our duplicates.
If you were especially curious, you could print out these rows to see their values.
print(df.iloc[85])
print(df.iloc[86])
item_identifier 87.0
advisor Dredze, Mark
creator Benton, Adrian
date_issued 2018-10-25
title Learning Representations of Social Media Users
degree_discipline Computer Science
degree_grantor Johns Hopkins University. Whiting School of En...
degree_department Computer Science
committee_member Arora, Raman|Yarowsky, David|Hovy, Dirk
Name: 86, dtype: object
item_identifier 87.0
advisor Dredze, Mark
creator Benton, Adrian
date_issued 2018-10-25
title Learning Representations of Social Media Users
degree_discipline Computer Science
degree_grantor Johns Hopkins University. Whiting School of En...
degree_department Computer Science
committee_member Arora, Raman|Yarowsky, David|Hovy, Dirk
Name: 87, dtype: object
Yep, these rows are totally the same! So let’s get rid of that pesky extra row using drop_duplicates()
!
df = df.drop_duplicates()
print(df.shape)
(88, 9)
Here you can see that we now have 88 rows, not 89. Hooray!
String handling¶
You can also apply different functions to your DataFrame and your Series
by using string handling. (Here’s a full list of functions from the API documentation).
Let’s try using the following on our DataFrame.
New function
Series.apply()
: Invoke function on values of Series.
New function
Series.str.rstrip()
: Removes trailing characters.
New function
Series.str.zfill(width)
: Pads strings with zeros.
New function
Series.str.strip()
: Strips whitespaces from strings.
print(df['item_identifier'].head())
df['item_identifier'] = df['item_identifier'].apply(str)
print(df['item_identifier'].head())
df['item_identifier'] = df['item_identifier'].str.rstrip('.0')
print(df['item_identifier'].head())
df['item_identifier'] = df['item_identifier'].str.zfill(3)
print(df['item_identifier'].head())
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
Name: item_identifier, dtype: float64
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
Name: item_identifier, dtype: object
0 1
1 2
2 3
3 4
4 5
Name: item_identifier, dtype: object
0 001
1 002
2 003
3 004
4 005
Name: item_identifier, dtype: object
print(df['title'].head())
df['title'] = df['title'].str.strip()
print(df['title'].head())
0 Characterization of the ADP-ribosylated proteo...
1 Annotation-Agnostic Differential Expression an...
2 Child diet over three seasons in rural Zambia:...
3 Exploring Parental Involvement in Rare Disease...
4 Gut Microbes, Enteropathy and Child Growth: Th...
Name: title, dtype: object
0 Characterization of the ADP-ribosylated proteo...
1 Annotation-Agnostic Differential Expression an...
2 Child diet over three seasons in rural Zambia:...
3 Exploring Parental Involvement in Rare Disease...
4 Gut Microbes, Enteropathy and Child Growth: Th...
Name: title, dtype: object
Putting it all together¶
Now let’s make a script that cleans up our spreadsheet and creates a new cleaned up CSV.
New function
df.to_csv()
: Writes the DataFrame to a CSV file.
import pandas as pd
filename = 'sampleData.csv'
df = pd.read_csv(filename)
df = df.dropna(axis=0, how='all')
df = df.dropna(axis=1, how='all')
df = df.drop_duplicates()
df['item_identifier'] = df['item_identifier'].apply(str)
df['item_identifier'] = df['item_identifier'].str.rstrip('.0')
df['item_identifier'] = df['item_identifier'].str.zfill(3)
df['title'] = df['title'].str.strip()
print(df.head())
print(df.shape)
df.to_csv('sampleData_cleaned.csv', index=False)
item_identifier advisor creator \
0 001 Wolberger, Cynthia Daniels, Casey M.
1 002 Fallin, Margaret Daniele Collado Torres, Leonardo
2 003 Neelon, Sara E. Benjamin Caswell, Bess L. L.
3 004 Roter, Debra L. Jamal, Leila
4 005 Sears, Cynthia L Rouhani, Saba
date_issued title \
0 2015-07-31 Characterization of the ADP-ribosylated proteo...
1 2016-07-25 Annotation-Agnostic Differential Expression an...
2 2016-09-07 Child diet over three seasons in rural Zambia:...
3 2017-02-02 Exploring Parental Involvement in Rare Disease...
4 2018-05-23 Gut Microbes, Enteropathy and Child Growth: Th...
degree_discipline \
0 Biochemistry
1 Biostatistics
2 Human Nutrition
3 Bioethics
4 Global Disease Epidemiology & Control
degree_grantor \
0 Johns Hopkins University. Bloomberg School of ...
1 Johns Hopkins University. Bloomberg School of ...
2 Johns Hopkins University. Bloomberg School of ...
3 Johns Hopkins University. Bloomberg School of ...
4 Johns Hopkins University. Bloomberg School of ...
degree_department \
0 Biochemistry and Molecular Biology
1 Biostatistics
2 International Health
3 Health Policy and Management
4 International Health
committee_member
0 Leung, Anthony K. L.|Matunis, Michael J.|Dingl...
1 Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...
2 West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...
3 Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...
4 Kosek, Margaret N|Dowdy, David W|Sack, David ...
(88, 9)
Beautiful! You should have a lovely new spreadsheet called 'sampleData_cleaned.csv'
with 88 rows and 9 columns. ✨