2. Clean up with pandas

Drop blank columns or rows

Someone once asked if I could help clean-up an ArchivesSpace export with approximately 8000 blank rows and 70 blank columns randomly throughout the spreadsheet. Anyways, luckily pandas came to the rescue!

In this example, we are first going to drop any blank column or rows from our sampleData.csv spreadsheet. First, let’s read our CSV as DataFrame.

import pandas as pd
filename = 'sampleData.csv'
df = pd.read_csv(filename)
print(df.head())
   item_identifier                   advisor                   creator  \
0              1.0        Wolberger, Cynthia         Daniels, Casey M.   
1              2.0  Fallin, Margaret Daniele  Collado Torres, Leonardo   
2              3.0  Neelon, Sara E. Benjamin       Caswell, Bess L. L.   
3              4.0           Roter, Debra L.              Jamal, Leila   
4              5.0          Sears, Cynthia L             Rouhani, Saba   

  date_issued                                              title  \
0  2015-07-31  Characterization of the ADP-ribosylated proteo...   
1  2016-07-25  Annotation-Agnostic Differential Expression an...   
2  2016-09-07  Child diet over three seasons in rural Zambia:...   
3  2017-02-02  Exploring Parental Involvement in Rare Disease...   
4  2018-05-23  Gut Microbes, Enteropathy and Child Growth: Th...   

                       degree_discipline  \
0                           Biochemistry   
1                          Biostatistics   
2                        Human Nutrition   
3                              Bioethics   
4  Global Disease Epidemiology & Control   

                                      degree_grantor  \
0  Johns Hopkins University. Bloomberg School of ...   
1  Johns Hopkins University. Bloomberg School of ...   
2  Johns Hopkins University. Bloomberg School of ...   
3  Johns Hopkins University. Bloomberg School of ...   
4  Johns Hopkins University. Bloomberg School of ...   

                    degree_department  \
0  Biochemistry and Molecular Biology   
1                       Biostatistics   
2                International Health   
3        Health Policy and Management   
4                International Health   

                                    committee_member  contributor_author  
0  Leung, Anthony K. L.|Matunis, Michael J.|Dingl...                 NaN  
1  Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...                 NaN  
2  West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...                 NaN  
3  Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...                 NaN  
4  Kosek, Margaret  N|Dowdy, David W|Sack, David ...                 NaN  

Next, we are going to use a pandas function called dropna() to remove the blank columns and rows.

New function

df.dropna(): Removes missing values from the DataFrame.

In pandas, ‘na’ refers to missing values in a DataFrame or Series. This includes:

  • blanks (This is the equivalent of a blank cell in a spreadsheet.)

  • None

  • NaN or nan (This means Not a Number.)

‘na’ does not include:

  • False

  • "" (empty strings)

  • 0

In this script, we are using the function isna() to show us which values are ‘na’. If the value is ‘na’, we will see True printed. If the value is not ‘na’, we will see False printed. There is also a function called notna()which returns the inverse values of isna().

New function

pd.isna(): Detects missing values for an array-like object.

New function

pd.notna(): Detects non-missing values for an array-like object.

print(df.isna())
    item_identifier  advisor  creator  date_issued  title  degree_discipline  \
0             False    False    False        False  False              False   
1             False    False    False        False  False              False   
2             False    False    False        False  False              False   
3             False    False    False        False  False              False   
4             False    False    False        False  False              False   
..              ...      ...      ...          ...    ...                ...   
85            False    False    False        False  False              False   
86            False    False    False        False  False              False   
87            False    False    False        False  False              False   
88            False    False    False        False  False              False   
89            False    False    False        False  False              False   

    degree_grantor  degree_department  committee_member  contributor_author  
0            False              False             False                True  
1            False              False             False                True  
2            False              False             False                True  
3            False              False             False                True  
4            False              False             False                True  
..             ...                ...               ...                 ...  
85           False              False             False                True  
86           False              False             False                True  
87           False              False             False                True  
88           False              False             False                True  
89           False              False             False                True  

[90 rows x 10 columns]

While this is helpful, it doesn’t show all of our rows. To see all of our data, let’s loop through the rows in our DataFrame. To do this, we will use a function called iterrows(). For each loop, this function returns the index and the row as a Series.

New function

df.iterrows(): Iterates over DataFrame rows as (index, Series) pairs.

for index, row in df.iterrows():
    print(row.notna())
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 0, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 1, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 2, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 3, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 4, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 5, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 6, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 7, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 8, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 9, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 10, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 11, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 12, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 13, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 14, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 15, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 16, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 17, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 18, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 19, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 20, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 21, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 22, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 23, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 24, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 25, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 26, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 27, dtype: bool
item_identifier        True
advisor               False
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 28, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 29, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 30, dtype: bool
item_identifier        True
advisor               False
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 31, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 32, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 33, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 34, dtype: bool
item_identifier        True
advisor               False
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 35, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 36, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 37, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 38, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 39, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 40, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 41, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 42, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 43, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 44, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 45, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 46, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 47, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 48, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 49, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 50, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 51, dtype: bool
item_identifier       False
advisor               False
creator               False
date_issued           False
title                 False
degree_discipline     False
degree_grantor        False
degree_department     False
committee_member      False
contributor_author    False
Name: 52, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 53, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 54, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 55, dtype: bool
item_identifier        True
advisor               False
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 56, dtype: bool
item_identifier        True
advisor               False
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 57, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 58, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 59, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 60, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member      False
contributor_author    False
Name: 61, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 62, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 63, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 64, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 65, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 66, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 67, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 68, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 69, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 70, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 71, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 72, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 73, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 74, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 75, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 76, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 77, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 78, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 79, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 80, dtype: bool
item_identifier        True
advisor               False
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 81, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 82, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 83, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 84, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 85, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 86, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 87, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 88, dtype: bool
item_identifier        True
advisor                True
creator                True
date_issued            True
title                  True
degree_discipline      True
degree_grantor         True
degree_department      True
committee_member       True
contributor_author    False
Name: 89, dtype: bool

This is more helpful! In the snippet above, we see that contributor_author Series in rows 0 and 1 are False or doesn’t have a value. If we look through the rest of our results, it seems like the entire contributor_author Series is empty. We’ll want to drop this column. Also, in our terminal results we’ll notice that all of the values in row 51 are False or empty. We’ll want to drop that row as well.

So in df.dropna(), we are asking to drop (or remove) all of empty rows and columns from our DataFrame.

If you look at the function guidelines, you can see there are parameters to customize our use of the function.

First, we can pick what axis we want to df.dropna() from.

  • axis=0 refers to the rows

  • axis=1 refers to the columns/Series

Second, we can decide when we want to drop a row or a column from a DataFrame. We can delete a row or column when:

  • how='any' (At least one ‘na’ value is in the column or row.)

  • thresh=x (At least x number of ‘na’ are in the column or row.)

  • how='all'(All the values in the column or row are ‘na’.)

Since we want to delete only columns and rows that are completely blank or empty, let’s run the function twice with the parameter how='all'. Once for the blank rows, and once for the blank columns.

df = df.dropna(axis=0, how='all')
df = df.dropna(axis=1, how='all')
print(df.head())
print(df.shape)
   item_identifier                   advisor                   creator  \
0              1.0        Wolberger, Cynthia         Daniels, Casey M.   
1              2.0  Fallin, Margaret Daniele  Collado Torres, Leonardo   
2              3.0  Neelon, Sara E. Benjamin       Caswell, Bess L. L.   
3              4.0           Roter, Debra L.              Jamal, Leila   
4              5.0          Sears, Cynthia L             Rouhani, Saba   

  date_issued                                              title  \
0  2015-07-31  Characterization of the ADP-ribosylated proteo...   
1  2016-07-25  Annotation-Agnostic Differential Expression an...   
2  2016-09-07  Child diet over three seasons in rural Zambia:...   
3  2017-02-02  Exploring Parental Involvement in Rare Disease...   
4  2018-05-23  Gut Microbes, Enteropathy and Child Growth: Th...   

                       degree_discipline  \
0                           Biochemistry   
1                          Biostatistics   
2                        Human Nutrition   
3                              Bioethics   
4  Global Disease Epidemiology & Control   

                                      degree_grantor  \
0  Johns Hopkins University. Bloomberg School of ...   
1  Johns Hopkins University. Bloomberg School of ...   
2  Johns Hopkins University. Bloomberg School of ...   
3  Johns Hopkins University. Bloomberg School of ...   
4  Johns Hopkins University. Bloomberg School of ...   

                    degree_department  \
0  Biochemistry and Molecular Biology   
1                       Biostatistics   
2                International Health   
3        Health Policy and Management   
4                International Health   

                                    committee_member  
0  Leung, Anthony K. L.|Matunis, Michael J.|Dingl...  
1  Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...  
2  West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...  
3  Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...  
4  Kosek, Margaret  N|Dowdy, David W|Sack, David ...  
(89, 9)

Let’s check our terminal results. We should have a DataFrame with 89 rows and 9 columns.

Drop duplicate rows

Another common problem in metadata work is duplicate rows. Sometimes, exports will produce duplicated rows and it can be really laborious to remove these manually.

That’s where drop_duplicates() comes in. This wonderful little pandas function deletes any rows with the same exact information.

New function

df.drop_duplicates(): Removes duplicate rows from DataFrame.

But, first let’s remind ourselves what our spreadsheet 'sampleData.csv'currently looks like.

print(df.head())
print(df.shape)
   item_identifier                   advisor                   creator  \
0              1.0        Wolberger, Cynthia         Daniels, Casey M.   
1              2.0  Fallin, Margaret Daniele  Collado Torres, Leonardo   
2              3.0  Neelon, Sara E. Benjamin       Caswell, Bess L. L.   
3              4.0           Roter, Debra L.              Jamal, Leila   
4              5.0          Sears, Cynthia L             Rouhani, Saba   

  date_issued                                              title  \
0  2015-07-31  Characterization of the ADP-ribosylated proteo...   
1  2016-07-25  Annotation-Agnostic Differential Expression an...   
2  2016-09-07  Child diet over three seasons in rural Zambia:...   
3  2017-02-02  Exploring Parental Involvement in Rare Disease...   
4  2018-05-23  Gut Microbes, Enteropathy and Child Growth: Th...   

                       degree_discipline  \
0                           Biochemistry   
1                          Biostatistics   
2                        Human Nutrition   
3                              Bioethics   
4  Global Disease Epidemiology & Control   

                                      degree_grantor  \
0  Johns Hopkins University. Bloomberg School of ...   
1  Johns Hopkins University. Bloomberg School of ...   
2  Johns Hopkins University. Bloomberg School of ...   
3  Johns Hopkins University. Bloomberg School of ...   
4  Johns Hopkins University. Bloomberg School of ...   

                    degree_department  \
0  Biochemistry and Molecular Biology   
1                       Biostatistics   
2                International Health   
3        Health Policy and Management   
4                International Health   

                                    committee_member  
0  Leung, Anthony K. L.|Matunis, Michael J.|Dingl...  
1  Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...  
2  West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...  
3  Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...  
4  Kosek, Margaret  N|Dowdy, David W|Sack, David ...  
(89, 9)

Awesome! It currently has 89 rows and 9 columns.

Ok, let’s take care of any duplicates! To start, we will check to see if our spreadsheet has duplicate rows by using the pandas function duplicate().

New function

df.duplicated(): Returns boolean Series denoting duplicate rows.

This function returns Series with the results for each row, identifying any duplicated rows as True and any unique rows as False.

Let’s try it out. We are setting the parameter keep to False to mark all duplicated rows, but you can also set this to "first" or "last" to mark the first or last instance of duplicate row as True.

Since this function returns a Series, we can look the results for all of the rows by looping through the Series (we named it duplicates) like a simple list using the enumerate() function.

New function

enumerate(): Returns a count (starting from zero) and associated value while iterating through a list-like object.

This loop prints the count (equivalent to our row index) and then the row value of the Series.

duplicates = df.duplicated(keep=False)
for count, row in enumerate(duplicates):
    print(count, row)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 False
31 False
32 False
33 False
34 False
35 False
36 False
37 False
38 False
39 False
40 False
41 False
42 False
43 False
44 False
45 False
46 False
47 False
48 False
49 False
50 False
51 False
52 False
53 False
54 False
55 False
56 False
57 False
58 False
59 False
60 False
61 False
62 False
63 False
64 False
65 False
66 False
67 False
68 False
69 False
70 False
71 False
72 False
73 False
74 False
75 False
76 False
77 False
78 False
79 False
80 False
81 False
82 False
83 False
84 False
85 True
86 True
87 False
88 False

From this, we can see that row 85 and row 86 are both True or duplicated. This means we need to drop one of these rows to get rid of our duplicates.

If you were especially curious, you could print out these rows to see their values.

print(df.iloc[85])
print(df.iloc[86])
item_identifier                                                   87.0
advisor                                                   Dredze, Mark
creator                                                 Benton, Adrian
date_issued                                                 2018-10-25
title                   Learning Representations of Social Media Users
degree_discipline                                     Computer Science
degree_grantor       Johns Hopkins University. Whiting School of En...
degree_department                                     Computer Science
committee_member               Arora, Raman|Yarowsky, David|Hovy, Dirk
Name: 86, dtype: object
item_identifier                                                   87.0
advisor                                                   Dredze, Mark
creator                                                 Benton, Adrian
date_issued                                                 2018-10-25
title                   Learning Representations of Social Media Users
degree_discipline                                     Computer Science
degree_grantor       Johns Hopkins University. Whiting School of En...
degree_department                                     Computer Science
committee_member               Arora, Raman|Yarowsky, David|Hovy, Dirk
Name: 87, dtype: object

Yep, these rows are totally the same! So let’s get rid of that pesky extra row using drop_duplicates()!

df = df.drop_duplicates()

print(df.shape)
(88, 9)

Here you can see that we now have 88 rows, not 89. Hooray!

String handling

You can also apply different functions to your DataFrame and your Series by using string handling. (Here’s a full list of functions from the API documentation).

Let’s try using the following on our DataFrame.

New function

Series.apply(): Invoke function on values of Series.

New function

Series.str.rstrip(): Removes trailing characters.

New function

Series.str.zfill(width): Pads strings with zeros.

New function

Series.str.strip(): Strips whitespaces from strings.

print(df['item_identifier'].head())

df['item_identifier'] = df['item_identifier'].apply(str)
print(df['item_identifier'].head())

df['item_identifier'] = df['item_identifier'].str.rstrip('.0')
print(df['item_identifier'].head())

df['item_identifier'] = df['item_identifier'].str.zfill(3)

print(df['item_identifier'].head())
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
Name: item_identifier, dtype: float64
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
Name: item_identifier, dtype: object
0    1
1    2
2    3
3    4
4    5
Name: item_identifier, dtype: object
0    001
1    002
2    003
3    004
4    005
Name: item_identifier, dtype: object
print(df['title'].head())
df['title'] = df['title'].str.strip()
print(df['title'].head())
0    Characterization of the ADP-ribosylated proteo...
1    Annotation-Agnostic Differential Expression an...
2    Child diet over three seasons in rural Zambia:...
3    Exploring Parental Involvement in Rare Disease...
4    Gut Microbes, Enteropathy and Child Growth: Th...
Name: title, dtype: object
0    Characterization of the ADP-ribosylated proteo...
1    Annotation-Agnostic Differential Expression an...
2    Child diet over three seasons in rural Zambia:...
3    Exploring Parental Involvement in Rare Disease...
4    Gut Microbes, Enteropathy and Child Growth: Th...
Name: title, dtype: object

Putting it all together

Now let’s make a script that cleans up our spreadsheet and creates a new cleaned up CSV.

New function

df.to_csv(): Writes the DataFrame to a CSV file.

import pandas as pd
filename = 'sampleData.csv'
df = pd.read_csv(filename)


df = df.dropna(axis=0, how='all')
df = df.dropna(axis=1, how='all')
df = df.drop_duplicates()
df['item_identifier'] = df['item_identifier'].apply(str)
df['item_identifier'] = df['item_identifier'].str.rstrip('.0')
df['item_identifier'] = df['item_identifier'].str.zfill(3)
df['title'] = df['title'].str.strip()

print(df.head())
print(df.shape)

df.to_csv('sampleData_cleaned.csv', index=False)
  item_identifier                   advisor                   creator  \
0             001        Wolberger, Cynthia         Daniels, Casey M.   
1             002  Fallin, Margaret Daniele  Collado Torres, Leonardo   
2             003  Neelon, Sara E. Benjamin       Caswell, Bess L. L.   
3             004           Roter, Debra L.              Jamal, Leila   
4             005          Sears, Cynthia L             Rouhani, Saba   

  date_issued                                              title  \
0  2015-07-31  Characterization of the ADP-ribosylated proteo...   
1  2016-07-25  Annotation-Agnostic Differential Expression an...   
2  2016-09-07  Child diet over three seasons in rural Zambia:...   
3  2017-02-02  Exploring Parental Involvement in Rare Disease...   
4  2018-05-23  Gut Microbes, Enteropathy and Child Growth: Th...   

                       degree_discipline  \
0                           Biochemistry   
1                          Biostatistics   
2                        Human Nutrition   
3                              Bioethics   
4  Global Disease Epidemiology & Control   

                                      degree_grantor  \
0  Johns Hopkins University. Bloomberg School of ...   
1  Johns Hopkins University. Bloomberg School of ...   
2  Johns Hopkins University. Bloomberg School of ...   
3  Johns Hopkins University. Bloomberg School of ...   
4  Johns Hopkins University. Bloomberg School of ...   

                    degree_department  \
0  Biochemistry and Molecular Biology   
1                       Biostatistics   
2                International Health   
3        Health Policy and Management   
4                International Health   

                                    committee_member  
0  Leung, Anthony K. L.|Matunis, Michael J.|Dingl...  
1  Leek, Jeffrey T.|Hansen, Kasper D.|Battle, Ale...  
2  West, Keith P., Jr.|Talegawkar, Sameera|Fanzo,...  
3  Kass, Nancy E.|Saloner, Brendan|Bodurtha, Joan...  
4  Kosek, Margaret  N|Dowdy, David W|Sack, David ...  
(88, 9)

Beautiful! You should have a lovely new spreadsheet called 'sampleData_cleaned.csv' with 88 rows and 9 columns. ✨