Intro to Python: Pandas for Metadata Transformation and Cleanup
Hosted by Northeast Institutional Repository Day (NIRD)
March 2022

Introductions

Hi all, I’m Michelle Janowiecki (she/her). Since 2019, I’ve worked as metadata librarian at Johns Hopkins University Sheridan Libraries. I currently live in Baltimore with my dog and my husband.

Feel free to connect with me online on GitHub, GitLab, and Twitter.

Please introduce yourself in the chat if you feel comfortable! Share:

  • Your name and pronouns

  • Where you are zooming from

  • What you are hoping to learn

  • Your favorite animal

My favorite animals are manatees!

Overview of pandas

This workshop covers some basics of pandas, an open-source Python library that is very popular for data manipulation and analytics across a wide array of disciplines. Here are some tasks that pandas is great at:

  • Reading and writing data between different formats (CSV, JSON, XML, Excel, SQL, and more)

  • Merging and joining data

  • Reshaping and pivoting data

  • Handling missing data

  • Getting quick overviews of data values

  • Nearly anything related to data analysis and visualization!

Workshop details

Schedule

Introductions and overview
10 minutes

1. Basics of Series and DataFrames
20 minutes

Concepts covered:

  • What is a DataFrame?

  • What is a Series?

  • How to create a DataFrame from a CSV

  • How to use indexes and labels to evaluate and find data

Methods covered:

  • read_csv()

  • head() and tail()

  • columns

  • shape

  • empty

  • loc and iloc

  • at and iat

  • unique() and value_counts()
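
As a preview of the items listed above, here is a minimal sketch. The sample data and column names are made up for illustration and are not the workshop dataset. Note that columns, shape, and empty are attributes, so they are written without parentheses.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """identifier,title,degree
etd-001,Manatee Habitats,PhD
etd-002,Library Metadata,MA
etd-003,Coastal Ecology,PhD
"""

# read_csv() accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)            # first two rows
last_rows = df.tail(1)             # last row
column_names = list(df.columns)    # column labels
n_rows, n_cols = df.shape          # (number of rows, number of columns)
is_empty = df.empty                # True only if the DataFrame holds no data

title_by_label = df.loc[0, "title"]   # select by index label and column label
title_by_position = df.iloc[0, 1]     # select by integer row and column position
degree_at = df.at[2, "degree"]        # single value by label
degree_iat = df.iat[2, 2]             # single value by position

degrees = df["degree"].unique()               # distinct values in a Series
degree_counts = df["degree"].value_counts()   # how often each value appears
```

Each column of the DataFrame, such as df["degree"], is a Series.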

2. How to clean data
20 minutes

Concepts covered:

  • Missing values (‘na’)

  • Duplicates

  • String handling

  • How to write a CSV from a DataFrame

Methods covered:

  • isna() and notna()

  • duplicated()

  • dropna() and drop_duplicates()

  • iterrows()

  • apply(), str.rstrip(), str.zfill(), and str.strip()
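
The cleanup steps listed above might look roughly like this. This is only a sketch on invented data: the column names, the four-digit padding width, and the trailing-period rule are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical messy metadata: stray whitespace, a duplicate row, a missing title.
df = pd.DataFrame({
    "identifier": ["1", "22", "22", "333"],
    "title": ["  Manatee Habitats ", "Library Metadata.", "Library Metadata.", None],
})

missing_mask = df["title"].isna()    # True where the title is missing
present_mask = df["title"].notna()   # the opposite of isna()
duplicate_mask = df.duplicated()     # True for rows that repeat an earlier row

# Drop rows with no title, then drop exact duplicate rows.
cleaned = df.dropna(subset=["title"]).drop_duplicates().copy()

# String handling: trim whitespace, remove a trailing period, pad identifiers.
cleaned["title"] = cleaned["title"].str.strip().str.rstrip(".")
cleaned["identifier"] = cleaned["identifier"].str.zfill(4)

# apply() runs a function on every value in a Series.
cleaned["title_length"] = cleaned["title"].apply(len)

# iterrows() loops over rows as (index, Series) pairs.
for index, row in cleaned.iterrows():
    print(index, row["identifier"], row["title"])

# to_csv() returns the CSV as a string when no file path is given.
csv_text = cleaned.to_csv(index=False)
```

Passing a file name to to_csv() writes the CSV to disk instead of returning a string.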

3. How to merge data by identifiers or strings
20 minutes

Concepts covered:

  • Merging two DataFrames

  • Types of merges (left, right, inner, outer)

Methods covered:

  • merge()
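
To make the four merge types concrete, here is a small sketch using two invented DataFrames that share an author_id column. All names and values are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share the "author_id" identifier.
theses = pd.DataFrame({
    "author_id": ["a1", "a2", "a3"],
    "title": ["Manatee Habitats", "Library Metadata", "Coastal Ecology"],
})
authors = pd.DataFrame({
    "author_id": ["a1", "a2", "a4"],
    "name": ["Author One", "Author Two", "Author Four"],
})

# inner: keep only identifiers found in both tables (a1, a2)
inner = pd.merge(theses, authors, on="author_id", how="inner")

# left: keep every row of theses; a3 gets a missing name
left = pd.merge(theses, authors, on="author_id", how="left")

# right: keep every row of authors; a4 gets a missing title
right = pd.merge(theses, authors, on="author_id", how="right")

# outer: keep identifiers from either table (a1, a2, a3, a4)
outer = pd.merge(theses, authors, on="author_id", how="outer")
```

A left merge is often the safest starting point when you are enriching one main table, because it never drops rows from that table.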

4. How to reshape data
25 minutes

Concepts covered:

  • Exploding

  • Pivoting

  • Melting

  • Aggregating values

Methods covered:

  • explode()

  • pivot()

  • pivot_table()

  • lambda

  • melt()
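
The reshaping ideas above can be sketched as follows. The data, the "|" delimiter, and the column names are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical records with several subjects packed into one delimited string.
df = pd.DataFrame({
    "identifier": ["etd-001", "etd-002"],
    "subjects": ["biology|ecology", "metadata"],
})

# A lambda is a small unnamed function; here it splits each string into a list.
df["subjects"] = df["subjects"].apply(lambda value: value.split("|"))

# Exploding: one row per list element, with the other columns repeated.
exploded = df.explode("subjects")

# Aggregating: pivot_table() groups rows and combines their values,
# here joining the subjects for each identifier back into one string.
subject_table = exploded.pivot_table(
    index="identifier",
    values="subjects",
    aggfunc=lambda values: "|".join(values),
)

# Hypothetical wide table with one column per role.
wide = pd.DataFrame({
    "identifier": ["etd-001", "etd-002"],
    "advisor": ["Advisor A", "Advisor B"],
    "reader": ["Reader C", "Reader D"],
})

# Melting: wide to long, one row per identifier and role.
long = wide.melt(id_vars="identifier", var_name="role", value_name="name")

# Pivoting: long back to wide, one column per role.
wide_again = long.pivot(index="identifier", columns="role", values="name")
```

pivot() expects each index and column pair to appear only once, while pivot_table() can combine repeated pairs with an aggregation function.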

Additional resources and final questions
10 minutes


Workshop tips

  1. If you are lost, other people are lost, and your questions will help us all get less lost! Please ask away by typing in the chat. I’ll periodically pause to answer questions that come up.

  2. Please keep yourself muted unless talking to help us all hear.

  3. Take care of yourself. Whether that’s drinking lots of tea ☕, taking little breaks 💤, or turning off your camera for focusing purposes 💭, I’m totally cool with it.

Workshop tools and data

Tools

Versions

Python and pandas are continuously updated, so it’s important to know what version you need for your script to work. We are using the following versions for this workshop.
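
If you want to see which versions are installed in your own environment, a minimal sketch like this one prints them. It only reports what is installed and does not check against any particular version.

```python
import sys
import pandas as pd

# Version of the Python interpreter running this code, e.g. "3.x.y".
python_version = sys.version.split()[0]

# Version of the installed pandas library.
pandas_version = pd.__version__

print("Python:", python_version)
print("pandas:", pandas_version)
```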

Jupyter Notebook Setup

This workshop runs on Jupyter Notebook, an open-source tool that lets you store and run Python code alongside text, images, and other types of documentation. You can download and run Jupyter Notebook on your own computer, but this instance is hosted on Binder, which allows us to run Python 3 scripts directly in a browser. I chose this setup because it requires minimal installation and works well for teaching, but you don’t have to use Jupyter Notebook to use Python.

Alternative Python setup

If you don’t want to use Jupyter Notebook for your future Python projects, check out this basic setup using Anaconda and Atom.

Data

For this workshop, I’m using slightly altered data from Johns Hopkins University’s Electronic Theses and Dissertations. Please reuse it respectfully.

Supplemental files