Workshop home¶
Intro to Python: Pandas for Metadata Transformation and Cleanup
Hosted by Northeast Institutional Repository Day (NIRD)
March 2022
Introductions¶
Hi all, I’m Michelle Janowiecki (she/her). Since 2019, I’ve worked as metadata librarian at Johns Hopkins University Sheridan Libraries. I currently live in Baltimore with my dog and my husband.
Feel free to connect with me online:
Please feel free to introduce yourself in the chat if you feel comfortable! Share:
Your name and pronouns
Where you are zooming from
What you are hoping to learn
Your favorite animal
My favorite animals are manatees!
Overview of pandas¶
This workshop covers some basics of the pandas library. pandas is an open-source Python library that is very popular for data manipulation and analytics across a wide array of disciplines. Here are some tasks that pandas is great at:
Reading and writing data between different formats (CSV, JSON, XML, Excel, SQL, and more)
Merging and joining data
Reshaping and pivoting data
Handling missing data
Getting quick overviews of data values
Almost anything involving data analysis and visualization!
Workshop details¶
Schedule¶
Introductions and overview
10 minutes
1. Basics of Series and DataFrames
20 minutes
Concepts covered:
What is a DataFrame?
What is a Series?
How to create a DataFrame from a CSV
How to use indexes and labels to evaluate and find data
Methods covered:
read_csv()
head() and tail()
columns
shape
empty
loc and iloc
at and iat
unique() and value_counts()
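As a rough preview of this section, the sketch below runs each item in the list on a tiny table. The CSV text and its column names (identifier, title, year) are made up for illustration and are not the workshop data.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = """identifier,title,year
etd_001,A Study of Manatees,2019
etd_002,Metadata in Practice,2020
etd_003,Repository Workflows,2020
"""

# read_csv() accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first two rows
print(df.tail(1))   # last row
print(df.columns)   # column labels (an attribute, so no parentheses)
print(df.shape)     # (number of rows, number of columns)
print(df.empty)     # True only if the DataFrame holds no data

# loc selects by label; iloc selects by integer position.
first_title = df.loc[0, "title"]
last_year = df.iloc[-1, 2]

# at and iat are the single-value counterparts of loc and iloc.
second_id = df.at[1, "identifier"]
same_id = df.iat[1, 0]

print(df["year"].unique())        # distinct values in a column
print(df["year"].value_counts())  # how often each value appears
```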
2. How to clean data
20 minutes
Concepts covered:
Missing values (‘na’)
Duplicates
String handling
How to write a CSV from a DataFrame
Methods covered:
isna() and notna()
duplicated()
dropna() and drop_duplicates()
iterrows()
apply(), str.rstrip(), str.zfill(), and str.strip()
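A minimal sketch of how these cleanup steps might fit together. The records are invented for illustration, and padding identifiers to five digits is an assumption, not a rule from the workshop data.

```python
import pandas as pd

# Hypothetical records with a missing value, a duplicate row, and untidy strings.
df = pd.DataFrame({
    "identifier": ["7", "42", "42", "105"],
    "title": ["Thesis A  ", "  Thesis B", "  Thesis B", None],
})

missing_titles = df["title"].isna()   # True where a value is missing
present_titles = df["title"].notna()  # the inverse of isna()
repeated_rows = df.duplicated()       # True for rows that repeat an earlier row

# Drop rows with missing values, then drop repeated rows.
cleaned = df.dropna().drop_duplicates().copy()

# Vectorized string methods live under the .str accessor.
cleaned["title"] = cleaned["title"].str.strip()             # trim both ends
cleaned["identifier"] = cleaned["identifier"].str.zfill(5)  # left-pad with zeros

# apply() runs a function on each value of a Series.
cleaned["title_length"] = cleaned["title"].apply(len)

# iterrows() yields (index label, row as a Series) pairs.
for index, row in cleaned.iterrows():
    print(index, row["identifier"], row["title"])

# to_csv() returns the CSV text when no path is given;
# pass a filename instead to write a file.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```

For a few rows iterrows() is fine, but on larger tables the .str methods and apply() are usually the faster choice.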
3. How to merge data by identifiers or strings
20 minutes
Concepts covered:
Merging two DataFrames
Types of merges (left, right, inner, outer)
Methods covered:
merge()
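To make the four merge types concrete, here is a small sketch. Both tables and their columns are hypothetical; the only thing that matters is that they share an identifier column.

```python
import pandas as pd

# Hypothetical tables that share an "identifier" column.
theses = pd.DataFrame({
    "identifier": ["etd_001", "etd_002", "etd_003"],
    "title": ["Thesis A", "Thesis B", "Thesis C"],
})
departments = pd.DataFrame({
    "identifier": ["etd_002", "etd_003", "etd_004"],
    "department": ["History", "Biology", "Physics"],
})

# The how= argument controls which identifiers survive the merge.
inner = pd.merge(theses, departments, on="identifier", how="inner")  # in both tables
left = pd.merge(theses, departments, on="identifier", how="left")    # every row of theses
right = pd.merge(theses, departments, on="identifier", how="right")  # every row of departments
outer = pd.merge(theses, departments, on="identifier", how="outer")  # in either table

for name, frame in [("inner", inner), ("left", left), ("right", right), ("outer", outer)]:
    print(name, len(frame))
```

Rows without a match on the other side are kept in left, right, and outer merges, with the missing columns filled in as NaN.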
4. How to reshape data
25 minutes
Concepts covered:
Exploding
Pivoting
Melting
Aggregating values
Methods covered:
explode()
pivot()
pivot_table()
lambda
melt()
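The reshaping methods can be sketched on two small tables. The data, the "|" delimiter, and the use of groupby() for the final aggregation are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical records with several subjects packed into one cell.
records = pd.DataFrame({
    "identifier": ["etd_001", "etd_002"],
    "subjects": ["Biology|Ecology", "History"],
})

# Use a lambda to split each string into a list, then explode()
# to get one row per subject.
records["subjects"] = records["subjects"].apply(lambda s: s.split("|"))
long_records = records.explode("subjects")

# Aggregating values: rejoin the exploded subjects for each identifier.
rejoined = long_records.groupby("identifier")["subjects"].agg(lambda s: "|".join(s))

# Hypothetical wide table with one column per year.
wide = pd.DataFrame({
    "department": ["Biology", "History"],
    "2020": [3, 1],
    "2021": [4, 2],
})

# melt() turns the year columns into rows (wide to long).
melted = wide.melt(id_vars="department", var_name="year", value_name="count")

# pivot() goes back from long to wide; each index/column pair must be unique.
pivoted = melted.pivot(index="department", columns="year", values="count")

# pivot_table() aggregates as it reshapes, so repeated pairs are allowed.
totals = melted.pivot_table(index="department", values="count", aggfunc="sum")

print(long_records)
print(rejoined)
print(melted)
print(pivoted)
print(totals)
```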
Additional resources and final questions
10 minutes
Workshop tips¶
If you are lost, other people are lost, and your questions will help us all get less lost! Please ask away by typing in the chat. I’ll periodically pause to answer questions that come up.
Please keep yourself muted unless talking to help us all hear.
Take care of yourself. Whether that’s drinking lots of tea ☕, taking little breaks 💤, or turning off your camera for focusing purposes 💭, I’m totally cool with it.
Workshop tools and data¶
Tools¶
Versions
Python and pandas are continuously updated, so it’s important to know what version you need for your script to work. We are using the following versions for this workshop.
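If you want to check which versions are installed in your own environment, one simple approach is sketched below. It only reports what is installed locally and does not state the versions used in the workshop.

```python
import sys

import pandas as pd

# sys.version starts with the version number, followed by build details.
python_version = sys.version.split()[0]
pandas_version = pd.__version__

print("Python:", python_version)
print("pandas:", pandas_version)
```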
Jupyter Notebook Setup
This workshop is being run on Jupyter Notebook, which is an open-source tool that lets you store and run Python code alongside text, images, and other types of documentation. You can download and run Jupyter Notebook on your own computer, but this instance is hosted on Binder, which allows us to run Python 3 scripts directly in a browser. This setup was chosen because it needs minimal setup and works well for teaching, but you don’t have to use Jupyter Notebook to use Python.
Alternative Python set-up
If you don’t want to use Jupyter Notebook for your future Python projects, check out this basic setup using Anaconda and Atom.
Data¶
For this workshop, I’m using slightly altered data from Johns Hopkins University’s Electronic Theses and Dissertations. Please re-use respectfully.
Copyright¶
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.