pandas is a versatile and powerful library for data manipulation in Python. It’s an open-source tool that significantly enhances your ability to work with structured data. pandas is useful for a wide range of professionals, including data scientists, financial analysts, and anyone who needs to organize and analyze data efficiently.
This cheat sheet provides a detailed overview of essential pandas operations and functions.
To start using pandas, import it. Later examples also use NumPy (for np.nan and random data), so import that as well:
import pandas as pd
import numpy as np
DataFrames are the primary data structure in pandas. Here are different ways to create them:
# Creating a simple employee database
employee_data = {
    'Name': ['Alice Wonder', 'Bob Builder', 'Charlie Chaplin', 'Diana Prince'],
    'Department': ['IT', 'HR', 'Marketing', 'Finance'],
    'Salary': [75000, 65000, 78000, 82000],
    'Years of Experience': [3, 5, 2, 7]
}
df_employees = pd.DataFrame(employee_data)
print(df_employees)
This creates a neat table of employee information. You can almost hear HR sighing with relief!
# Creating a product inventory
products = [
    {'name': 'Laptop', 'price': 1200, 'stock': 50},
    {'name': 'Mouse', 'price': 25, 'stock': 100},
    {'name': 'Keyboard', 'price': 50, 'stock': 75},
    {'name': 'Monitor', 'price': 200, 'stock': 30}
]
df_inventory = pd.DataFrame(products)
print(df_inventory)
Perfect for when you have a list of similar items, like products in an inventory.
df = pd.read_csv('filename.csv')
This is how you’d typically load real-world data. It’s like opening a treasure chest of information!
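Since 'filename.csv' is only a placeholder, here is a small runnable sketch that reads from an in-memory buffer instead. The CSV text and column names are made up for illustration; read_csv accepts a file-like object as well as a path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,price,stock
Laptop,1200,50
Mouse,25,100
Keyboard,50,75
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Load only selected columns and set one column's dtype explicitly
df_prices = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['name', 'price'],
    dtype={'price': 'float64'},
)

print(df)
print(df_prices.dtypes)
```

With a real file you would pass the path in place of the io.StringIO object.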
Here are some basic operations you can perform on DataFrames:
print(df_employees.head(3))     # First 3 rows
print(df_employees.tail(2))     # Last 2 rows
df_employees.info()             # DataFrame info (prints directly and returns None)
print(df_employees.describe())  # Summary statistics
These commands give you a quick overview of your data.
# Single column
salaries = df_employees['Salary']
print(salaries)
# Multiple columns
name_and_dept = df_employees[['Name', 'Department']]
print(name_and_dept)
This is how you slice and dice your data. Want just the names? The salaries? You got it!
# Adding a new column
df_employees['Bonus'] = df_employees['Salary'] * 0.1
print(df_employees)
# Removing a column
df_employees_no_exp = df_employees.drop('Years of Experience', axis=1)
print(df_employees_no_exp)
Columns come and go, but the DataFrame remains. It’s data flexibility at its finest!
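The same column operations can also be written without modifying the original DataFrame. This is a minimal sketch on a cut-down copy of the employee table; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice Wonder', 'Bob Builder'],
    'Salary': [75000, 65000],
    'Years of Experience': [3, 5],
})

# assign returns a new DataFrame; the original is left unchanged
df_bonus = df.assign(Bonus=df['Salary'] * 0.1)

# rename a column via a mapping of old name -> new name
df_renamed = df_bonus.rename(columns={'Years of Experience': 'Experience'})

# drop several columns at once with the columns= keyword
df_slim = df_renamed.drop(columns=['Bonus', 'Experience'])

print(df_slim)
```

Because each call returns a new object, the steps can be chained, and df itself still has only its three original columns.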
Here’s where pandas really shines, letting you pick the exact data morsels you want:
# Employees with salary > 70000
high_earners = df_employees[df_employees['Salary'] > 70000]
print(high_earners)
# Employees in IT department with more than 2 years experience
experienced_it = df_employees[(df_employees['Department'] == 'IT') & (df_employees['Years of Experience'] > 2)]
print(experienced_it)
This allows you to select data based on specific conditions.
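A few more ways to express conditions, sketched on a reduced copy of the employee table. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice Wonder', 'Bob Builder', 'Charlie Chaplin', 'Diana Prince'],
    'Department': ['IT', 'HR', 'Marketing', 'Finance'],
    'Salary': [75000, 65000, 78000, 82000],
})

# isin: keep rows whose value appears in a list
it_or_hr = df[df['Department'].isin(['IT', 'HR'])]

# | means "or" and ~ means "not"; wrap each condition in parentheses
low_or_high = df[(df['Salary'] < 70000) | (df['Salary'] > 80000)]
not_it = df[~(df['Department'] == 'IT')]

# query: the same kind of filter written as a string expression
mid_range = df.query('Salary > 70000 and Salary < 80000')

print(mid_range)
```

Python's own and/or/not do not work element-wise on Series, which is why the operators &, | and ~ are used outside of query.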
# loc: label-based selection
print(df_employees.loc[1, 'Name'])  # Get name of the second employee
# iloc: integer position-based selection
print(df_employees.iloc[0, 2])  # Get salary of the first employee
loc and iloc are precise tools for data selection.
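The difference between the two is easier to see when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical string index to make that visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Department': ['IT', 'HR', 'Marketing'], 'Salary': [75000, 65000, 78000]},
    index=['alice', 'bob', 'charlie'],  # hypothetical string labels
)

# loc uses labels; label slices include BOTH endpoints
first_two_by_label = df.loc['alice':'bob', 'Salary']

# iloc uses integer positions; slices exclude the end, like Python lists
first_two_by_position = df.iloc[0:2, 1]

# loc also accepts a boolean mask for the rows
high_departments = df.loc[df['Salary'] > 70000, 'Department']

print(first_two_by_label)
print(high_departments)
```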
In the real world, data often comes with holes. pandas helps you deal with them:
# Let's introduce some missing data
df_employees.loc[1, 'Salary'] = np.nan
df_employees.loc[3, 'Department'] = np.nan
# Check for missing values
print(df_employees.isnull().sum())
# Drop rows with missing values
df_clean = df_employees.dropna()
print(df_clean)
# Fill missing values
df_filled = df_employees.fillna({'Salary': df_employees['Salary'].mean(), 'Department': 'Unknown'})
print(df_filled)
These tools help you manage and clean datasets with missing values.
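Dropping every incomplete row is often too aggressive. As a rough sketch of more targeted options, using a small table with the same two gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice Wonder', 'Bob Builder', 'Charlie Chaplin', 'Diana Prince'],
    'Department': ['IT', 'HR', 'Marketing', np.nan],
    'Salary': [75000, np.nan, 78000, 82000],
})

# Drop rows only when a specific column is missing
has_salary = df.dropna(subset=['Salary'])

# Fill one column with its median instead of its mean
median_filled = df['Salary'].fillna(df['Salary'].median())

# Select the rows that have at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]

print(rows_with_gaps)
```

Whether the mean, the median, or a fixed placeholder is the right fill value depends on the data; the median here is just one option.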
pandas offers various data transformation techniques:
# Sort employees by salary, descending
df_sorted = df_employees.sort_values('Salary', ascending=False)
print(df_sorted)
This allows you to order your data based on specific columns.
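Sorting by more than one column works the same way. The sketch below uses a made-up table in which departments repeat, so the second sort key has something to do.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [75000, 65000, 78000, 82000],
})

# Sort by department A-Z, then by salary high-to-low within each department
by_dept_then_salary = df.sort_values(
    ['Department', 'Salary'], ascending=[True, False]
).reset_index(drop=True)  # renumber rows 0..n-1 after sorting

# nlargest is a shortcut for "top N rows by this column"
top_two = df.nlargest(2, 'Salary')

print(by_dept_then_salary)
print(top_two)
```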
# Average salary by department
avg_salary = df_employees.groupby('Department')['Salary'].mean()
print(avg_salary)
# Multiple aggregations
dept_stats = df_employees.groupby('Department').agg({
    'Salary': ['mean', 'max'],
    'Years of Experience': 'mean'
})
print(dept_stats)
Grouping allows you to perform calculations on subsets of your data.
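The dictionary form of agg produces two-level column names. Named aggregation is an alternative that yields flat, readable ones. A minimal sketch on made-up data with repeated departments:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [75000, 65000, 78000, 82000],
})

# Each keyword is an output column: name=(input column, aggregation)
dept_summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Name', 'count'),
)

print(dept_summary)
```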
# Let's create another DataFrame with department locations
dept_locations = pd.DataFrame({
    'Department': ['IT', 'HR', 'Marketing', 'Finance'],
    'Location': ['Floor 1', 'Floor 2', 'Floor 3', 'Floor 2']
})
# Merge with employee data
df_merged = pd.merge(df_employees, dept_locations, on='Department')
print(df_merged)
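Merging allows you to combine data from different DataFrames on a shared key column. pd.merge defaults to an inner join, so rows without a matching Department (including the one set to missing earlier) are dropped from the result.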
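To keep rows that have no match, choose the join type with how=. The sketch below uses two small made-up tables in which one employee's department ('Legal') has no location.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve'],
    'Department': ['IT', 'HR', 'Legal'],  # 'Legal' has no entry in locations
})
locations = pd.DataFrame({
    'Department': ['IT', 'HR'],
    'Location': ['Floor 1', 'Floor 2'],
})

# Default inner join: only departments present in both tables survive
inner = pd.merge(employees, locations, on='Department')

# Left join: keep every employee; indicator=True adds a '_merge' column
# that records where each row came from
left = pd.merge(employees, locations, on='Department', how='left', indicator=True)

print(inner)
print(left)
```

The '_merge' column is a convenient way to find unmatched rows after a join.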
Here are some advanced pandas techniques:
# Define a function to categorize salaries
def salary_category(salary):
    if pd.isna(salary):
        return 'Unknown'  # a missing salary would otherwise fall through to 'High'
    if salary < 70000:
        return 'Low'
    elif salary < 80000:
        return 'Medium'
    else:
        return 'High'
# Apply the function to create a new column
df_employees['Salary Category'] = df_employees['Salary'].apply(salary_category)
print(df_employees)
This demonstrates how to apply custom functions to your data.
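For simple binning like this, pd.cut is a vectorized alternative to apply. The sketch below assumes the same thresholds (70000 and 80000) and uses a standalone Series rather than the employee table.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([65000, 75000, 78000, 82000, np.nan])

categories = pd.cut(
    salaries,
    bins=[-np.inf, 70000, 80000, np.inf],
    labels=['Low', 'Medium', 'High'],
    right=False,  # intervals are [low, high), so exactly 70000 is 'Medium'
)

print(categories)
```

pd.cut leaves missing values as missing rather than assigning them to a bin.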
# Create a pivot table of average salary by department and salary category
pivot_table = pd.pivot_table(df_employees, values='Salary', index='Department',
                             columns='Salary Category', aggfunc='mean')
print(pivot_table)
Pivot tables are useful for summarizing and analyzing data.
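Combinations that never occur show up as NaN in a pivot table. A small sketch of filling them, using made-up data with a hypothetical 'Level' column:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR', 'HR'],
    'Level': ['Junior', 'Senior', 'Junior', 'Junior'],  # hypothetical column
    'Salary': [60000, 90000, 50000, 54000],
})

# fill_value replaces empty cells (HR has no Senior rows) with 0
pivot = pd.pivot_table(
    df, values='Salary', index='Department', columns='Level',
    aggfunc='mean', fill_value=0,
)

# The same table built with groupby + unstack; empty cells stay NaN here
unstacked = df.groupby(['Department', 'Level'])['Salary'].mean().unstack()

print(pivot)
print(unstacked)
```

Filling with 0 is only sensible when 0 cannot be mistaken for a real average; otherwise leaving NaN is the safer choice.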
# Create a DataFrame with date index
date_range = pd.date_range(start='2025-01-01', end='2025-12-31', freq='D')
time_series = pd.DataFrame({'Value': np.random.randn(len(date_range))}, index=date_range)
# Resample to monthly frequency ('ME' = month end in pandas 2.2+; use 'M' on older versions)
monthly_avg = time_series.resample('ME').mean()
print(monthly_avg)
This shows how to work with time-series data in pandas.
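A date index enables a few other conveniences. The sketch below uses ten days of deterministic values (0 through 9) so the results are easy to check by hand.

```python
import pandas as pd

dates = pd.date_range(start='2025-01-01', periods=10, freq='D')
ts = pd.DataFrame({'Value': range(10)}, index=dates)

# Slice by date strings; both endpoints are included
first_week = ts.loc['2025-01-01':'2025-01-07']

# 3-day rolling mean; the first two rows are NaN (not enough history yet)
rolling = ts['Value'].rolling(window=3).mean()

# Weekly totals; by default weeks end on Sunday
weekly = ts.resample('W').sum()

print(first_week)
print(rolling)
print(weekly)
```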
After processing your data, you might want to save it:
# Export to CSV
df_employees.to_csv('processed_employees.csv', index=False)
# Export to Excel (requires an Excel writer package such as openpyxl)
df_employees.to_excel('employee_report.xlsx', sheet_name='Employee Data')
These commands allow you to save your data in different formats.
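To check an export without touching the disk, you can write to a string and read it back. A minimal sketch with a two-row made-up table:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [75000, 65000]})

# With no path given, to_csv returns the CSV text as a string
csv_text = df.to_csv(index=False)

# Read it back to confirm the round trip preserves the data
restored = pd.read_csv(io.StringIO(csv_text))

# to_json works the same way; orient='records' gives a list of row objects
json_text = df.to_json(orient='records')

print(csv_text)
print(json_text)
```

Passing index=False keeps the row numbers out of the file, which is what makes the round trip come back with the same columns.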
In conclusion, pandas is a comprehensive toolkit for data manipulation and analysis in Python. From importing data to performing complex analyses, it offers a wide range of functionalities. The key to mastering pandas is practice, so it’s recommended that you use it regularly with various datasets.