# Usage of this Document
An introduction to:
- preparation for using the pandas library
- pandas commands & explanations
# Preparations
# Import Module
import the pandas module
import pandas as pd
# Manual
Appending a ? to a method shows an explanation of what that method does. Appending ?? additionally shows (part of) its source code. The built-in help() function prints documentation as well.
df.head?
df.head??
help(pd.DataFrame)
help(df_movies)
# Create DataFrame
create a dataframe from scratch by passing a dict to the pandas DataFrame constructor
- {key : value}
data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)
As you can see, the index of this dataframe was created automatically on creation as the numbers 0-3. We can also supply our own index when initializing the dataframe:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
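A runnable sketch of the constructor call above (the fruit data is the toy example from this section); with a custom index, rows can then be looked up by name:

```python
import pandas as pd

data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2],
}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

# with a custom index, rows can be looked up by label
june = purchases.loc['June']
print(june['apples'])  # 3
```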
# Add specific column-name
- …, columns=…
df_tweets = pd.DataFrame(tweets_lower, columns=["tweets"])
# Read data
It is quite simple to load data from various file formats into a dataframe.
# from CSV
note: a CSV file has no stored index like our dataframes, so pandas generates a numeric one on load
df = pd.read_csv('purchases.csv')
assign index:
df = pd.read_csv('purchases.csv', index_col=0)
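A minimal round-trip sketch of the two calls above (a temp directory stands in for the working directory, and the toy fruit data is assumed):

```python
import os
import tempfile
import pandas as pd

data = {'apples': [3, 2], 'oranges': [0, 3]}
df = pd.DataFrame(data, index=['June', 'Robert'])

# write to CSV, then read it back using the first column as the index
path = os.path.join(tempfile.mkdtemp(), 'purchases.csv')
df.to_csv(path)
df2 = pd.read_csv(path, index_col=0)
print(df2.index.tolist())  # ['June', 'Robert']
```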
# from JSON
a JSON file is structured much like a stored Python dict
- pandas can read this easily
df = pd.read_json('purchases.json')
# from SQL Database
First, establish a connection using a database-specific Python library. Here we use SQLite as the DB; the sqlite3 module ships with Python. For other databases, install the matching driver (e.g. psycopg2 for PostgreSQL).
import sqlite3
con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)
# set name as index
df = df.set_index('index')
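A self-contained sketch of the SQL workflow above, using an in-memory SQLite database in place of "database.db" (the table name and columns here are invented for illustration):

```python
import sqlite3
import pandas as pd

# in-memory SQLite database stands in for "database.db"
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE purchases (name TEXT, apples INTEGER)")
con.executemany("INSERT INTO purchases VALUES (?, ?)",
                [('June', 3), ('Robert', 2)])
con.commit()

df = pd.read_sql_query("SELECT * FROM purchases", con)
df = df.set_index('name')
print(df.loc['June', 'apples'])  # 3
```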
# Convert back - CSV, JSON, SQL
# CSV
df.to_csv('new_purchases.csv')
# JSON
df.to_json('new_purchases.json')
# SQL
df.to_sql('new_purchases', con)
# Most important DataFrame Operations
# View Data
# .head() .tail()
print out a few rows as visual reference
.head(): prints the first 5 rows (by default)
.tail(): prints the last 5 rows (by default)
df.head()
df.tail()
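Both methods also accept an optional row count; a small sketch with a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'rank': range(1, 11)})  # 10 rows

# the default row count is 5; pass n to override it
print(len(df.head()))   # 5
print(len(df.tail(3)))  # 3
```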
# Getting Info about Data
# .info()
provides essential information about your dataset
- number of rows & columns
- number of non-null values
- type of data
- memory usage of the df
df.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 1000 non-null int64
1 Genre 1000 non-null object
2 Description 1000 non-null object
...
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
# type(…)
Get the type of an object (a built-in Python function, not a pandas method).
type(df)
# Output: pandas.core.frame.DataFrame
# .shape
gets the dimensions of a dataframe
- an attribute (no parentheses), returning a tuple of (rows, columns)
df.shape
# Output: (1000, 11)
# .ndim
Returns an int representing the number of axes / array dimensions: 1 = Series, 2 = DataFrame.
df.ndim
# Output: 2
# .append() — outdated
returned a copy of the original dataframe + appended data. Deprecated and removed in pandas 2.0 — use pd.concat() instead.
temp = df.append(df)
# .concat()
Combines multiple dataframes; with ignore_index=True the result gets a fresh 0..n-1 index instead of repeating the originals' index values.
df_combined = pd.concat([df1, df2], ignore_index=True)
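A sketch with two toy dataframes showing what ignore_index=True does to the combined index:

```python
import pandas as pd

df1 = pd.DataFrame({'fruit': ['apple', 'orange']})
df2 = pd.DataFrame({'fruit': ['banana']})

# without ignore_index the original indices (0, 1, 0) would be kept;
# with ignore_index=True the result is renumbered 0..2
combined = pd.concat([df1, df2], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]
```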
# Sort Output
# .sort_values()
df.sort_values(by="df_column")
# standard - ascending = True
df.sort_values(by="df_column", ascending=True)
# descending
df.sort_values(by="df_column", ascending=False)
# Handling Duplicates
# .drop_duplicates
returns a copy of the dataframe with duplicate rows removed
temp = temp.drop_duplicates()
temp.drop_duplicates(inplace=True)
# removes duplicates directly in temp - with inplace=True the method
# returns None, so do not assign the result back
temp.drop_duplicates(inplace=True, keep='last')
# keep='first' (default) / 'last' / False
# keeps the first / the last / none of the duplicated rows
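A sketch of the three keep options on a toy dataframe with one duplicated row:

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'apple', 'orange']})

print(len(df.drop_duplicates()))             # 2 (keep='first' is the default)
print(len(df.drop_duplicates(keep='last')))  # 2
print(len(df.drop_duplicates(keep=False)))   # 1 (every duplicated row dropped)
```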
# Column Cleanup
# .columns
prints column names of dataset.
df.columns
# Output:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore'],
dtype='object')
# .rename(columns={…})
to rename specific columns.
df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
Or rename all columns at once with a list comprehension (iterating over a dataframe yields its column names):
df.columns = [col.lower() for col in df]
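A runnable sketch of that comprehension, lowercasing every column name of a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Rank': [1], 'Genre': ['Action']})

# iterating over a DataFrame yields its column names
df.columns = [col.lower() for col in df]
print(df.columns.tolist())  # ['rank', 'genre']
```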
# Working with missing Values
There are two options when dealing with null-values:
- Get rid of rows/columns with nulls
- Replace nulls with non-null values = “Imputation”
# .isnull()
returns a df, where each cell is either True/False (depending on cell’s null status).
df.isnull()
# To count number of nulls in each column we use an aggregate function for summing
df.isnull().sum()
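A sketch of the null count on a toy dataframe (None is stored as NaN in a float column):

```python
import pandas as pd

df = pd.DataFrame({'revenue': [1.0, None, 3.0], 'rating': [5, 6, 7]})

# .isnull() gives a True/False mask; .sum() counts the Trues per column
nulls = df.isnull().sum()
print(nulls['revenue'])  # 1
print(nulls['rating'])   # 0
```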
# .dropna()
drops any row that contains at least one null value. It returns a new df without changing the original one (inplace=True could be used).
df.dropna()
# drop columns (instead of rows) that contain null values
df.dropna(axis=1)
# Imputation
used to keep valuable data that have null values.
# .fillna()
fills null values.
# extracting column from df & storing it in variable
revenue = movies_df['revenue_millions']
# filling missing values with the mean value
revenue_mean = revenue.mean()
revenue.fillna(revenue_mean, inplace=True)
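A sketch of mean imputation on a toy column; assigning the filled column back (rather than using inplace=True on an extracted Series) avoids chained-assignment pitfalls:

```python
import pandas as pd

df = pd.DataFrame({'revenue_millions': [10.0, None, 30.0]})

# .mean() ignores NaN, so the mean here is 20.0
mean = df['revenue_millions'].mean()
df['revenue_millions'] = df['revenue_millions'].fillna(mean)
print(df['revenue_millions'].tolist())  # [10.0, 20.0, 30.0]
```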
# Understand Variables
# .describe()
to get a summary of distribution of variables.
df.describe()
# get description of specific variable
df['genre'].describe()
# example: count of all, mean-value, min-value, median, ...
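A sketch on a toy numeric Series, indexing into the summary that describe() returns:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# describe() returns count, mean, std, min, quartiles, max
desc = s.describe()
print(desc['count'], desc['mean'])  # 4.0 2.5
```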
# .value_counts()
tell us frequency of all values in a column.
df['genre'].value_counts().head(10)
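A sketch on a toy genre column; value_counts() sorts the result with the most frequent value first:

```python
import pandas as pd

genres = pd.Series(['Action', 'Action', 'Drama'])

counts = genres.value_counts()
print(counts['Action'])  # 2
print(counts.index[0])   # Action (most frequent value comes first)
```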
# .corr()
'correlation' method: computes the pairwise correlation between all continuous (numeric) variables. In newer pandas versions, pass numeric_only=True if the df also contains non-numeric columns.
df.corr()
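A sketch on a toy dataframe where y is perfectly linear in x, so their correlation is 1:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})

corr = df.corr()
print(round(corr.loc['x', 'y'], 6))  # 1.0
```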
# DataFrame slicing, selecting, extracting
# Slicing of Columns
extract values:
genre_col = df['genre']
# this returns a Series
# - to extract a column as df - you need to pass a list of column names
genre_col = df[['genre']]
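A sketch showing the type difference between the two selection styles on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Action', 'Drama'], 'rating': [8.1, 7.3]})

print(type(df['genre']).__name__)    # Series
print(type(df[['genre']]).__name__)  # DataFrame
```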
# Slicing of Rows
# .loc[]/.iloc[]
.loc[] - locates rows by index label (name)
.iloc[] - locates rows by numerical position
prom = df.loc["Prometheus"]
prom = df.iloc[1]
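A sketch with a toy two-row dataframe where both lookups reach the same row:

```python
import pandas as pd

df = pd.DataFrame({'rank': [1, 2]},
                  index=['Guardians of the Galaxy', 'Prometheus'])

by_name = df.loc['Prometheus']  # label-based lookup
by_position = df.iloc[1]        # positional lookup
print(by_name['rank'], by_position['rank'])  # 2 2
```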