Definitions


# Usage of this Document

Introduction to:

  • preparation for using the pandas library
  • pandas commands & explanations

# Preparations

# Import Module

Import the pandas module:

import pandas as pd

# Manual

When writing a ? behind a method (in IPython/Jupyter), you get an explanation of what the specific method does. Writing ?? behind a method additionally shows the source code of the function. Calling help() gives similar information.

df.head?
df.head??
 
help(pd.DataFrame)
help(df_movies)

# Create DataFrame

Create a dict (from scratch) & pass it to the pandas DataFrame constructor:

  • {key : value}
data = {
		'apples': [3, 2, 0, 1],
		'oranges': [0, 3, 7, 2]
}
 
purchases = pd.DataFrame(data)

As you can see, the index of this dataframe was assigned on creation as the numbers 0-3. We could also create our own index when initializing the dataframe:

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
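
The dataframe now uses the names as its index; printing it should look roughly like this:

purchases
 
# Output:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2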

# Add specific column-name

  • …, columns=…
df_tweets = pd.DataFrame(tweets_lower, columns=["tweets"])
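
Here, tweets_lower is assumed to be a plain Python list of (lowercased) strings, e.g. this hypothetical example:

tweets_lower = ["good movie night", "loving pandas"]  # hypothetical example data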

# Read data

It is quite simple to load data from various file formats into a dataframe.

# from CSV

note: a CSV doesn’t store an index like our dataframes, so pandas assigns one when reading

df = pd.read_csv('purchases.csv')

assign index:

df = pd.read_csv('purchases.csv', index_col=0)

# from JSON

A JSON file is essentially a stored Python dict, so pandas can read it easily.

df = pd.read_json('purchases.json')

# from SQL Database

First, establish a connection using a database-specific Python library. Here, we use SQLite as the database.

Make sure the matching driver is installed (e.g. pysqlite3 for SQLite or psycopg2 for PostgreSQL).

import sqlite3
 
con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)
 
# set the 'index' column as the dataframe index
df = df.set_index('index')
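
It is good practice to close the connection once the data has been loaded:

con.close()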

# Convert back - CSV, JSON, SQL

# CSV
df.to_csv('new_purchases.csv')
# JSON
df.to_json('new_purchases.json')
# SQL
df.to_sql('new_purchases', con)
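
A couple of optional parameters worth knowing (a sketch, not required): index=False skips writing the index column, and if_exists controls what happens when the SQL table already exists.

# CSV without the index column
df.to_csv('new_purchases.csv', index=False)
# SQL - replace the table if it already exists
df.to_sql('new_purchases', con, if_exists='replace')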

# Most important DataFrame Operations

# View Data

# .head() .tail()

print out a few rows as a visual reference

  • .head(): prints the first 5 rows (by default)
  • .tail(): prints the last 5 rows (by default)
df.head()
df.tail()
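
Both methods also accept the number of rows to show as an argument:

df.head(10)
df.tail(3)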

# Getting Info about Data

# .info()

provides essential information about your dataset

  • number of rows & columns
  • number of non-null values
  • type of data
  • memory usage of the df
df.info()
 
# Output:
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Genre               1000 non-null   object 
 2   Description         1000 non-null   object
	...
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB

# type(…)

Get the type of an object (type() is a Python built-in, not a pandas method).

type(df)
 
# Output: pandas.core.frame.DataFrame 

# .shape

gets dimensions of a dataframe

  • it returns a tuple representing the number of rows & columns
  • note: .shape is an attribute, not a method, so it is used without parentheses
df.shape
 
# Output: (1000, 11)

# .ndim

Returns an int representing the number of axes / array dimensions: 1 = Series, 2 = DataFrame.

df.ndim
# Output: 2
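
For comparison, a single column (a Series) has one dimension; the column name 'Rank' is assumed from the .info() output above:

df['Rank'].ndim
# Output: 1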

# .append() (deprecated, removed in pandas 2.0)

returns a copy of the original dataframe plus the appended data; use pd.concat() instead

temp = df.append(df)

# .concat()

Combines two dataframes; ignore_index=True makes sure the index values continue instead of being repeated.

df_combined = pd.concat([df1, df2], ignore_index=True)
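
As a sketch, the .append() example above (appending a dataframe to itself) would now be written as:

temp = pd.concat([df, df], ignore_index=True)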

# Sort Output

# .sort_values()

df.sort_values(by="df_column")
 
# default: ascending=True
df.sort_values(by="df_column", ascending=True)
 
# descending
df.sort_values(by="df_column", ascending=False)

# Handling Duplicates

# .drop_duplicates()

returns a copy of the dataframe with duplicates removed

temp = temp.drop_duplicates()
 
temp.drop_duplicates(inplace=True)
# removes duplicates, but applies this directly to the df - not returning a copy
 
temp.drop_duplicates(inplace=True, keep='last')
# removes duplicates directly on the df (no copy) -
# keep='first' (default) / 'last' / False (drops all duplicates)
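
To check how many duplicate rows exist before dropping them, .duplicated() returns a boolean Series that can be summed:

temp.duplicated().sum()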

# Column Cleanup

# .columns

prints column names of dataset.

df.columns
 
# Output: 
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

# .rename(columns={specific-column})

to rename specific columns.

df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)

A list comprehension can also be used, e.g. to lowercase all column names:

df.columns = [col.lower() for col in df.columns]

# Working with missing Values

There are two options when dealing with null-values:

  1. Get rid of rows/columns with nulls
  2. Replace nulls with non-null values = “imputation”

# .isnull()

returns a df, where each cell is either True/False (depending on cell’s null status).

df.isnull()
 
# To count number of nulls in each column we use an aggregate function for summing
df.isnull().sum()

# .dropna()

Deletes any row that contains at least one null value. It returns a new df without changing the original one (inplace=True could be used).

df.dropna()
 
# drop columns with null values instead of rows
df.dropna(axis=1)
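
Another useful option is subset=, which restricts the null check to specific columns (column name assumed from the renamed dataset above):

df.dropna(subset=['revenue_millions'])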

# Imputation

used to keep valuable data that have null values.

# .fillna()

fills null values.

# extracting column from df & storing it in variable
revenue = movies_df['revenue_millions']
 
# filling missing values with the mean value
revenue_mean = revenue.mean()
revenue.fillna(revenue_mean, inplace=True)
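
The same imputation without inplace=True, by assigning the result back to the column:

movies_df['revenue_millions'] = movies_df['revenue_millions'].fillna(revenue_mean)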

# Understand Variables

# .describe()

to get a summary of the distribution of variables.

df.describe()
 
# get description of specific variable
df['genre'].describe()
 
# example: for numeric columns - count, mean, std, min, quartiles (incl. median), max;
# for object columns like 'genre' - count, unique, top, freq
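
By default .describe() on a whole dataframe summarizes numeric columns; a sketch of including non-numeric columns as well:

df.describe(include='all')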

# .value_counts()

tells us the frequency of all values in a column.

df['genre'].value_counts().head(10)

# .corr()

‘correlation’ method - generates the correlation coefficient between each pair of continuous variables.

df.corr()
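
Note: newer pandas versions no longer silently drop non-numeric columns here; a sketch of restricting the calculation explicitly:

df.corr(numeric_only=True)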

# DataFrame slicing, selecting, extracting

# Slicing of Columns

extract values:

genre_col = df['genre']
# this returns a Series
 
# to extract a column as a df, pass a list of column names
genre_col = df[['genre']]
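
Passing a list also selects several columns at once (column names assumed from the lowercased dataset):

subset = df[['genre', 'rating']]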

# Slicing of Rows

# .loc[]/.iloc[]

  • .loc[] locates by name
  • .iloc[] locates by numerical index

prom = df.loc["Prometheus"]
 
prom = df.iloc[1]
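
Both also accept slices for multiple rows; with .iloc[] the end index is exclusive (like Python list slicing), while .loc[] label slices include the end label:

some_rows = df.iloc[1:4]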