# Usage of this Document
An introduction to:
- preparation for using the pandas library
- pandas commands & explanations
# Preparations
# Import Module
Import the pandas module.
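The conventional import, aliasing the module as pd:

```python
import pandas as pd  # conventional alias used throughout this document
```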
# Manual
Writing a ?
behind a method gives an explanation of what that specific method does.
There is also the option to write ??
behind a method to see (some of) its source code.
Writing help(...)
around an object gives information as well.
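For example, in IPython/Jupyter (where the ? / ?? syntax is available); help() also works in plain Python:

```python
import pandas as pd

pd.DataFrame?       # docstring: what the DataFrame constructor does
pd.DataFrame??      # docstring plus (part of) the source code
help(pd.DataFrame)  # plain-Python equivalent
```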
# Create DataFrame
Create a dataframe from scratch by passing a dict to the pandas DataFrame constructor:
- {key : value}
As you can see, the index of this dataframe was assigned on creation as the numbers 0-3. We could also define our own index when initializing the dataframe:
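A minimal sketch with made-up data (column names, values, and index labels are hypothetical):

```python
import pandas as pd

data = {
    'apples': [3, 2, 0, 1],   # {key: value} -> the key becomes the column name
    'oranges': [0, 3, 7, 2],
}

df = pd.DataFrame(data)  # default index: 0, 1, 2, 3

# same data, but with our own index
df_named = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
```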
# Add specific column-name
- …, columns=…
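A sketch of passing the column names explicitly (data and names are hypothetical):

```python
import pandas as pd

rows = [[3, 0], [2, 3], [0, 7], [1, 2]]
df = pd.DataFrame(rows, columns=['apples', 'oranges'])  # columns=... names the columns
```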
# Read data
It is quite simple to load data from various file formats into a dataframe.
# from CSV
Note: a CSV doesn’t have an index like our dataframes do.
Assign an index when loading:
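A sketch, assuming a hypothetical file purchases.csv whose first column should serve as the index:

```python
import pandas as pd

df = pd.read_csv('purchases.csv')               # pandas generates a fresh 0..n-1 index
df = pd.read_csv('purchases.csv', index_col=0)  # use the first CSV column as the index instead
```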
# from JSON
A JSON file is essentially a stored Python dict
- pandas can read this easily
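A sketch with a hypothetical purchases.json file:

```python
import pandas as pd

df = pd.read_json('purchases.json')  # keys become columns, the index is inferred
```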
# from SQL Database
First, establish a connection using the appropriate Python library. Here, we use SQLite as the database.
Make sure to install pysqlite3 (or psycopg2 when working with PostgreSQL).
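A minimal sketch using the standard-library sqlite3 module (database file, table, and column names are hypothetical):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect('database.db')                    # open the SQLite database file
df = pd.read_sql_query('SELECT * FROM purchases', con)  # run SQL and get a DataFrame back
df = df.set_index('index')                              # optional: promote the 'index' column to the df index
con.close()
```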
# Convert back - CSV, JSON, SQL
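A sketch of writing a DataFrame back out; df and the open SQL connection con are assumed from above, while filenames and the table name are hypothetical:

```python
df.to_csv('new_purchases.csv')    # write CSV
df.to_json('new_purchases.json')  # write JSON
df.to_sql('new_purchases', con)   # write into a new SQL table via the open connection
```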
# Most important DataFrame Operations
# View Data
# .head() .tail()
print out a few rows as visual reference
- .head(): prints the first 5 rows (by default)
- .tail(): prints the last 5 rows (by default)
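Assuming the df from above:

```python
df.head()    # first 5 rows
df.head(10)  # first 10 rows
df.tail(3)   # last 3 rows
```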
# Getting Info about Data
# .info()
provides essential information about your dataset
- number of rows & columns
- number of non-null values
- type of data
- memory usage of the df
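Assuming the df from above:

```python
df.info()  # rows, columns, non-null counts, dtypes, memory usage
```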
# type(…)
Get the type of an object with Python’s built-in type() function.
# .shape
gets the dimensions of a dataframe
- it is an attribute (no parentheses) and returns a tuple representing the number of rows & columns
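Assuming the df from above:

```python
df.shape  # e.g. (4, 2): 4 rows, 2 columns
```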
# .ndim
Returns an int representing the number of axes / array dimensions: 1 = Series, 2 = DataFrame.
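Assuming the df (and its hypothetical 'apples' column) from above:

```python
df.ndim            # 2 -> DataFrame
df['apples'].ndim  # 1 -> Series
```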
# .append() (deprecated)
returns a copy of the original dataframe with the appended data. DataFrame.append() was deprecated and removed in pandas 2.0; use pd.concat() instead.
# pd.concat()
Combines two dataframes; pass ignore_index=True to make the index values continue as one fresh 0..n-1 range instead of keeping the original indices.
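A sketch with two small hypothetical dataframes:

```python
import pandas as pd

df1 = pd.DataFrame({'apples': [3, 2]})
df2 = pd.DataFrame({'apples': [0, 1]})

pd.concat([df1, df2])                     # keeps the original indices: 0, 1, 0, 1
pd.concat([df1, df2], ignore_index=True)  # continuous new index: 0, 1, 2, 3
```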
# Sort Output
# .sort_values()
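A sketch, assuming the df and its hypothetical 'apples' column from above:

```python
df.sort_values(by='apples')                   # ascending (default)
df.sort_values(by='apples', ascending=False)  # descending
```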
# Handling Duplicates
# .drop_duplicates()
returns a copy of the dataframe with duplicate rows removed
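Assuming the df from above:

```python
deduped = df.drop_duplicates()                 # copy without duplicate rows
df.drop_duplicates(keep='last', inplace=True)  # or modify df directly, keeping the last occurrence
```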
# Column Cleanup
# .columns
returns the column names of the dataset.
# .rename(columns={...})
to rename specific columns.
Column names can also be set with a list/dict comprehension, as shown below.
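A sketch of all three approaches (column names are hypothetical):

```python
df.columns  # current column names

# rename one specific column via a dict
df = df.rename(columns={'apples': 'apple_count'})

# or rebuild all names with a list comprehension, e.g. lowercase everything
df.columns = [col.lower() for col in df.columns]
```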
# Working with missing Values
There are two options when dealing with null-values:
- Get rid of rows/columns with nulls
- Replace nulls with non-null values = “Imputation”
# .isnull()
returns a df, where each cell is either True/False (depending on cell’s null status).
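Assuming the df from above:

```python
df.isnull()        # True/False per cell
df.isnull().sum()  # number of nulls per column
```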
# .dropna()
deletes any row that contains at least one null value.
It returns a new df without changing the original one (inplace=True
could be used instead).
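Assuming the df from above:

```python
cleaned = df.dropna()        # drop every row that contains a null
cleaned = df.dropna(axis=1)  # drop columns with nulls instead of rows
df.dropna(inplace=True)      # or change df directly
```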
# Imputation
Imputation keeps rows that hold valuable data but contain some null values, by filling in those nulls instead of dropping the rows.
# .fillna()
fills null values.
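A common imputation sketch: fill nulls in a hypothetical numeric column with that column's mean:

```python
mean_value = df['apples'].mean()                # mean of the non-null values
df['apples'] = df['apples'].fillna(mean_value)  # replace the nulls with it
```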
# Understand Variables
# .describe()
gives a summary of the distribution of the (numeric) variables.
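Assuming the df from above:

```python
df.describe()            # count, mean, std, min, quartiles, max per numeric column
df['apples'].describe()  # same summary for a single (hypothetical) column
```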
# .value_counts()
tells us the frequency of all values in a column.
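Assuming the hypothetical 'apples' column from above:

```python
df['apples'].value_counts()  # frequency of each distinct value in the column
```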
# .corr()
The ‘correlation’ method: generates the pairwise relationship (correlation coefficient) between each pair of continuous variables.
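Assuming the df from above; in newer pandas versions, non-numeric columns should be excluded explicitly:

```python
df.corr(numeric_only=True)  # pairwise correlation matrix of the numeric columns
```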
# DataFrame slicing, selecting, extracting
# Slicing of Columns
Extract values from one or more columns:
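Assuming the df with the hypothetical 'apples' and 'oranges' columns from above:

```python
col = df['apples']               # single column -> Series
sub = df[['apples', 'oranges']]  # list of columns -> DataFrame
```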
# Slicing of Rows
# .loc[]/.iloc[]
- .loc[]: locates rows by index name/label
- .iloc[]: locates rows by numerical index (position)
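A sketch using df_named from the creation example above (its index labels are hypothetical):

```python
df_named.loc['June']  # row selected by its index label
df_named.iloc[0]      # the same row selected by its numerical position
df_named.iloc[1:3]    # slice of rows by position (end exclusive)
```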