Data manipulation with Pandas (Python)

April 23, 2016


This is a reference summary of the great Pandas library, which offers efficient manipulation of spreadsheet-like data in Python.

Data Structures

Series

ref

pd.Series([1,3,5,np.nan,6,8])                  #from list
pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])  #from array
pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})      #from dict

Dataframe

ref

pd.DataFrame(np.random.randn(6,4))                   #from ndarray
pd.DataFrame({'A':1.0,'B':pd.Timestamp('20130102')}, index=[0]) #from dict (all-scalar values require an index)

Other ways to construct and other constructors (from_dict, from_records, from_items) exist.
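
A minimal sketch of these alternatives (column and row labels are made up; from_items was later deprecated):

pd.DataFrame.from_dict({'A': [1, 2], 'B': [3, 4]})                    #keys become columns
pd.DataFrame.from_dict({'r1': [1, 3], 'r2': [2, 4]}, orient='index')  #keys become rows
pd.DataFrame.from_records([(1, 'a'), (2, 'b')], columns=['num', 'letter'])
pd.DataFrame.from_items([('A', [1, 2]), ('B', [3, 4])])               #list of (colname, values) pairs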

Panel

ref. Container for 3-dimensional data (see the construction sketch after this list). Axis names:

  • items: axis 0, each item corresponds to a DataFrame contained inside.
  • major_axis: axis 1, it is the index (rows) of each of the DataFrames
  • minor_axis: axis 2, it is the columns of each of the DataFrames
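
A minimal construction sketch, following the docs (item and axis labels are made up):

pd.Panel(np.random.randn(2, 5, 4),
         items=['Item1', 'Item2'],
         major_axis=pd.date_range('20000101', periods=5),
         minor_axis=['A', 'B', 'C', 'D'])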

Summarizing and basic values

Summary:

df.info()
print(df)
df.to_string()     #tabular form

df.describe()      #itself returns a df
s.describe(percentiles=[.05, .25, .75, .95])
df.describe(include='all')    #for mixed type df-s

s.value_counts()
pd.value_counts(s.values)     #top-level function version, takes an array

df.head()          #first 5 rows
df.tail()          #last 5 rows

Values:

df.index           #row index, axis=0
df.columns         #col index, axis=1
df.values          #returns a numpy array

Indexes

Creating:

pd.Index(['e', 'd', 'a', 'b'])
pd.Index(['e', 'd', 'a', 'b'], name='something')
pd.DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04'],
              dtype='datetime64[ns]', freq='D', tz=None)

Multiindex

ref

pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']])
pd.MultiIndex.from_tuples([('bar','one'), ('bar','two')], names=['first', 'second'])
index.get_level_values(0)
index.get_level_values('second')

Setting

data.index = index      #set an ad-hoc index
df.set_index('colB')    #make a given column (or several) the index
df.reset_index()        #resets to the default integer index; previously called delevel (deprecated)

Uniqueness:

df.index.is_unique

Index values of the min/max elements (note that idxmin and idxmax are called argmin and argmax in NumPy):

s1.idxmin(), s1.idxmax()
df1.idxmin(axis=0)

Set operations

ref (note that the + and - operators were deprecated for set operations on indexes).

a | b            #a.union(b)
a & b            #a.intersection(b)
a.difference(b)  
idx1 ^ idx2      #idx1.sym_diff(idx2)

Reindex

ref

s.reindex(['e', 'b', 'f', 'd'])     #if f was missing, it's added with NaN values
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
ts2.reindex(ts.index, method='ffill')   #filling while reindexing

df.reindex_like(df2)     #reindex to match the labels of another object
s1.align(s2)             #align two objects on their indexes simultaneously

Metadata:

  • Indexes are “mostly immutable”, but it is possible to set and change their metadata. You can use the rename, set_names, set_levels, and set_labels methods to set these attributes directly, as sketched below.
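
A short sketch of setting this metadata (names are made up):

ind = pd.Index([1, 2, 3])
ind.rename('apple')                     #returns a copy with the new name
ind.set_names(['apple'], inplace=True)  #sets the name in place
ind.name = 'bob'                        #direct assignment also works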

Types

The main types stored in pandas objects are float, int, bool, datetime64[ns], timedelta64[ns] and object.

df.dtypes
df.get_dtype_counts()              #returns the number of columns of each type
df.select_dtypes(include=[bool])
df.select_dtypes(include=['number', 'bool'])

Conversion: try to force conversion of types from the object dtype to other types.

df.convert_objects(convert_numeric=True)
df['D'].astype('float16')
df['B'].astype('category', categories=[1,2,3])

To force conversion to datetime64[ns], pass convert_dates='coerce'. This will convert any datetime-like object to dates, forcing other values to NaT.
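
For example (note that convert_objects was later deprecated in favor of the top-level conversion functions):

df.convert_objects(convert_dates='coerce')
pd.to_datetime(s, errors='coerce')          #top-level function, for a single column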

Datetime, timedelta

pd.date_range('20130101 09:10:12',periods=4)
pd.period_range('20130101', periods=4,freq='D')
pd.timedelta_range('1 day 00:00:05',periods=4,freq='s')

Selecting

[ ] and . operators

They provide intuitive dict-like or numpy-like syntax, but have optimization limits. For production code, it is recommended to take advantage of the optimized pandas data access methods below.

s['a']                      #dict-like
df['A']                     #select a column; returns Series
df.A                        #like df['A'], as long as colname is a valid identifier  
df[0:3]                     #row position selection
df['20130102':'20130104']   #row index selection
df.loc[:,'A']               #col index selection (plain df[:,'A'] is not supported)

Boolean indexing

ref. Operators: | (for or), & (for and), and ~ (for not)

s[s > 0]
s[~(s < 0)]
df[df['A'] > 0]
s[s.isin([2, 4, 6])]
s[s.index.isin([2, 4, 6])]
df.loc[:,df.loc['a']>0] 

By either label or position; the .ix operator

Note that when an axis is integer based, ONLY label-based access (not positional access) is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc, as illustrated below.
For all of .ix, .loc and .iloc, axes left out of the specification are assumed to be : (p.loc['a'] is equivalent to p.loc['a', :, :]).
[deprecated: irow, icol, iget_value]
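
A sketch of the label/position difference on an integer-based index:

s = pd.Series(np.random.randn(3), index=[49, 48, 47])
s.loc[49]     #by label: the first element
s.iloc[0]     #by position: also the first element
s.iloc[49]    #raises IndexError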

By label (.loc)

Note that both the start and the end bounds are included!

df.loc['a']                 #cross section, equiv to `df.xs('a')`
df.loc[:'c']                #index labels up to 'c' (included!)
df.loc[5]                   #index label=5 , not the 5th element.
df.loc[['a','b','d'],:]  

By position (.iloc)

Note that the upper bound is excluded!

Note that out-of-range slice indexes are handled gracefully (i.e., they return an empty object instead of raising an error). However, a single indexer that is out of bounds will raise an IndexError. See the sketch after the examples below.

df.iloc[1]                  #cross-section (second row), equiv to `df.xs(1)`
s.iloc[:3]                  #first three elements
df.iloc[1:5,2:4]
df.iloc[[1,3,5],[1,3]]
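
For instance, assuming df has fewer than 100 rows:

df.iloc[100:110]   #out-of-range slice: returns an empty DataFrame
df.iloc[100]       #single out-of-bounds indexer: raises IndexError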

Scalar value access (at(), iat())

ref

df.at['foo', 'A']          #label based
df.iat[3, 0]               #position based
df.at[dates[-1]+1, 'Z'] = 7  #appends column 'Z' and sets the given element; the rest of the cells in that row and column are NaN

where() and mask() methods

ref. By default, where() returns a copy (there is an inplace argument), and the output has the same shape as the original data. For df-s, df.where(df < 0) is equivalent to df[df < 0].

mask() is the inverse boolean operation of where().

s.where(s > 0)         #Includes NaNs. Compare with s[s > 0], that would skip NaN-s
df.where(df < 0)
df.where(df < 0, -df)  #Replaces False cases by the given 'other' argument
df.mask(df >= 0)       #returns NaN where the condition is met

get() method

ref. Using the get method, a missing label returns the specified default (or None if no default is given):

s.get('x', default=-1)

select() & lookup() methods

The select method should be used only when there is no more direct way. select takes a function that operates on labels along an axis and returns a boolean:

df.select(lambda x: x == 'A', axis=1)

The lookup method is a label-based “fancy indexing” function for DataFrame. Given equal-length arrays of row and column labels, return an array of the values corresponding to each (row, col) pair.
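
A small sketch (row and column labels are made up):

df = pd.DataFrame(np.random.randn(3, 2), index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])
df.lookup(['r1', 'r3'], ['c2', 'c1'])   #array with the values at (r1, c2) and (r3, c1)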

Datetime-like properties of series: dt

ref

s.dt.hour
s[s.dt.day==2]

assign() (dplyr’s mutate)

ref

iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
iris.assign(sepal_ratio = lambda x: x['SepalWidth'] / x['SepalLength'])

query() (experimental)

ref. Selection using an expression. A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common, as sketched after the examples below.

df.query('(a < b) & (b < c)')
df.query('a < b < c')
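
The shared-columns use case can be sketched as follows (df1 and df2 are assumed to share columns a and b):

expr = 'a < b'
[frame.query(expr) for frame in (df1, df2)]   #same expression applied to each frame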

Special types of selection

Missing data

np.NaN, NaT, and None are considered missing by the isnull function.

pd.isnull(df)
df.isnull()           #boolean array
df.A.notnull()        #boolean array
df.dropna()           #return df without na values

Dealing with duplicate data

df.duplicated(['colA', 'colB'])    #boolean vector
df.drop_duplicates()
s.nunique()

Random sample

Note that weights can be assigned (with an array, list, Series, or df column). random_state sets the seed.

s.sample()                    #one sampled row from series
s.sample(n=6, replace=True)   #6 sampled rows with replacement

Discretization and quantiling

pd.cut(s, 4)       #bins based
pd.qcut(s, 4)      #quantile based
pd.cut(arr, [-np.inf, 0, np.inf])

Modifying

Setting:
The .loc/.ix/[] operations can perform enlargement when setting a non-existent key for that axis.
Setting in general works by assignment after selection of an item or slice, as sketched below.
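
A sketch of both patterns (column and row labels are made up):

df.loc[:, 'new_col'] = df['A'] * 2   #setting a non-existent column enlarges the df
df.loc['new_row'] = 0                #likewise for a non-existent row label
df.loc[df['A'] > 0, 'B'] = -1        #assignment after (boolean) selection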

Rename

s.rename(str.upper)
df.rename(columns={'data':'Data', 'quality':'Quality'}, inplace=True)

Delete

del df['two']
df.drop(['a', 'd'], axis=0, inplace=True)      #remove rows by label (use axis=1 for columns)
df.pop('three')

Sort

df.sort_index(axis=1, ascending=False)   
df.sort_values(by='B')     #formerly df.sort(columns='B'), deprecated
s.sort_values()            #formerly s.order(), deprecated
s.nsmallest()
s.nlargest()

Arithmetics, logical operations, descriptive statistics

df.T.dot(df)       #matrix multiplication
s1.dot(s1)         #dot product of vectors
df[:5].T           #transpose

(df+df).equals(df*2)   #equals treats NaNs as equal, as opposed to ==

df.count                #Number of non-null observations
df.sum                  #Sum of values
df.mean                 #Mean of values
df.mad                  #Mean absolute deviation
df.median               #Arithmetic median of values
df.min                  #Minimum
df.max                  #Maximum
df.mode                 #Mode
df.abs                  #Absolute Value
df.prod                 #Product of values
df.std                  #Unbiased standard deviation
df.var                  #Unbiased variance
df.sem                  #Unbiased standard error of the mean
df.skew                 #Unbiased skewness (3rd moment)
df.kurt                 #Unbiased kurtosis (4th moment)
df.quantile             #Sample quantile (value at %)
df.cumsum               #Cumulative sum
df.cumprod              #Cumulative product
df.cummax               #Cumulative maximum
df.cummin               #Cumulative minimum

s.nunique()             #number of unique non-null values in a Series
s.value_counts()        #computes a histogram of a 1D array of values

Covariance and Correlation ref

s.cov(s2)               #covariance between series
df.cov()                #covariance matrix
df['a'].corr(df['b'], method='spearman')    #correlation
df.rank(1)              #Rank columns

Moving (rolling) statistics ref

rolling_count       #Number of non-null observations
rolling_sum         #Sum of values
rolling_mean        #Mean of values
etc
rolling_window      #Moving window function

ewma                #Exponentially weighted (EW) moving average
ewmstd              #EW moving standard deviation
etc
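
A usage sketch; note that since pandas 0.18 these functions are being replaced by the .rolling()/.ewm() methods:

pd.rolling_mean(s, window=5)      #legacy function interface
s.rolling(window=5).mean()        #equivalent method interface (0.18+)
s.ewm(span=20).mean()             #replaces ewma(s, span=20)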

Group by (split-apply-combine)

Group by

ref

Grouping

grouped = obj.groupby(key)
grouped.get_group('bar')
grouped.describe()

The groups attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group.

Iteration can be done with for name, group in grouped: ...
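
For example, assuming a grouping column 'A' with values 'foo' and 'bar':

grouped = df.groupby('A')
grouped.groups                    #e.g. {'bar': [1, 3, 5], 'foo': [0, 2, 4]}
for name, group in grouped:
    print(name)                   #the group key
    print(group)                  #the corresponding sub-DataFrame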

Aggregation

grouped.mean().sort_values('C')   #"dispatches" method calls to the groups (sort was deprecated)
grouped.aggregate(np.sum)
grouped.size()

Transform
This returns an object that is indexed the same (i.e. of the same size) as the one being grouped:

t = df.groupby('year').transform(lambda x: (x - x.mean()) / x.std())
t.groupby(df['year']).mean()   #should be ~0 within each group (and .std() should be ~1)

Filter
Returns a subset of the original object:

sf.groupby(sf).filter(lambda x: x.sum() > 2)
dff.groupby('B').filter(lambda x: len(x) > 2)

Function application & iteration

Tablewise Function Application: pipe()

ref:

df.pipe(h).pipe(g, arg1=1).pipe(f, arg2=2, arg3=3) #equivalent to f(g(h(df), arg1=1), arg2=2, arg3=3)

Row or Column-wise Function Application: apply()

ref:

Apply arbitrary functions along the axes of a df or panel:

df.apply(np.mean)                     #'vertical'
df.apply(np.mean, axis=1)             #'horizontal'
df.apply(lambda x: x.max() - x.min())
df.apply(np.exp)
df.apply(pd.Series.interpolate)

Elementwise function application: applymap() and map()

ref:

The function applymap() on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value:

df.applymap(lambda x: len(str(x)) )
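
Similarly for a Series, as a sketch:

s.map(lambda x: len(str(x)))
s.map({'cat': 'kitten', 'dog': 'puppy'})   #map also accepts a dict or a Series as the mapping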

Iteration

ref

for col in df: print(col)
for col, series in df.iteritems(): ...
for row_index, row in df2.iterrows(): ...
for r in df2.itertuples(): ...              #iterator yielding a tuple with all elements in each row

Vectorized string methods

ref. Series is equipped with a set of string processing methods. These also apply to Index objects (e.g. df.columns).

s.str.lower()
df.columns.str.strip()
df.columns.str.replace(' ','_')

s.str.split('_')                     #returns a Series of lists
s.str.split('_', expand=True)        #returns df
s.str.extract                        #uses regexps
s.str.replace                        #uses regexps
s.str.findall                        #uses regexps

and many other methods ref

Combining datasets

df1.combine_first(df2)   #combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame

I/O

df = pd.read_csv('./data/ex_data1.csv', parse_dates=True, index_col=0)
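
Some commonly used counterparts, as a sketch (file names are made up):

df.to_csv('out.csv')                          #write back to csv
pd.read_excel('data.xlsx')                    #Excel I/O (requires xlrd/openpyxl)
df.to_excel('out.xlsx', sheet_name='Sheet1')
pd.read_hdf('store.h5', 'df')                 #HDF5, requires PyTables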


