Data manipulation with Pandas (Python)
April 23, 2016
This is a reference summary of the great Pandas library, which offers efficient manipulation of spreadsheet-like data in Python.
Data Structures
Series
pd.Series([1,3,5,np.nan,6,8]) #from list
pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e']) #from array
pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}) #from dict
Dataframe
pd.DataFrame(np.random.randn(6,4)) #from ndarray
pd.DataFrame({'A':1.0,'B':pd.Timestamp('20130102')}) #from dict
Other ways to construct and alternative constructors (from_dict, from_records, from_items) exist.
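A minimal sketch of two of these alternative constructors (the column names are made up for illustration):
pd.DataFrame.from_dict({'A': [1, 2], 'B': [3, 4]}) #dict keys become columns
pd.DataFrame.from_records([(1, 'a'), (2, 'b')], columns=['num', 'letter']) #rows from tuples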
Panel
ref. Container for 3-dimensional data. Axis names:
- items: axis 0, each item corresponds to a DataFrame contained inside.
- major_axis: axis 1, it is the index (rows) of each of the DataFrames
- minor_axis: axis 2, it is the columns of each of the DataFrames
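A minimal construction sketch (the axis labels are chosen arbitrarily):
pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
         major_axis=pd.date_range('20000101', periods=5),
         minor_axis=['A', 'B', 'C', 'D'])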
Summarizing and basic values
Summary:
df.info()
print(df)
df.to_string() #tabular form
df.describe() #itself returns a df
s.describe(percentiles=[.05, .25, .75, .95])
df.describe(include='all') #for mixed type df-s
s.value_counts()
pd.value_counts(values) #top-level function, takes an array-like
df.head() #first 5 rows by default
df.tail() #last 5 rows by default
Values:
df.index #row index, axis=0
df.columns #col index, axis=1
df.values #returns a numpy array
Indexes
Creating:
pd.Index(['e', 'd', 'a', 'b'])
pd.Index(['e', 'd', 'a', 'b'], name='something')
pd.DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04'],
dtype='datetime64[ns]', freq='D', tz=None)
Multiindex
pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']])
pd.MultiIndex.from_tuples([('bar','one'), ('bar','two')], names=['first', 'second'])
index.get_level_values(0)
index.get_level_values('second')
Setting
data.index = index #set an ad-hoc index
df.set_index('colB') #make a given column (or several) an index
df.reset_index() #resets to the default integer index (previously called delevel, now deprecated)
Uniqueness:
df.index.is_unique
Index values of the Min/Max elements (note idxmin and idxmax are called argmin and argmax in NumPy)
s1.idxmin(), s1.idxmax()
df1.idxmin(axis=0)
Set operations
ref (note that the + and - operators were deprecated for set operations on indexes).
a | b #a.union(b)
a & b #a.intersection(b)
a.difference(b)
idx1 ^ idx2 #idx1.sym_diff(idx2)
Reindex
s.reindex(['e', 'b', 'f', 'd']) #if f was missing, it's added with NaN values
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
ts2.reindex(ts.index, method='ffill') #filling while reindexing
df.reindex_like(df2) #reindex to match the labels of another object
s1.align(s2, join='outer') #returns a tuple of both objects, aligned
Metadata:
- Indexes are “mostly immutable”, but it is possible to set and change their metadata. You can use rename, set_names, set_levels, and set_labels to set these attributes directly.
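A quick sketch (the index name is arbitrary); note these return a new Index rather than mutating in place:
ind = pd.Index([1, 2, 3], name='x')
ind.rename('y') #new Index named 'y'
ind.set_names(['y']) #same, list form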
Types
The main types stored in pandas objects are float, int, bool, datetime64[ns], timedelta[ns] and object.
df.dtypes
df.get_dtype_counts() #return the number of columns of each type
df.select_dtypes(include=[bool])
df.select_dtypes(include=['number', 'bool'])
Conversion
Try to force conversion of types from the object dtype to other types:
df.convert_objects(convert_numeric=True)
df['D'].astype('float16')
df['B'].astype('category', categories=[1,2,3])
To force conversion to datetime64[ns], pass convert_dates='coerce'. This will convert any datetime-like object to dates, forcing other values to NaT.
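For a single column, pd.to_datetime with errors='coerce' is an equivalent sketch (column 'D' is hypothetical):
pd.to_datetime(df['D'], errors='coerce') #unparseable values become NaT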
Datetime, timedelta
pd.date_range('20130101 09:10:12',periods=4)
pd.period_range('20130101', periods=4,freq='D')
pd.timedelta_range('1 day 00:00:05',periods=4,freq='s')
Selecting
[] and . operators
They provide intuitive dict-like or numpy-like syntax, but have optimization limits. For production code, it is recommended to take advantage of the optimized pandas data access methods below.
s['a'] #dict-like
df['A'] #select a column; returns Series
df.A #like df['A'], as long as colname is a valid identifier
df[0:3] #row position selection
df['20130102':'20130104'] #row index selection
df.loc[:,'A'] #col index selection (plain df[:,'A'] is not valid)
Boolean indexing
ref. Operators: | (for or), & (for and), and ~ (for not)
s[s > 0]
s[~(s < 0)]
df[df['A'] > 0]
s[s.isin([2, 4, 6])]
s[s.index.isin([2, 4, 6])]
df.loc[:,df.loc['a']>0]
By either label or position; the .ix operator
Note that when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
For all of .ix, .loc and .iloc, axes left out of the specification are assumed to be : (p.loc['a'] is equivalent to p.loc['a', :, :]).
[deprecated: irow, icol, iget_value]
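A minimal sketch of the label vs. position distinction on an integer index (the values are made up):
s = pd.Series([10, 20, 30], index=[49, 48, 47])
s.loc[49] #label based: 10
s.iloc[0] #position based: 10
s.ix[49] #label based here, because the index is integer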
By label (.loc)
Note that both the start and end bound are included!
df.loc['a'] #cross section, equiv to `df.xs('a')`
df.loc[:'c'] #index labels up to 'c' (included!)
df.loc[5] #index label 5, not the 5th element
df.loc[['a','b','d'],:]
By position (.iloc)
Note that the upper bound is excluded!
Note that out of range slice indexes are handled gracefully (i.e., they return an empty object instead of raising an error). However, a single indexer out of bounds will raise an IndexError.
df.iloc[1] #cross-section (second row), equiv to `df.xs(1)`
s.iloc[:3] #first three elements
df.iloc[1:5,2:4]
df.iloc[[1,3,5],[1,3]]
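A sketch of the out-of-range behavior described above (the Series length is arbitrary):
s = pd.Series(range(3))
s.iloc[10:20] #empty Series, no error
s.iloc[10] #this single indexer would raise IndexError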
Scalar value access (.at, .iat)
df.at['foo', 'A'] #label based
df.iat[3, 0] #position based
df.at[dates[-1]+1, 'Z'] = 7 #enlarges: adds column 'Z' and a new row, sets the given element; the rest of the new cells are NaN
where() and mask() methods
ref. By default, where() returns a copy (there is an inplace argument). It returns an output with the same shape as the original data. For df-s, df.where(df < 0) is equivalent to df[df < 0].
mask() is the inverse boolean operation of where().
s.where(s > 0) #keeps the original shape, with NaN where the condition fails; compare with s[s > 0], which drops those rows
df.where(df < 0)
df.where(df < 0, -df) #Replaces False cases by the given 'other' argument
df.mask(df >= 0) #returns NaN where the condition is met
get() method
ref. Using the get method, a missing label returns None or the specified default:
s.get('x', default=-1)
select() & lookup() methods
The select method should be used only when there is no more direct way. It takes a function which operates on labels along an axis and returns a boolean.
df.select(lambda x: x == 'A', axis=1)
The lookup method is a label-based “fancy indexing” function for DataFrame. Given equal-length arrays of row and column labels, return an array of the values corresponding to each (row, col) pair.
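A minimal lookup sketch (the row and column labels are made up):
df = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'c'], columns=['x', 'y'])
df.lookup(['a', 'c'], ['x', 'y']) #array of df.at['a','x'] and df.at['c','y']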
Datetime-like properties of series: dt
s.dt.hour
s[s.dt.day==2]
assign() (dplyr's mutate)
iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] / x['SepalLength']))
query() (experimental)
ref. Selection using an expression. A use case for query() is when you have a collection of DataFrame objects that share a subset of column names (or index levels/names).
df.query('(a < b) & (b < c)')
df.query('a < b < c')
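A sketch of the shared-expression use case mentioned above (the frames are hypothetical):
expr = 'a < b < c'
[frame.query(expr) for frame in [df1, df2, df3]] #same expression applied to each frame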
Special types of selections
Missing data
np.NaN, NaT, and None are considered missing by the isnull function.
pd.isnull(df)
df.isnull() #boolean array
df.A.notnull() #boolean array
df.dropna() #return df without na values
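Missing values can also be filled instead of dropped; a minimal sketch (the fill value 0 is arbitrary):
df.fillna(value=0) #replace NaN-s with 0
df.fillna(method='ffill') #propagate the last valid observation forward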
Dealing with duplicate data
df.duplicated(['colA', 'colB']) #boolean vector
df.drop_duplicates(['colA', 'colB']) #return df without the duplicate rows
s.nunique()
Random sample
Note that weights can be assigned (as an array, list, Series, or df column name). random_state sets the seed.
s.sample() #one sampled row from series
s.sample(n=6, replace=True) #6 sampled rows with replacement
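A sketch with the weights and random_state options mentioned above ('colA' is a hypothetical column):
df.sample(n=2, weights='colA', random_state=42) #reproducible, weighted by 'colA'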
Discretization and quantiling
pd.cut(s, 4) #bins based
pd.qcut(s, 4) #quantile based
pd.cut(arr, [-np.inf, 0, np.inf])
Modifying
Setting:
The .loc/.ix/[] operations can perform enlargement when setting a non-existent key for that axis.
Setting in general works by assignment after selection of an item or slice.
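A minimal enlargement sketch (the labels are chosen for illustration):
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s['d'] = 4 #enlarges the Series with a new label
df.loc[:, 'new_col'] = 5 #adds a new column by assignment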
Rename
s.rename(str.upper)
df.rename(columns={'data':'Data', 'quality':'Quality'}, inplace=True)
Delete
del df['two']
df.drop(['a', 'd'], axis=0, inplace=True) #remove rows by label (axis=1 would remove columns)
df.pop('three')
Sort
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B') #replaces the deprecated df.sort(columns='B')
s.sort_values() #replaces the deprecated s.order()
s.nsmallest()
s.nlargest()
Arithmetics, logical operations, descriptive statistics
df.T.dot(df) #matrix multiplication
s1.dot(s1) #dot product of vectors
df[:5].T #transpose
(df+df).equals(df*2) #equals treats NaN-s as equal, as opposed to ==
df.count #Number of non-null observations
df.sum #Sum of values
df.mean #Mean of values
df.mad #Mean absolute deviation
df.median #Arithmetic median of values
df.min #Minimum
df.max #Maximum
df.mode #Mode
df.abs #Absolute Value
df.prod #Product of values
df.std #Unbiased standard deviation
df.var #Unbiased variance
df.sem #Unbiased standard error of the mean
df.skew #Unbiased skewness (3rd moment)
df.kurt #Unbiased kurtosis (4th moment)
df.quantile #Sample quantile (value at %)
df.cumsum #Cumulative sum
df.cumprod #Cumulative product
df.cummax #Cumulative maximum
df.cummin #Cumulative minimum
s.nunique() #number of unique non-null values in a Series
s.value_counts() #computes a histogram of a 1D array of values
Covariance and Correlation ref
s.cov(s2) #covariance between series
df.cov() #covariance matrix
df['a'].corr(df['b'], method='spearman') #correlation
df.rank(axis=1) #rank values across columns, within each row
Moving (rolling) statistics ref
rolling_count #Number of non-null observations
rolling_sum #Sum of values
rolling_mean #Mean of values
etc
rolling_window #Moving window function
ewma #Exponentially weighted (EW) moving average
ewmstd #EW moving standard deviation
etc
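As of pandas 0.18 these are also exposed through the .rolling()/.ewm() method interface; a minimal sketch (the window length is arbitrary):
s.rolling(window=3).mean() #same as the deprecated pd.rolling_mean(s, 3)
s.ewm(span=10).mean() #same as the deprecated pd.ewma(s, span=10)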
Group by (split-apply-combine)
Grouping
grouped = obj.groupby(key)
grouped.get_group('bar')
grouped.describe()
The groups attribute is a dict whose keys are the computed unique groups, with the corresponding values being the axis labels belonging to each group.
Iteration can be done with for name, group in grouped: ...
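A minimal end-to-end sketch (the column names are made up):
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})
grouped = df.groupby('key')
grouped.groups #dict: group label -> row labels
for name, group in grouped:
    print(name, len(group))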
Aggregation
grouped.mean().sort_values(by='C') #“dispatches” method calls to the groups
grouped.aggregate(np.sum)
grouped.size()
Transform
This returns an object that is indexed the same (same size) as the one being grouped.
t = df.groupby('year').transform(lambda x: (x - x.mean()) / x.std())
t.groupby(df['year']).mean() #~0 for each group; transform dropped 'year', so group by the original column
Filter
Returns a subset of the original object.
sf.groupby(sf).filter(lambda x: x.sum() > 2)
dff.groupby('B').filter(lambda x: len(x) > 2)
Function application & iteration
Tablewise Function Application: pipe()
ref:
df.pipe(h).pipe(g, arg1=1).pipe(f, arg2=2, arg3=3) #equivalent to f(g(h(df), arg1=1), arg2=2, arg3=3)
Row or Column-wise Function Application: apply()
ref:
Apply arbitrary functions along the axes of a df or panel:
df.apply(np.mean) #'vertical'
df.apply(np.mean, axis=1) #'horizontal'
df.apply(lambda x: x.max() - x.min())
df.apply(np.exp)
df.apply(pd.Series.interpolate)
Elementwise function application: applymap() and map()
ref:
The function applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value.
df.applymap(lambda x: len(str(x)) )
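The map() counterpart on Series (the mapping dict is made up):
s.map(lambda x: x ** 2) #elementwise on a Series
s.map({'a': 1, 'b': 2}) #a dict mapping is also accepted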
Iteration
for col in df: print(col)
for col, series in df.iteritems(): ...
for row_index, row in df2.iterrows(): ...
for r in df2.itertuples(): ... #iterator yielding a tuple with all elements in each row
Vectorized string methods
ref. Series is equipped with a set of string processing methods. These apply to Indexes (e.g. df.columns) as well.
s.str.lower()
df.columns.str.strip()
df.columns.str.replace(' ','_')
s.str.split('_') #returns lists of substrings
s.str.split('_', expand=True) #returns df
s.str.extract #uses regexps
s.str.replace #uses regexps
s.str.findall #uses regexps
and many other methods ref
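A regexp sketch for these (the pattern is arbitrary):
s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r'([ab])(\d)', expand=True) #df with one column per capture group
s.str.replace(r'\d', '#') #regexp-based replacement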
Combining datasets
df1.combine_first(df2) #missing values in df1 are filled with like-labeled values from df2
I/O
df = pd.read_csv('./data/ex_data1.csv', parse_dates=True, index_col=0) #etc.
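Writing goes through the matching to_* methods; a round-trip sketch (the paths are hypothetical):
df.to_csv('./data/out.csv') #write back to csv
df.to_excel('./data/out.xlsx') #analogous writers exist for other formats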
Code snippets
…
References
- Most of the basic contents are directly taken from the official pandas documentation: http://pandas.pydata.org/pandas-docs/stable/index.html
- Python for Data Analysis