Fix Python – Efficient way to apply multiple filters to pandas DataFrame or Series

Question

Asked By – durden2.0

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.

  • The filters should be additive (aka each one applied should narrow results).
  • I’m currently using reindex() (as below) but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). I want to avoid this unnecessary copying as it will be really inefficient when filtering a big Series or DataFrame.
  • I’m thinking that using apply(), map(), or something similar might be better. I’m pretty new to Pandas though so still trying to wrap my head around everything.
  • Also, I would like to expand this so that the dictionary passed in can include the columns to operate on and filter an entire DataFrame based on the input dictionary. However, I’m assuming whatever works for a Series can be easily expanded to a DataFrame.

TL;DR

I want to take a dictionary of the following form and apply each operation to a given Series object and return a ‘filtered’ Series object.

relops = {'>=': [1], '<=': [1]}

Long Example

I’ll start with an example of what I have currently and just filtering a single Series object. Below is the function I’m currently using:

   def apply_relops(series, relops):
        """
        Pass dictionary of relational operators to perform on given series object
        """
        for op, vals in relops.iteritems():
            op_func = ops[op]
            for val in vals:
                filtered = op_func(series, val)
                series = series.reindex(series[filtered])
        return series

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
>>> print df
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops ={'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

Again, the ‘problem’ with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.

Now we will see solution for issue: Efficient way to apply multiple filters to pandas DataFrame or Series


Answer

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]: 
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n): 
             return op(x[col],n)

In [15]: def f(x, *b):
             return x[(np.logical_and(*b))]

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11

Update: pandas 0.13 has a query method for these kind of use cases, assuming column names are valid identifiers the following works (and can be more efficient for large frames as it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11

This question is answered By – Andy Hayden

This answer is collected from stackoverflow and reviewed by FixPython community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0