On 2020/11/11 18:17:32, Wes McKinney <[email protected]> wrote: 
> You should be able to use the kernels available in pyarrow.compute to
> do this -- there might be a few that are missing, but if you can't
> find what you need please open a Jira issue so it goes into the
> backlog
> 
> On Wed, Nov 11, 2020 at 11:43 AM Jason Sachs <[email protected]> wrote:
> >
> > I do a lot of the following operation:
> >
> >     subframe = df[df['ID'] == k]
> >
> > where df is a Pandas DataFrame with a small number of columns but a 
> > moderately large number of rows (say 200K - 5M). The columns are usually 
> > simple... for example's sake let's call them int64 TIMESTAMP, uint32 ID, 
> > int64 VALUE.
> >
> > I am moving the source data to Parquet format. I don't really care whether 
> > I do this in PyArrow or Pandas, but I need to perform these subframe 
> > selections frequently and would like to speed them up. (The idea being, 
> > load the data into memory once, and then expect to perform subframe 
> > selection anywhere from 10 - 1000 times to extract appropriate data for 
> > further processing.)
> >
> > Is there a suggested method? Any ideas?
> >
> > I've tried
> >
> >     subframe = df.query('ID == %d' % k)
> >
> > and flirted with the idea of using Gandiva as per 
> > https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/
> >  but it looks a bit rough + I had to manually tweak the types of literal 
> > constants to support something other than a float64.
> 

I'm a bit of a beginner in pyarrow, so I have something that kind of works (see 
https://stackoverflow.com/questions/64581590) but doesn't seem to be 
significantly more efficient than pandas filtering; Joris filed 
https://issues.apache.org/jira/browse/ARROW-10423.

Below is an example in Jupyter (DataFrame df0 is a 1.2M-row dataset in which 
some IDs have 200K rows each and others have 195 rows each, which is a good 
mix of dense and sparse content). I'm not sure how to copy+paste a session so 
that input and output are clearly delineated; I've manually added ">>>" 
prompts to help. 
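
(For reproducibility, here is one way to construct a df0 with a similar 
shape; the values are made up and the per-ID row counts only roughly match 
the counts printed below, so treat it as a sketch rather than my actual 
data:)

    import numpy as np
    import pandas as pd

    # Hypothetical stand-in for df0: a few dense IDs (~200K rows each)
    # plus many sparse IDs (195 rows each), ~1.2M rows total.
    dense = np.repeat(np.arange(6, dtype=np.int32), 200_000)
    sparse = np.repeat(np.arange(500, 628, dtype=np.int32), 195)
    index = np.concatenate([dense, sparse])
    n = len(index)
    df0 = pd.DataFrame({
        'timestamp': np.arange(n, dtype=np.int64),  # int64, like the real data
        'index': index,                             # int32 ID column
        'value': np.zeros(n, dtype=np.int64),       # int64 payload
    })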

I guess the good news is that Pandas does pretty well on its own. The whole 
dataset takes about 24.5 MB, so a 10 ms query represents about 2.45 GB/s of 
brute-force processing; I suppose I shouldn't complain too much.
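
(That throughput figure is just the total in-memory size divided by the 
query time:)

    total_bytes = df0.memory_usage().sum()  # ~24.5 MB, see below
    print(total_bytes / 10e-3 / 1e9)        # ~2.45 GB/s for a ~10 ms scan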

>>> import numpy as np
>>> import pyarrow as pa
>>> t = pa.Table.from_pandas(df0)
>>> t
pyarrow.Table
timestamp: int64
index: int32
value: int64
>>> import pyarrow.compute as pc
>>> def select_by_index(table, ival):
...     # Cast the literal to the column's pandas dtype so pc.equal
...     # compares like types, then build a boolean mask and filter.
...     value_index = table.column('index')
...     index_type = value_index.type.to_pandas_dtype()
...     mask = pc.equal(value_index, index_type(ival))
...     return table.filter(mask)
>>> %timeit t2 = select_by_index(t, 515)
2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit t2 = select_by_index(t, 3)
8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df0[df0['index'] == 515]
1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df0[df0['index'] == 3]
10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> print("ALL:%d, 3:%d, 515:%d" % (len(df0),
                                np.count_nonzero(df0['index'] == 3),
                                np.count_nonzero(df0['index'] == 515)))
ALL:1225000, 3:200000, 515:195
>>> df0.memory_usage()
Index            128
timestamp    9800000
index        4900000
value        9800000
dtype: int64
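
Since the original question was about moving the source data to Parquet: 
loading once and then filtering repeatedly works the same way on a table read 
back from disk. A minimal sketch (the file name here is made up):

    import pyarrow.parquet as pq

    pq.write_table(t, 'data.parquet')   # one-time conversion to Parquet
    t = pq.read_table('data.parquet')   # load into memory once
    # ...then filter 10-1000 times, converting results back to pandas:
    df_sub = select_by_index(t, 515).to_pandas()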

