You should be able to use the kernels available in pyarrow.compute to
do this -- there might be a few that are missing, but if you can't
find what you need, please open a Jira issue so it goes into the
backlog.
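
For example, reading the Parquet file into an Arrow Table once and
then filtering it with compute kernels might look something like this
sketch (the file name and k are placeholders; schema as in your
example):

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # placeholder file; columns int64 TIMESTAMP, uint32 ID, int64 VALUE
    table = pq.read_table('data.parquet')

    k = 42  # placeholder ID value
    # typing the scalar explicitly avoids any implicit cast of the column
    mask = pc.equal(table.column('ID'), pa.scalar(k, type=pa.uint32()))
    subtable = table.filter(mask)

    # convert to pandas only when a DataFrame is needed downstream
    subframe = subtable.to_pandas()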

On Wed, Nov 11, 2020 at 11:43 AM Jason Sachs <[email protected]> wrote:
>
> I do a lot of the following operation:
>
>     subframe = df[df['ID'] == k]
>
> where df is a Pandas DataFrame with a small number of columns but a 
> moderately large number of rows (say 200K - 5M). The columns are usually 
> simple; for example's sake, let's call them int64 TIMESTAMP, uint32 ID, 
> and int64 VALUE.
>
> I am moving the source data to Parquet format. I don't really care whether I 
> do this in PyArrow or Pandas, but I need to perform these subframe selections 
> frequently and would like to speed them up. (The idea is to load the data 
> into memory once and then perform subframe selection anywhere from 10 to 
> 1000 times to extract the appropriate data for further processing.)
>
> Is there a suggested method? Any ideas?
>
> I've tried
>
>     subframe = df.query('ID == %d' % k)
>
> and flirted with the idea of using Gandiva as per 
> https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/
> but it looks a bit rough, and I had to manually tweak the types of literal 
> constants to support anything other than float64.
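>
> The type tweak looked something like this sketch (adapted from the 
> blog post's example; toy data standing in for the real table):
>
>     import pyarrow as pa
>     import pyarrow.gandiva as gandiva
>
>     # toy stand-in with the same column types as the real data
>     table = pa.table({'ID': pa.array([1, 2, 1], type=pa.uint32()),
>                       'VALUE': pa.array([10, 20, 30], type=pa.int64())})
>     k = 1
>
>     builder = gandiva.TreeExprBuilder()
>     id_field = builder.make_field(table.schema.field('ID'))
>     # the literal's type has to be given explicitly to match the uint32
>     # column, rather than letting it default to float64
>     id_literal = builder.make_literal(k, pa.uint32())
>     condition = builder.make_condition(
>         builder.make_function('equal', [id_field, id_literal],
>                               pa.bool_()))
>     row_filter = gandiva.make_filter(table.schema, condition)
>     # evaluate() returns a SelectionVector of matching row positions
>     selection = row_filter.evaluate(table.to_batches()[0],
>                                     pa.default_memory_pool())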
