Hi Gary,
I'm not sure I fully understand the use case, but it sounds like this might
be best handled by a join, which doesn't currently exist in Arrow (though I
think the functionality exists in pandas?).
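
For what it's worth, here is a rough sketch of the kind of join I mean, done
entirely in pandas (untested; the column names are just the ones from your
example below):

```
import pyarrow.parquet as pq

# Assumes dataset 2 fits in memory as a single DataFrame.
ds1_df = pq.read_table('dataset1.parquet').to_pandas()
ds2_df = pq.read_table('./path/dataset2/').to_pandas()

# Merge on the equality conditions first, keeping dataset 1's row index
# around so the counts can be written back afterwards.
merged = ds1_df.reset_index().merge(
    ds2_df,
    left_on=['condition1', 'condition2'],
    right_on=['condition1_non_null', 'condition2_with_nulls'],
)

# Then apply the time-range condition to the merged rows.
in_range = merged[
    (merged['timestamp_col'] >= merged['start_date'])
    & (merged['timestamp_col'] <= merged['end_date'])
]

# Count distinct condition1 values per original dataset 1 row and
# align the result back onto ds1_df by index.
counts = in_range.groupby('index')['condition1_non_null'].nunique()
ds1_df['count_condition1_distinct_records'] = counts
```

If the equality keys are reasonably selective, the intermediate merge should
stay manageable; if not, you may need to process dataset 2 in chunks.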

-Micah

On Sun, Sep 6, 2020 at 7:30 AM Gary Clark <gclark...@gmail.com> wrote:

> Hi all,
>
> I am looking for some suggestions on how I can speed up an analytical
> workload I am currently working on. This would really help me prove out a
> use case for Apache Arrow to become our go-to approach for data-intensive
> operations.
>
> I have 2 datasets:
>
> dataset 1 has 1.5 million rows in 1 parquet file
> dataset 2 has 60 million rows in 8 parquet files within a directory
>
> For each row in dataset 1, several columns become the conditions for
> filtering matching records in dataset 2. I am applying these filters when
> reading in dataset 2. Here is my code:
>
> ```
> import pyarrow.parquet as pq
> import time
>
> ds1 = pq.read_table('dataset1.parquet')
>
> ds1_df = ds1.to_pandas().head(5)  # testing performance with a sample
>
> for i, r in ds1_df.iterrows():
>     start = time.time()
>
>     # Conditions from this row of dataset 1 become the parquet read filters.
>     filters = [
>         ('timestamp_col', '>=', r.start_date),
>         ('timestamp_col', '<=', r.end_date),
>         ('condition1_non_null', '=', r.condition1),
>         ('condition2_with_nulls', '=', r.condition2),
>     ]
>
>     # Re-reads (and re-filters) all 8 parquet files on every iteration.
>     ds2 = pq.read_table('./path/dataset2/', filters=filters)
>     ds2_df = ds2.to_pandas()
>
>     # iterrows() yields copies, so write the result back through the
>     # index rather than assigning to r.
>     ds1_df.at[i, 'count_condition1_distinct_records'] = \
>         ds2_df['condition1_non_null'].nunique()
>
>     print(time.time() - start)
> ```
> Is this the fastest approach for doing something like this? I added some
> timestamps, and it takes ~5-6 seconds per row, which extrapolates to
> roughly 90-100 days for all 1.5 million rows.
>
> One thought I had was to load dataset2 into memory once instead of on
> every iteration, but I struggled to work out whether PyArrow can apply the
> filters in memory yet. I also noticed that in the filter conditions I
> couldn't figure out how to match ACTUAL null values (as in condition2,
> which has nulls), so instead I created the parquet files with the NULLs
> encoded as strings.
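>
> Something like the sketch below is what I was imagining (untested, and
> assuming a recent enough pyarrow that Table.filter() and the
> pyarrow.compute comparison kernels are available):
>
> ```
> import pyarrow.compute as pc
> import pyarrow.parquet as pq
>
> # Read dataset 2 into memory once, up front.
> ds2 = pq.read_table('./path/dataset2/')
>
> for i, r in ds1_df.iterrows():
>     # Build a boolean mask in memory instead of re-reading the files.
>     mask = pc.and_(
>         pc.greater_equal(ds2['timestamp_col'], r.start_date),
>         pc.less_equal(ds2['timestamp_col'], r.end_date),
>     )
>     mask = pc.and_(mask, pc.equal(ds2['condition1_non_null'], r.condition1))
>
>     # is_null() matches real nulls, so no string-encoding workaround needed.
>     if r.condition2 is None:
>         cond2 = pc.is_null(ds2['condition2_with_nulls'])
>     else:
>         cond2 = pc.equal(ds2['condition2_with_nulls'], r.condition2)
>     mask = pc.and_(mask, cond2)
>
>     filtered = ds2.filter(mask)
>     count = len(filtered['condition1_non_null'].unique())
> ```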
>
> Any guidance is appreciated. Thanks!
>
> --
> Gary Clark
> *Data Scientist & Data Engineer*
> *B.S. Mechanical Engineering, Howard University '13*
> +1 (717) 798-6916
> gclark...@gmail.com
>
