Hello - I’m not sure if this is a bug or if I’m not using the API correctly,
but I have a partitioned Parquet dataset stored in a Google Cloud Storage
bucket that I am attempting to load for analysis. However, when I apply
filters to the dataset (using both the pyarrow.dataset and
pyarrow.parquet.ParquetDataset APIs), I get back empty tables and data frames.

Here is my sample code:

import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem()
# Partition columns come from the directory layout
# (year/month/day/hour/site); filter on the first four levels.
data = pq.ParquetDataset("global-radiosondes/hires_sonde", filesystem=fs,
                         partitioning=["year", "month", "day", "hour", "site"],
                         use_legacy_dataset=False,
                         filters=[
                            ('year', '=', '2022'),
                            ('month', '=', '01'),
                            ('day', '=', '09'),
                            ('hour', '=', '12')])
table = data.read(columns=["pres", "hght"])
df = table.to_pandas()
print(df)

This produces the following output:
Empty DataFrame
Columns: [pres, hght]
Index: []


Am I applying this incorrectly somehow? Any help would be appreciated. The
same issue occurs when loading with the pyarrow.dataset API as well (see the
sketch below). The data bucket is public, so feel free to experiment. If I
load the whole dataset into a pandas data frame without any filters, it works
fine, so the issue seems to be the filtering itself.
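
For reference, here is roughly the equivalent pyarrow.dataset call I tried (a
minimal sketch; the filter is written as a pyarrow.compute expression instead
of the tuple syntax):

import gcsfs
import pyarrow.compute as pc
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem()
# Same bucket and partition layout as above
dataset = ds.dataset("global-radiosondes/hires_sonde", filesystem=fs,
                     format="parquet",
                     partitioning=["year", "month", "day", "hour", "site"])
# Project to the two columns and filter on the partition fields
table = dataset.to_table(
    columns=["pres", "hght"],
    filter=(pc.field("year") == "2022")
           & (pc.field("month") == "01")
           & (pc.field("day") == "09")
           & (pc.field("hour") == "12"))
print(table.to_pandas())

This also comes back empty.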

Thanks,
Kelton.
