Hi Kelton,

I can reproduce the same error if I try to load all the data with either
ds.dataset("global-radiosondes/hires-sonde", filesystem=fs) or
pq.ParquetDataset("global-radiosondes/hires-sonde", filesystem=fs,
use_legacy_dataset=False).
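
For reference, a minimal sketch of what I ran (assuming anonymous access to
the same bucket path as in your code):

import gcsfs
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# anonymous access to the public bucket, as in your snippet
fs = gcsfs.GCSFileSystem(token="anon")

# either of these raises the same FileNotFoundError for me
data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs)
data = pq.ParquetDataset("global-radiosondes/hires-sonde", filesystem=fs,
                         use_legacy_dataset=False)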

Could you share the code you use to read a specific parquet file?

Best,
Alenka

On Mon, Feb 21, 2022 at 12:04 AM Kelton Halbert <[email protected]>
wrote:

> Hello,
>
> I’ve recently been learning and working with PyArrow for a project that
> stores some atmospheric science data as a partitioned dataset, and the
> Dataset class with the fsspec/gcsfs filesystem has started producing a
> new error. Unfortunately, I cannot track down what changed, or whether
> the error is on my end. I’m using PyArrow 7.0.0 and Python 3.8.
>
> If I point the dataset at a specific parquet file, everything works fine,
> but if I give it any of the directory partitions, the same error occurs.
> Any guidance here would be appreciated!
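>
> To illustrate, here is a minimal sketch of the call shape that works versus
> the one that fails (the single-file path below is made up, just for
> illustration):
>
> import gcsfs
> import pyarrow.dataset as ds
>
> fs = gcsfs.GCSFileSystem(token="anon")
>
> # reading one specific parquet file is fine (hypothetical path, for
> # illustration only)
> one_file = ds.dataset(
>     "global-radiosondes/hires-sonde/year=2016/month=1/day=1/hour=0/WMO=72451/part-0.parquet",
>     filesystem=fs, format="parquet")
> table = one_file.to_table()
>
> # but pointing at the dataset root (or any partition directory) raises the
> # FileNotFoundError shown below
> broken = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs,
>                     format="parquet")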
>
> The code:
> import gcsfs
> import pyarrow
> import pyarrow.dataset as ds
>
> fs = gcsfs.GCSFileSystem(token="anon")
>
> partitioning = ds.HivePartitioning(
>         pyarrow.schema([
>             pyarrow.field('year', pyarrow.int32()),
>             pyarrow.field('month', pyarrow.int32()),
>             pyarrow.field('day', pyarrow.int32()),
>             pyarrow.field('hour', pyarrow.int32()),
>             pyarrow.field('WMO', pyarrow.string())
>         ])
> )
>
> schema = pyarrow.schema([
>     pyarrow.field('lon', pyarrow.float32()),
>     pyarrow.field('lat', pyarrow.float32()),
>     pyarrow.field('pres', pyarrow.float32()),
>     pyarrow.field('hght', pyarrow.float32()),
>     pyarrow.field('gpht', pyarrow.float32()),
>     pyarrow.field('tmpc', pyarrow.float32()),
>     pyarrow.field('dwpc', pyarrow.float32()),
>     pyarrow.field('relh', pyarrow.float32()),
>     pyarrow.field('uwin', pyarrow.float32()),
>     pyarrow.field('vwin', pyarrow.float32()),
>     pyarrow.field('wspd', pyarrow.float32()),
>     pyarrow.field('wdir', pyarrow.float32()),
>     pyarrow.field('year', pyarrow.int32()),
>     pyarrow.field('month', pyarrow.int32()),
>     pyarrow.field('day', pyarrow.int32()),
>     pyarrow.field('hour', pyarrow.int32()),
>     pyarrow.field('WMO', pyarrow.string())
> ])
>
> data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs,
>                   format="parquet", partitioning=partitioning,
>                   schema=schema)
>
> subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")
>
> batches = data.to_batches(columns=["pres", "gpht", "tmpc", "wspd", "wdir",
>                                    "year", "month", "day", "hour"],
>                           use_threads=True)
>
> batches = list(batches)
>
> The error:
>
>     391 from pyarrow import PythonFile
>     393 if not self.fs.isfile(path):
> --> 394     raise FileNotFoundError(path)
>     396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
>
> FileNotFoundError: global-radiosondes/hires-sonde/