Hello,

I’ve been learning and working with PyArrow recently for a project that stores 
atmospheric science data as a partitioned dataset. Recently, the Dataset class 
used with the fsspec/gcsfs filesystem has started producing a new error. 
Unfortunately, I cannot track down what changed, or whether the error is on my 
end. I’m using PyArrow 7.0.0 and Python 3.8.

If I point it at a single Parquet file, everything is fine, but if I give it 
any of the partition directories, the same error occurs. Any guidance here 
would be appreciated!

The code: 
import gcsfs
import pyarrow
import pyarrow.dataset as ds

# Anonymous (public, read-only) access to the bucket
fs = gcsfs.GCSFileSystem(token="anon")

# Partition keys encoded in the directory names (hive-style key=value)
partitioning = ds.HivePartitioning(
    pyarrow.schema([
        pyarrow.field('year', pyarrow.int32()),
        pyarrow.field('month', pyarrow.int32()),
        pyarrow.field('day', pyarrow.int32()),
        pyarrow.field('hour', pyarrow.int32()),
        pyarrow.field('WMO', pyarrow.string())
    ])
)

# Full schema: data columns followed by the partition columns
schema = pyarrow.schema([
    pyarrow.field('lon', pyarrow.float32()),
    pyarrow.field('lat', pyarrow.float32()),
    pyarrow.field('pres', pyarrow.float32()),
    pyarrow.field('hght', pyarrow.float32()),
    pyarrow.field('gpht', pyarrow.float32()),
    pyarrow.field('tmpc', pyarrow.float32()),
    pyarrow.field('dwpc', pyarrow.float32()),
    pyarrow.field('relh', pyarrow.float32()),
    pyarrow.field('uwin', pyarrow.float32()),
    pyarrow.field('vwin', pyarrow.float32()),
    pyarrow.field('wspd', pyarrow.float32()),
    pyarrow.field('wdir', pyarrow.float32()),
    pyarrow.field('year', pyarrow.int32()),
    pyarrow.field('month', pyarrow.int32()),
    pyarrow.field('day', pyarrow.int32()),
    pyarrow.field('hour', pyarrow.int32()),
    pyarrow.field('WMO', pyarrow.string())
])

data = ds.dataset(
    "global-radiosondes/hires-sonde",
    filesystem=fs,
    format="parquet",
    partitioning=partitioning,
    schema=schema,
)

subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")

# Pass the filter explicitly; otherwise `subset` is never applied
batches = data.to_batches(
    columns=["pres", "gpht", "tmpc", "wspd", "wdir",
             "year", "month", "day", "hour"],
    filter=subset,
    use_threads=True,
)

batches = list(batches)

The error:
    391 from pyarrow import PythonFile
    393 if not self.fs.isfile(path):
--> 394     raise FileNotFoundError(path)
    396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: global-radiosondes/hires-sonde/
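
In case it helps narrow this down: the traceback shows the failure comes from 
an isfile check, and the path in the error has a trailing slash that the path 
I passed to ds.dataset does not. Below is a minimal sketch of how the 
filesystem's own view of that path could be inspected. describe_path is a 
hypothetical helper name, not part of PyArrow or gcsfs; it assumes only the 
generic fsspec methods exists, isfile, and isdir.

```python
def describe_path(fs, path):
    """Report how an fsspec-style filesystem classifies `path`.

    The path is checked both with and without a trailing slash, since
    an object store may treat the two spellings differently.
    """
    report = {}
    stripped = path.rstrip("/")
    for candidate in (stripped, stripped + "/"):
        report[candidate] = {
            "exists": fs.exists(candidate),
            "isfile": fs.isfile(candidate),
            "isdir": fs.isdir(candidate),
        }
    return report
```

With the filesystem from the code above, this would be called as 
describe_path(fs, "global-radiosondes/hires-sonde"). I have not confirmed what 
it reports for this bucket; the point is only to see whether the two spellings 
of the dataset root are classified differently.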
