Hi Albert, I think you are running into this error because of a mismatch of IPC formats. I'm not 100% sure, but I tried locally and I get a very similar error when I intentionally mismatch them.
It seems the file on GCS is in the IPC File format, and you are trying to read it with a reader designed for the IPC Stream format (open_stream, RecordBatchStreamReader). Check out the pa.ipc.open_file() function, which returns a RecordBatchFileReader. You should be able to iterate over the file and read it batch by batch.

Hope this helps,
Lubo

On Wed, Nov 2, 2022 at 2:12 PM Albert Nadal <[email protected]> wrote:

> Hi team. I recently started playing with the Python port of the Apache
> Arrow to first learn how it works and then use it in our ML platform.
> Currently we need to provide to our users a way to upload their datasets in
> our storage platform (GCS and S3 mainly). Once the user uploaded a dataset
> then we need to download that in order to properly convert each of its
> records (rows) to a specific format we use internally in our platform
> (protobuffers models).
>
> Our main concern is that we want to achieve that in a performant way by
> avoiding to download the entire dataset. We are really interested to know
> if it's possible to fetch each RecordBatch of a dataset (Arrow file) stored
> in a GCS bucket via streaming by using, for instance, the
> RecordBatchStreamReader. I'm not really sure if this is possible without
> downloading the entire dataset first.
>
> I made some small tests with GcsFileSystem, open_input_stream and
> ipc.open_stream.
>
>     gcs = fs.GcsFileSystem(anonymous=True)
>     with gcs.open_input_stream("bucket/bigfile.arrow") as source:
>         reader: pa.ipc.RecordBatchStreamReader = pa.ipc.open_stream(source)
>
> I'm not sure if I'm missing some important details here but anyways I
> always got the same error.
>
>     pyarrow.lib.ArrowInvalid: Expected to read 1330795073 metadata bytes, but
>     only read 40168302
>
> I hope you can help me with some indications about how we can handle the
> streaming of Record Batches from a dataset stored in an external storage
> filesystem.
>
> Thank you in advance!
>
> Albert,
