Hi Albert,

I think you are running into this error due to a mismatch of IPC formats.
I'm not 100% sure, but I tried locally and I get a very similar error when
I intentionally mismatch the formats.

It seems the file on GCS is in the IPC File format, and you are trying to
read it with a reader intended for the IPC Stream format (open_stream,
RecordBatchStreamReader).
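
For reference, here's roughly how I reproduced it locally (just a sketch;
the path and data are placeholders):

    import pyarrow as pa

    batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})

    # Write a tiny table using the IPC *File* format writer...
    with pa.OSFile("/tmp/demo.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)

    # ...then try to read it back with the *Stream* reader:
    with pa.OSFile("/tmp/demo.arrow", "rb") as source:
        pa.ipc.open_stream(source)  # raises pyarrow.lib.ArrowInvalid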

Check out the pa.ipc.open_file() function, which returns a
RecordBatchFileReader. You should be able to iterate over / read the file
batch by batch.
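
Something like this should work (a rough sketch, untested against your
bucket; the path is the one from your example):

    import pyarrow as pa
    from pyarrow import fs

    gcs = fs.GcsFileSystem(anonymous=True)

    # open_input_file() gives a random-access (seekable) file, which the
    # File-format reader needs in order to read the footer.
    with gcs.open_input_file("bucket/bigfile.arrow") as source:
        reader = pa.ipc.open_file(source)  # RecordBatchFileReader
        for i in range(reader.num_record_batches):
            batch = reader.get_batch(i)
            # ... process one RecordBatch at a time ...

Since the File format keeps a footer with the offset of every batch,
get_batch(i) should only need to fetch the bytes for that batch (plus the
footer), so you shouldn't have to download the whole file up front.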

Hope this helps,
Lubo

On Wed, Nov 2, 2022 at 2:12 PM Albert Nadal <[email protected]> wrote:

> Hi team. I recently started playing with the Python port of Apache
> Arrow, first to learn how it works and then to use it in our ML platform.
> Currently we need to provide our users a way to upload their datasets to
> our storage platform (GCS and S3 mainly). Once the user has uploaded a
> dataset, we need to download it in order to convert each of its
> records (rows) to a specific format we use internally in our platform
> (protobuf models).
>
> Our main concern is that we want to do this in a performant way, without
> downloading the entire dataset. We are really interested to know
> if it's possible to fetch each RecordBatch of a dataset (Arrow file) stored
> in a GCS bucket via streaming, using for instance the
> RecordBatchStreamReader. I'm not really sure if this is possible without
> downloading the entire dataset first.
>
> I made some small tests with GcsFileSystem, open_input_stream and
> ipc.open_stream.
>
>
>
> gcs = fs.GcsFileSystem(anonymous=True)
> with gcs.open_input_stream("bucket/bigfile.arrow") as source:
>     reader: pa.ipc.RecordBatchStreamReader = pa.ipc.open_stream(source)
>
> I'm not sure if I'm missing some important details here, but in any case
> I always get the same error.
>
> pyarrow.lib.ArrowInvalid: Expected to read 1330795073 metadata bytes, but
> only read 40168302
>
> I hope you can give me some pointers on how we can handle streaming of
> Record Batches from a dataset stored in an external storage filesystem.
>
> Thank you in advance!
>
> Albert,
>
