I'm not sure what is meant by "streaming" in this context. My
understanding is that reading a Parquet file requires random access,
because the file metadata (the footer) is stored at the end of the file.
If you are fetching from S3, you could try opening a RandomAccessFile
object with the S3FileSystem
https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.h#L110
and then creating a Parquet file reader from that object. I'm not sure
if this code path has been well tested.
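
To illustrate why random access matters: a Parquet file ends with the
serialized file metadata, then a 4-byte little-endian metadata length,
then the 4-byte magic "PAR1", so a reader has to start from the tail of
the file. Below is a rough sketch in plain C++ (no Arrow calls, and not
tested against S3) of how ranged reads could serve that. RangeFetcher,
FooterLocation and LocateFooter are hypothetical names, and the fetcher
stands in for whatever ranged GET your object store offers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical callback: returns `length` bytes starting at `offset`,
// e.g. backed by an HTTP range request against the object store.
using RangeFetcher =
    std::function<std::string(uint64_t offset, uint64_t length)>;

// Where the serialized file metadata (the footer) sits inside the file.
struct FooterLocation {
  uint64_t offset;  // first byte of the metadata
  uint32_t length;  // metadata size in bytes
};

// Locate the footer from the file layout alone:
// the file ends with <metadata><4-byte little-endian length>"PAR1".
FooterLocation LocateFooter(const RangeFetcher& fetch, uint64_t file_size) {
  constexpr uint64_t kTailSize = 8;  // 4-byte length + 4-byte magic
  // Smallest possible file: leading "PAR1" (4 bytes) plus the 8-byte tail.
  if (file_size < 12) throw std::runtime_error("file too small to be Parquet");

  const std::string tail = fetch(file_size - kTailSize, kTailSize);
  if (tail.size() != kTailSize) throw std::runtime_error("short read");
  if (tail.compare(4, 4, "PAR1") != 0) {
    throw std::runtime_error("missing PAR1 magic");
  }

  // Decode the little-endian uint32 byte by byte, independent of host order.
  uint32_t length = 0;
  for (int i = 3; i >= 0; --i) {
    length = (length << 8) | static_cast<unsigned char>(tail[i]);
  }
  if (length > file_size - 12) {
    throw std::runtime_error("metadata length out of range");
  }

  const uint64_t offset = file_size - kTailSize - length;
  assert(offset >= 4);  // the footer cannot overlap the leading magic
  return FooterLocation{offset, length};
}
```

With the footer bytes fetched, the metadata can be parsed to find the
byte range of each row group, and only those ranges need to be
downloaded. As far as I understand, this is the role that a
RandomAccessFile plays for the Parquet reader.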

On Fri, Nov 1, 2019 at 12:56 AM annsshadow <[email protected]> wrote:

> The arrow::RecordBatchReader needs an arrow::dataset::RecordBatchProjector,
> which needs the Schema. It seems that I can't get the schema first and then
> read the Parquet file as a stream with Arrow.
>
> In my situation, the Parquet file is in an object store like S3. I can fetch
> it from the network slice by slice, in slices of any size, but I can't hold
> the whole file in memory or on disk.
>
> Your reply indicates that the C++ library can't read Parquet in a streaming
> fashion yet, so what should I try next, with Arrow or anything else?
>
> Thank you for your work~~
>
> At 2019-11-01 01:46:32, "Wes McKinney" <[email protected]> wrote:
> >You will want to use the GetRecordBatchReader C++ API here
> >
> >https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L152
> >
> >It may not be optimal for your use case. Support for streaming reads
> >is not yet exposed in Python or other bindings as far as I know.
> >
> >There is work happening in the C++ Datasets project to better support
> >this use case.
> >
> >On Wed, Oct 30, 2019 at 9:28 PM annsshadow <[email protected]> wrote:
> >>
> >>
> >> hi~
> >> I have a question about reading a Parquet file.
> >> The official example reads the whole file from local disk.
> >> I can't hold the whole Parquet file in memory; I can only fetch it
> >> slice by slice from the network. How can I use Arrow to read the
> >> Parquet file?
> >> Thank you~
>
