I'm not sure what is meant by "streaming" in this context. My understanding is that reading a Parquet file requires random access, because the reader has to seek to the footer metadata at the end of the file before it can read any column data. With that in mind, you could try obtaining a RandomAccessFile object from S3 using the S3FileSystem https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.h#L110 and then creating a Parquet file reader from that object. I'm not sure whether this code path has been well tested.
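To illustrate why random access matters here, below is a rough sketch (in Python for brevity, and not Arrow's API) of how a reader might locate the footer using only ranged reads. It relies on the Parquet layout for unencrypted files: the file ends with a 4-byte little-endian metadata length followed by the magic bytes "PAR1". The names `locate_footer` and `read_range` are hypothetical; `read_range(offset, length)` stands in for whatever returns a byte slice of the object, for example an S3 ranged GET.

```python
import struct
from typing import Callable, Tuple

PARQUET_MAGIC = b"PAR1"
# The last 8 bytes are a 4-byte little-endian metadata length plus the magic.
FOOTER_TAIL_SIZE = 8


def locate_footer(
    file_size: int, read_range: Callable[[int, int], bytes]
) -> Tuple[int, int]:
    """Return (offset, length) of the footer metadata.

    `read_range(offset, length)` is a hypothetical callback that returns
    `length` bytes starting at `offset`, e.g. via a ranged GET on S3.
    """
    if file_size < len(PARQUET_MAGIC) + FOOTER_TAIL_SIZE:
        raise ValueError("too small to be a Parquet file")

    # First ranged read: only the final 8 bytes of the object.
    tail = read_range(file_size - FOOTER_TAIL_SIZE, FOOTER_TAIL_SIZE)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")

    (metadata_len,) = struct.unpack("<I", tail[:4])
    metadata_offset = file_size - FOOTER_TAIL_SIZE - metadata_len
    if metadata_offset < len(PARQUET_MAGIC):
        raise ValueError("footer length is larger than the file")

    # A second ranged read of (metadata_offset, metadata_len) would fetch the
    # serialized metadata, which holds the schema and row group locations.
    return metadata_offset, metadata_len
```

Only the tail and the footer need to be fetched up front; after that, each row group or column chunk can be pulled with its own ranged read, so the whole file never has to sit in memory or on disk. That is the access pattern a RandomAccessFile gives the Parquet reader.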
On Fri, Nov 1, 2019 at 12:56 AM annsshadow <[email protected]> wrote:

> The arrow::RecordBatchReader needs an arrow::dataset::RecordBatchProjector,
> which needs the Schema. It seems that I can't get the schema first and read
> the streaming parquet with arrow.
>
> In my situation, the parquet file is in an object store like S3. I can get it
> from the network slice by slice with any file size, but can't hold the whole
> file in memory or on disk.
>
> Your reply indicates that the C++ library can't read streaming parquet
> now, so what should I try next with arrow or anything else?
>
> Thank you for your work~~
>
> At 2019-11-01 01:46:32, "Wes McKinney" <[email protected]> wrote:
> >You will want to use the GetRecordBatchReader C++ API here
> >
> >https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L152
> >
> >It may not be optimal for your use case. Support for streaming reads
> >is not yet exposed in Python or other bindings as far as I know.
> >
> >There is work happening in the C++ Datasets project to better support
> >this use case.
> >
> >On Wed, Oct 30, 2019 at 9:28 PM annsshadow <[email protected]> wrote:
> >>
> >> hi~
> >> I have a question about reading parquet files.
> >> The official example reads the whole file from local disk.
> >> Now I can't get the whole parquet file in memory, and can only fetch it
> >> slice by slice from the network, so how can I use arrow to read the
> >> parquet file?
> >> thank you~
