Hi, I think that we can use parquet::arrow::FileReader::GetRecordBatchReader() https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175 for this purpose.
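
For example, something like the following rough C++ sketch (untested; the helper name ReadByRowGroup is made up, and the exact signatures may differ between Arrow releases) reads one row group's worth of rows per ReadNext() call instead of materializing the whole table:

  // Rough sketch (untested): read a Parquet file one row group at a time
  // via parquet::arrow::FileReader::GetRecordBatchReader().
  #include <arrow/api.h>
  #include <arrow/io/file.h>
  #include <parquet/arrow/reader.h>

  #include <iostream>
  #include <memory>
  #include <vector>

  // Hypothetical helper; the out-parameter style matches the 0.11-era API,
  // newer releases may use slightly different signatures.
  arrow::Status ReadByRowGroup(const std::string& path) {
    std::shared_ptr<arrow::io::ReadableFile> input;
    ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &input));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(
        parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

    // Ask for all row groups; each ReadNext() yields at most one row group's
    // rows, so the whole file is never loaded at once.
    std::vector<int> row_groups(reader->num_row_groups());
    for (int i = 0; i < reader->num_row_groups(); ++i) {
      row_groups[i] = i;
    }

    std::shared_ptr<arrow::RecordBatchReader> batch_reader;
    ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
      if (batch == nullptr) {
        break;  // end of stream
      }
      std::cout << "read " << batch->num_rows() << " rows" << std::endl;
    }
    return arrow::Status::OK();
  }

With this approach, memory use should be bounded by a single row group rather than by the whole file.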
GetRecordBatchReader() doesn't read a specified number of rows, but it reads only the rows in each row group. (Do I misunderstand?)

Thanks,
--
kou

In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=cfjyh8k...@mail.gmail.com>
  "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
  Wes McKinney <[email protected]> wrote:

> garrow_record_batch_stream_reader_new() is for reading files that use
> the stream IPC protocol described in
> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> Parquet files.
>
> We don't have a streaming reader implemented yet for Parquet files.
> The relevant JIRA (a bit thin on detail) is
> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> to implement this interface, with the option to read some number of
> "rows" at a time:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
>
> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <[email protected]> wrote:
>>
>> Hi,
>>
>> We didn't implement the record batch reader feature for Parquet
>> in the C API yet. It's easy to implement, so we can provide the
>> feature in the next release. Can you open a JIRA issue for
>> this feature? You can find the "Create" button at
>> https://issues.apache.org/jira/projects/ARROW/issues/
>>
>> If you can use the C++ API, you can use the feature with the
>> current release.
>>
>> Thanks,
>> --
>> kou
>>
>> In <[email protected]>
>>   "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
>>   Korry Douglas <[email protected]> wrote:
>>
>> > Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
>> > that will let PostgreSQL read Parquet-format files.
>> >
>> > I have just a few questions for now:
>> >
>> > 1) I have created a few sample Parquet data files using AWS Glue. Glue
>> > split my CSV input into many (48) smaller xxx.snappy.parquet files, each
>> > about 30MB. When I open one of these files using
>> > gparquet_arrow_file_reader_new_path(), I can then call
>> > gparquet_arrow_file_reader_read_table() (and then access the content of
>> > the table). However, …_read_table() seems to read the entire file into
>> > memory all at once (I say that based on the amount of time it takes for
>> > gparquet_arrow_file_reader_read_table() to return). That’s not the
>> > behavior I need.
>> >
>> > I have tried to use garrow_memory_mapped_input_stream_new() to open the
>> > file, followed by garrow_record_batch_stream_reader_new(). The call to
>> > garrow_record_batch_stream_reader_new() fails with the message:
>> >
>> > [record-batch-stream-reader][open]: Invalid: Expected to read 827474256
>> > metadata bytes, but only read 30284162
>> >
>> > Does this error occur because Glue split the input data? Or because Glue
>> > compressed the data using snappy? Do I need to uncompress before I can
>> > read/open the file? Do I need to merge the files before I can open/read
>> > the data?
>> >
>> > 2) If I use garrow_record_batch_stream_reader_new() instead of
>> > gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
>> > reading the entire file into memory before I fetch the first row?
>> >
>> > Thanks in advance for help and any advice.
>> >
>> > ― Korry
