garrow_record_batch_stream_reader_new() is for reading files that use the
stream IPC protocol described in
https://github.com/apache/arrow/blob/master/format/IPC.md, not for Parquet
files.
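For contrast, the intended use of that stream reader, shown here as a rough
sketch against the C++ API that the GLib bindings wrap (signatures as of the
~0.11-era releases; they may differ in yours), looks like this:

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <arrow/ipc/api.h>

    // Sketch: read record batches from a file written in the Arrow
    // stream IPC format (not a Parquet file).
    arrow::Status ReadIpcStream(const std::string& path) {
      std::shared_ptr<arrow::io::MemoryMappedFile> input;
      ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Open(
          path, arrow::io::FileMode::READ, &input));

      std::shared_ptr<arrow::RecordBatchReader> reader;
      ARROW_RETURN_NOT_OK(
          arrow::ipc::RecordBatchStreamReader::Open(input, &reader));

      std::shared_ptr<arrow::RecordBatch> batch;
      while (true) {
        ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
        if (batch == nullptr) break;  // end of stream
        // process `batch` here
      }
      return arrow::Status::OK();
    }

Pointing this (or the GLib wrapper) at a Parquet file fails in the way
reported below, because the file does not begin with a valid stream-format
metadata message.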
We don't have a streaming reader implemented yet for Parquet files. The
relevant JIRA (a bit thin on detail) is
https://issues.apache.org/jira/browse/ARROW-1012.

To be clear, I mean to implement this interface, with the option to read
some number of "rows" at a time:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166

(Rough sketches of that interface, and of the row-group-at-a-time C++
workaround Kou mentions below, are appended after the quoted thread.)

On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <[email protected]> wrote:
>
> Hi,
>
> We haven't implemented the record batch reader feature for Parquet
> in the C API yet. It's easy to implement, so we can provide the
> feature in the next release. Can you open a JIRA issue for this
> feature? You can find the "Create" button at
> https://issues.apache.org/jira/projects/ARROW/issues/
>
> If you can use the C++ API, you can use the feature with the
> current release.
>
>
> Thanks,
> --
> kou
>
> In <[email protected]>
>   "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
>   Korry Douglas <[email protected]> wrote:
>
> > Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
> > that will let PostgreSQL read Parquet-format files.
> >
> > I have just a few questions for now:
> >
> > 1) I have created a few sample Parquet data files using AWS Glue. Glue
> > split my CSV input into many (48) smaller xxx.snappy.parquet files,
> > each about 30MB. When I open one of these files using
> > gparquet_arrow_file_reader_new_path(), I can then call
> > gparquet_arrow_file_reader_read_table() (and then access the content
> > of the table). However, …_read_table() seems to read the entire file
> > into memory all at once (I say that based on the amount of time it
> > takes for gparquet_arrow_file_reader_read_table() to return). That’s
> > not the behavior I need.
> >
> > I have tried to use garrow_memory_mapped_input_stream_new() to open
> > the file, followed by garrow_record_batch_stream_reader_new(). The
> > call to garrow_record_batch_stream_reader_new() fails with the
> > message:
> >
> >   [record-batch-stream-reader][open]: Invalid: Expected to read
> >   827474256 metadata bytes, but only read 30284162
> >
> > Does this error occur because Glue split the input data? Or because
> > Glue compressed the data using snappy? Do I need to uncompress the
> > data before I can read/open the file? Do I need to merge the files
> > before I can open/read the data?
> >
> > 2) If I use garrow_record_batch_stream_reader_new() instead of
> > gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> > reading the entire file into memory before I fetch the first row?
> >
> >
> > Thanks in advance for any help and advice.
> >
> >
> > ― Korry
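For reference, the RecordBatchReader interface linked above looks roughly
like this (paraphrased from cpp/src/arrow/record_batch.h, not verbatim):

    // Abstract interface for iterating over record batches.
    class RecordBatchReader {
     public:
      virtual ~RecordBatchReader() = default;

      // The shared schema of all batches produced by this reader.
      virtual std::shared_ptr<Schema> schema() const = 0;

      // Read the next batch; sets *batch to nullptr once the stream
      // is exhausted.
      virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch) = 0;
    };

A Parquet-backed implementation would materialize only a bounded slice of
rows per ReadNext() call, which is exactly the behavior asked about in the
quoted message.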

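And on Kou's point that the C++ API already covers this in the current
release: here is a minimal sketch of reading one row group at a time
through parquet::arrow::FileReader (again assuming the ~0.11-era,
Status-returning signatures):

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>

    // Sketch: materialize a Parquet file one row group at a time
    // instead of reading the whole file with ReadTable().
    arrow::Status ReadOneRowGroupAtATime(const std::string& path) {
      std::shared_ptr<arrow::io::ReadableFile> input;
      ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &input));

      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
          input, arrow::default_memory_pool(), &reader));

      for (int i = 0; i < reader->num_row_groups(); ++i) {
        std::shared_ptr<arrow::Table> table;
        ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
        // Process `table`; only this row group is in memory at once.
      }
      return arrow::Status::OK();
    }

The granularity here is whatever row-group size the writer (Glue, in this
case) chose, so it is coarser than the row-count-bounded reader described
above, but it avoids loading the entire file up front.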