Thanks Kouhei and Wes for the fast response, much appreciated.

C++ is a bit troublesome for me because of the mismatch between PostgreSQL 
exception handling (setjmp/longjmp) and C++ exception handling (throw/catch) - 
I’m worried that destructors won’t be invoked when PostgreSQL longjmps past 
C++ stack frames while cleaning up after an error.  
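
For anyone following along, here is the pattern I’m considering to keep the two 
error models apart - a rough, untested sketch of my own, not PostgreSQL’s actual 
API (the function and parameter names below are made up for illustration). The 
idea is to catch every C++ exception at the extern "C" boundary and reduce it to 
an error code plus message, so that any later ereport()/longjmp happens with no 
C++ destructors left pending on the stack:

```cpp
#include <cstring>
#include <stdexcept>

// Hypothetical C entry point for the FDW.  All C++ exceptions are caught
// here and converted to a return code plus an error string.  By the time
// control returns to the C caller - where a longjmp-based error report may
// occur - no C++ frames with pending destructors remain on the stack.
extern "C" int fdw_read_parquet(const char* path, char* errbuf, size_t errlen) {
  try {
    // ... parquet::arrow calls would go here; objects created in this
    // scope are destroyed normally if an exception unwinds to the catch ...
    if (path == nullptr || *path == '\0') {
      throw std::invalid_argument("empty path");
    }
    return 0;   // success
  } catch (const std::exception& e) {
    std::strncpy(errbuf, e.what(), errlen - 1);
    errbuf[errlen - 1] = '\0';
    return -1;  // the C caller reports this (e.g. via ereport())
  } catch (...) {
    std::strncpy(errbuf, "unknown C++ exception", errlen - 1);
    errbuf[errlen - 1] = '\0';
    return -1;
  }
}
```

The mirror-image rule would also apply: never call into code that can longjmp 
from inside a C++ scope that owns objects with destructors.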

I’ve found very few examples on the web that demonstrate how to use the Parquet 
C or C++ APIs.  Are you aware of any projects that I might look into to 
understand how to use them?  Any blogs that might be helpful?



                   — Korry


> On Nov 16, 2018, at 8:41 AM, Wes McKinney <[email protected]> wrote:
> 
> That will work, but the size of a single row group could be very large
> 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> 
> This function also appears to have a bug in it. If any column is a
> ChunkedArray after calling ReadRowGroup, then the call to
> TableBatchReader::ReadNext will return only part of the row group
> 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200
> 
> I opened https://issues.apache.org/jira/browse/ARROW-3822
> On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou <[email protected]> wrote:
>> 
>> Hi,
>> 
>> I think that we can use
>> parquet::arrow::FileReader::GetRecordBatchReader()
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
>> for this purpose.
>> 
>> It doesn't read a specified number of rows, but it will read
>> only the rows in each row group at a time.
>> (Do I misunderstand?)
>> 
>> 
>> Thanks,
>> --
>> kou
>> 
>> In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=cfjyh8k...@mail.gmail.com>
>>  "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
>>  Wes McKinney <[email protected]> wrote:
>> 
>>> garrow_record_batch_stream_reader_new() is for reading files that use
>>> the stream IPC protocol described in
>>> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
>>> Parquet files
>>> 
>>> We don't have a streaming reader implemented yet for Parquet files.
>>> The relevant JIRA (a bit thin on detail) is
>>> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
>>> to implement this interface, with the option to read some number of
>>> "rows" at a time:
>>> 
>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
>>> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <[email protected]> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> We haven't implemented the record batch reader feature for Parquet
>>>> in the C API yet. It's easy to implement, so we can provide the
>>>> feature in the next release. Can you open a JIRA issue for
>>>> this feature? You can find "Create" button at
>>>> https://issues.apache.org/jira/projects/ARROW/issues/
>>>> 
>>>> If you can use C++ API, you can use the feature with the
>>>> current release.
>>>> 
>>>> 
>>>> Thanks,
>>>> --
>>>> kou
>>>> 
>>>> In <[email protected]>
>>>>  "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
>>>>  Korry Douglas <[email protected]> wrote:
>>>> 
>>>>> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) 
>>>>> that will let PostgreSQL read Parquet-format files.
>>>>> 
>>>>> I have just a few questions for now:
>>>>> 
>>>>> 1) I have created a few sample Parquet data files using AWS Glue.  Glue 
>>>>> split my CSV input into many (48) smaller xxx.snappy.parquet files, each 
>>>>> about 30MB. When I open one of these files using 
>>>>> gparquet_arrow_file_reader_new_path(), I can then call 
>>>>> gparquet_arrow_file_reader_read_table() (and then access the content of 
>>>>> the table).  However, …_read_table() seems to read the entire file into 
>>>>> memory all at once (I say that based on the amount of time it takes for 
>>>>> gparquet_arrow_file_reader_read_table() to return).   That’s not the 
>>>>> behavior I need.
>>>>> 
>>>>> I have tried to use garrow_memory_mapped_input_stream_new() to open the 
>>>>> file, followed by garrow_record_batch_stream_reader_new().  The call to 
>>>>> garrow_record_batch_stream_reader_new() fails with the message:
>>>>> 
>>>>> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256 
>>>>> metadata bytes, but only read 30284162
>>>>> 
>>>>> Does this error occur because Glue split the input data?  Or because Glue 
>>>>> compressed the data using snappy?  Do I need to uncompress before I can 
>>>>> read/open the file?  Do I need to merge the files before I can open/read 
>>>>> the data?
>>>>> 
>>>>> 2) If I use garrow_record_batch_stream_reader_new() instead of 
>>>>> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of 
>>>>> reading the entire file into memory before I fetch the first row?
>>>>> 
>>>>> 
>>>>> Thanks in advance for help and any advice.
>>>>> 
>>>>> 
>>>>>            ― Korry
