Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) that will
let PostgreSQL read Parquet-format files.
I have just a few questions for now:
1) I have created a few sample Parquet data files using AWS Glue. Glue split
my CSV input into 48 smaller xxx.snappy.parquet files of about 30MB each.
When I open one of these files with gparquet_arrow_file_reader_new_path(), I
can then call gparquet_arrow_file_reader_read_table() (and then access the
contents of the table). However, …_read_table() seems to read the entire file
into memory all at once, judging by how long it takes to return. That’s not
the behavior I need.
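For reference, my current read path looks roughly like the sketch below
(trimmed down, with error handling abbreviated):

    #include <parquet-glib/parquet-glib.h>

    static void
    read_whole_file(const gchar *path)
    {
        GError *error = NULL;
        GParquetArrowFileReader *reader;
        GArrowTable *table;

        reader = gparquet_arrow_file_reader_new_path(path, &error);
        if (reader == NULL) {
            g_print("open failed: %s\n", error->message);
            g_error_free(error);
            return;
        }

        /* This is the call that appears to materialize the whole file. */
        table = gparquet_arrow_file_reader_read_table(reader, &error);
        if (table == NULL) {
            g_print("read_table failed: %s\n", error->message);
            g_error_free(error);
            g_object_unref(reader);
            return;
        }

        g_print("rows: %" G_GUINT64_FORMAT "\n",
                garrow_table_get_n_rows(table));

        g_object_unref(table);
        g_object_unref(reader);
    }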
I have tried to use garrow_memory_mapped_input_stream_new() to open the file,
followed by garrow_record_batch_stream_reader_new(). The call to
garrow_record_batch_stream_reader_new() fails with the message:
[record-batch-stream-reader][open]: Invalid: Expected to read 827474256
metadata bytes, but only read 30284162
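Here’s a sketch of what I’m doing in this failing case (again simplified):

    #include <arrow-glib/arrow-glib.h>

    static void
    try_stream_reader(const gchar *path)
    {
        GError *error = NULL;
        GArrowMemoryMappedInputStream *input;
        GArrowRecordBatchStreamReader *reader;

        input = garrow_memory_mapped_input_stream_new(path, &error);
        if (input == NULL) {
            g_print("mmap failed: %s\n", error->message);
            g_error_free(error);
            return;
        }

        /* Fails here with the "Expected to read ... metadata bytes" error. */
        reader = garrow_record_batch_stream_reader_new(
            GARROW_INPUT_STREAM(input), &error);
        if (reader == NULL) {
            g_print("open failed: %s\n", error->message);
            g_error_free(error);
            g_object_unref(input);
            return;
        }

        g_object_unref(reader);
        g_object_unref(input);
    }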
Does this error occur because Glue split the input data? Or because Glue
compressed the data with Snappy? Do I need to decompress the files before I
can open them? Do I need to merge the files before I can read the data?
2) If I use garrow_record_batch_stream_reader_new() instead of
gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading the
entire file into memory before I fetch the first row?
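(For context, the kind of incremental access I’m after would look something
like the sketch below. I’m assuming here that
gparquet_arrow_file_reader_get_n_row_groups() and
gparquet_arrow_file_reader_read_row_group() can read one row group at a time
without pulling in the whole file; I haven’t verified either assumption.)

    #include <parquet-glib/parquet-glib.h>

    static void
    read_by_row_group(const gchar *path)
    {
        GError *error = NULL;
        GParquetArrowFileReader *reader;
        gint n_row_groups, i;

        reader = gparquet_arrow_file_reader_new_path(path, &error);
        if (reader == NULL) {
            g_error_free(error);
            return;
        }

        n_row_groups = gparquet_arrow_file_reader_get_n_row_groups(reader);
        for (i = 0; i < n_row_groups; i++) {
            /* Assumption: NULL/0 selects all columns. */
            GArrowTable *table =
                gparquet_arrow_file_reader_read_row_group(reader, i,
                                                          NULL, 0, &error);
            if (table == NULL) {
                g_error_free(error);
                break;
            }
            /* ...hand one row group's worth of rows to the caller... */
            g_object_unref(table);
        }
        g_object_unref(reader);
    }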
Thanks in advance for any help and advice.
— Korry