Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) that will
let PostgreSQL read Parquet-format files.
I have just a few questions for now:
1) I have created a few sample Parquet data files using AWS Glue. Glue split
my CSV input into 48 smaller xxx.snappy.parquet files of about 30MB each.
When I open one of these files with gparquet_arrow_file_reader_new_path(), I
can then call gparquet_arrow_file_reader_read_table() (and then access the
contents of the table). However, …_read_table() seems to read the entire file
into memory all at once, judging by how long it takes to return. That’s not
the behavior I need.
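For reference, my current read path looks roughly like the sketch below
(trimmed down, with error handling abbreviated):

    #include <parquet-glib/parquet-glib.h>

    static void
    read_whole_file(const gchar *path)
    {
        GError *error = NULL;
        GParquetArrowFileReader *reader;
        GArrowTable *table;

        reader = gparquet_arrow_file_reader_new_path(path, &error);
        if (reader == NULL) {
            g_print("open failed: %s\n", error->message);
            g_error_free(error);
            return;
        }

        /* This is the call that appears to materialize the whole file. */
        table = gparquet_arrow_file_reader_read_table(reader, &error);
        if (table == NULL) {
            g_print("read_table failed: %s\n", error->message);
            g_error_free(error);
            g_object_unref(reader);
            return;
        }

        g_print("rows: %" G_GUINT64_FORMAT "\n",
                garrow_table_get_n_rows(table));

        g_object_unref(table);
        g_object_unref(reader);
    }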
I have tried to use garrow_memory_mapped_input_stream_new() to open the file,
followed by garrow_record_batch_stream_reader_new(). The call to
garrow_record_batch_stream_reader_new() fails with the message:
[record-batch-stream-reader][open]: Invalid: Expected to read 827474256
metadata bytes, but only read 30284162
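Here’s a sketch of what I’m doing in this failing case (again simplified):

    #include <arrow-glib/arrow-glib.h>

    static void
    try_stream_reader(const gchar *path)
    {
        GError *error = NULL;
        GArrowMemoryMappedInputStream *input;
        GArrowRecordBatchStreamReader *reader;

        input = garrow_memory_mapped_input_stream_new(path, &error);
        if (input == NULL) {
            g_print("mmap failed: %s\n", error->message);
            g_error_free(error);
            return;
        }

        /* Fails here with the "Expected to read ... metadata bytes" error. */
        reader = garrow_record_batch_stream_reader_new(
            GARROW_INPUT_STREAM(input), &error);
        if (reader == NULL) {
            g_print("open failed: %s\n", error->message);
            g_error_free(error);
            g_object_unref(input);
            return;
        }

        g_object_unref(reader);
        g_object_unref(input);
    }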
Does this error occur because Glue split the input data? Or because Glue
compressed the data with Snappy? Do I need to decompress the files before I
can open them? Do I need to merge the files before I can read the data?
2) If I use garrow_record_batch_stream_reader_new() instead of
gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading the
entire file into memory before I fetch the first row?
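(For context, the kind of incremental access I’m after would look something
like the sketch below. I’m assuming here that
gparquet_arrow_file_reader_get_n_row_groups() and
gparquet_arrow_file_reader_read_row_group() can read one row group at a time
without pulling in the whole file; I haven’t verified either assumption.)

    #include <parquet-glib/parquet-glib.h>

    static void
    read_by_row_group(const gchar *path)
    {
        GError *error = NULL;
        GParquetArrowFileReader *reader;
        gint n_row_groups, i;

        reader = gparquet_arrow_file_reader_new_path(path, &error);
        if (reader == NULL) {
            g_error_free(error);
            return;
        }

        n_row_groups = gparquet_arrow_file_reader_get_n_row_groups(reader);
        for (i = 0; i < n_row_groups; i++) {
            /* Assumption: NULL/0 selects all columns. */
            GArrowTable *table =
                gparquet_arrow_file_reader_read_row_group(reader, i,
                                                          NULL, 0, &error);
            if (table == NULL) {
                g_error_free(error);
                break;
            }
            /* ...hand one row group's worth of rows to the caller... */
            g_object_unref(table);
        }
        g_object_unref(reader);
    }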
Thanks in advance for any help and advice.
— Korry