Re: Joining Parquet & PostgreSQL

Korry Douglas Mon, 03 Dec 2018 13:56:49 -0800

I’ve been working on this project for a few weeks now and it’s going well (at 
least, I think it is).

I’m using the Parquet C++ API.  As I mentioned earlier, I have used AWS Glue to 
build some sample files - I can read those files now and even make sense of 
them :-)

Now I’m working on writing large batches to a Parquet file.  I can read/write a 
few data types (strings, UUIDs, fixed-length strings, booleans), but I’m 
having trouble with DECIMALs.  If I understand correctly, I can store a DECIMAL 
as an INT32, an INT64, or an FLBA (FIXED_LEN_BYTE_ARRAY); source: 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
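
A minimal sketch of the storage choice as that section of the spec describes 
it: INT32 holds up to 9 decimal digits of precision, INT64 up to 18, and an 
FLBA is needed beyond that.  The names DecimalStorage, SmallestStorage and 
MinFlbaBytes are hypothetical helpers for illustration, not part of the 
Parquet C++ API.

```cpp
#include <cmath>

// Hypothetical helper types, not part of any Parquet library.
enum class DecimalStorage { Int32, Int64, FixedLenByteArray };

// Smallest physical type that can hold a DECIMAL of the given precision
// (number of decimal digits), following the limits in LogicalTypes.md.
DecimalStorage SmallestStorage(int precision) {
  if (precision <= 9) return DecimalStorage::Int32;
  if (precision <= 18) return DecimalStorage::Int64;
  return DecimalStorage::FixedLenByteArray;
}

// Minimum FLBA width in bytes for a signed two's-complement value with
// `precision` decimal digits: need 8*n - 1 >= precision * log2(10).
int MinFlbaBytes(int precision) {
  return static_cast<int>(std::ceil((precision * std::log2(10.0) + 1.0) / 8.0));
}
```

For example, MinFlbaBytes(38) gives 16, so a DECIMAL(38, s) column would need 
a 16-byte FLBA under this reading.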

So a few questions:

1) Is the decimal position (scale) fixed for a given column?  Or can I mix 
scales within the same column?  If I can mix them, how do I store the actual 
scale with each value?

2) Can anyone point me to an example of how to build a DECIMAL value based on 
an FLBA?  Are there any classes that would help me build (and then 
deconstruct) such a value?
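
For context, the linked spec describes an FLBA-backed DECIMAL as the unscaled 
integer value stored in big-endian two's complement, with precision and scale 
recorded once in the column's schema rather than with each value.  A minimal 
sketch of that byte layout, assuming the unscaled value fits in an int64; 
EncodeDecimalFlba and DecodeDecimalFlba are hypothetical names, not Parquet 
C++ classes:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Write `unscaled` as big-endian two's complement into `width` bytes.
// The caller must pick a width large enough for the value; if width < 8
// and the value does not fit, the high-order bytes are silently dropped.
std::vector<uint8_t> EncodeDecimalFlba(int64_t unscaled, std::size_t width) {
  std::vector<uint8_t> out(width);
  const uint64_t bits = static_cast<uint64_t>(unscaled);
  const uint8_t fill = unscaled < 0 ? 0xFF : 0x00;  // sign extension byte
  for (std::size_t i = 0; i < width; ++i) {
    // i counts bytes from the least significant end.
    const uint8_t b = i < 8 ? static_cast<uint8_t>(bits >> (8 * i)) : fill;
    out[width - 1 - i] = b;
  }
  return out;
}

// Read a big-endian two's-complement value back into an int64.
// Assumes the stored value fits in 64 bits (always true for <= 8 bytes).
int64_t DecodeDecimalFlba(const std::vector<uint8_t>& bytes) {
  // Start from all ones if the sign bit is set, so short inputs sign-extend.
  uint64_t bits = (!bytes.empty() && (bytes[0] & 0x80)) ? ~uint64_t{0} : 0;
  for (uint8_t b : bytes) bits = (bits << 8) | b;
  return static_cast<int64_t>(bits);
}
```

With a column scale of 2, the value 123.45 has unscaled value 12345, so 
EncodeDecimalFlba(12345, 4) yields the bytes 00 00 30 39, and -123.45 yields 
FF FF CF C7.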

Thanks in advance.

               — Korry

> On Nov 15, 2018, at 12:56 PM, Korry Douglas <[email protected]> wrote:
> 
> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) that 
> will let PostgreSQL read Parquet-format files.
> 
> I have just a few questions for now:
> 
> 1) I have created a few sample Parquet data files using AWS Glue.  Glue split 
> my CSV input into many (48) smaller xxx.snappy.parquet files, each about 
> 30MB. When I open one of these files using 
> gparquet_arrow_file_reader_new_path(), I can then call 
> gparquet_arrow_file_reader_read_table() (and then access the content of the 
> table).  However, …_read_table() seems to read the entire file into memory 
> all at once (I say that based on the amount of time it takes for 
> gparquet_arrow_file_reader_read_table() to return).   That’s not the behavior 
> I need.
> 
> I have tried to use garrow_memory_mapped_input_stream_new() to open the 
> file, followed by garrow_record_batch_stream_reader_new().  The call to 
> garrow_record_batch_stream_reader_new() fails with the message:
> 
> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256 
> metadata bytes, but only read 30284162
> 
> Does this error occur because Glue split the input data?  Or because Glue 
> compressed the data using snappy?  Do I need to uncompress before I can 
> read/open the file?  Do I need to merge the files before I can open/read the 
> data?
>  
> 2) If I use garrow_record_batch_stream_reader_new() instead of 
> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading 
> the entire file into memory before I fetch the first row?
> 
> 
> Thanks in advance for help and any advice.  
> 
> 
>             — Korry
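
A note on the error message quoted above: the value 827474256 is exactly what 
the four ASCII bytes "PAR1" (the magic number at the start of a Parquet file) 
give when read as a little-endian 32-bit integer.  That is consistent with 
the record batch stream reader treating the first bytes of a Parquet file as 
a length prefix, rather than with a problem caused by splitting or snappy 
compression.  A minimal sketch of the arithmetic; LittleEndianU32 is a 
hypothetical helper, not an Arrow function:

```cpp
#include <cstdint>
#include <string>

// Interpret the first four bytes of `bytes` as a little-endian uint32.
// Assumes bytes.size() >= 4.
uint32_t LittleEndianU32(const std::string& bytes) {
  uint32_t value = 0;
  for (int i = 3; i >= 0; --i) {
    value = (value << 8) | static_cast<uint8_t>(bytes[i]);
  }
  return value;
}
```

LittleEndianU32("PAR1") returns 827474256 (0x31524150), the byte count named 
in the error.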
