hi Korry,

I'm afraid that, beyond opaquely handling fixed-len-byte-array data, the
C++ library's support for decimals is so far limited to what can be
represented with Decimal128 -- there are quite a few Decimal128 functions
available in the codebase. The reason we haven't done more is purely a
function of the needs of the developers involved in the project. Would you
be able to submit some pull requests to add the features you need?
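For the FLBA-to-string direction, as long as the values fit in 38
significant digits (the limit of Arrow's integer-backed arrow::Decimal128,
which is not the IEEE decimal128 format), the conversion is roughly the
following. This is a minimal sketch: Decimal128::FromBigEndian lives in
arrow/util/decimal.h, and its exact signature may differ between releases:

  #include <arrow/util/decimal.h>
  #include <parquet/types.h>
  #include <string>

  // Convert a Parquet DECIMAL stored as a fixed-len byte array (big-endian
  // two's complement) into a string such as "-123.45". 'type_length' is the
  // column's fixed width and 'scale' its declared decimal scale, both taken
  // from the file schema.
  arrow::Status FlbaDecimalToString(const parquet::FixedLenByteArray& value,
                                    int32_t type_length, int32_t scale,
                                    std::string* out) {
    arrow::Decimal128 decimal;
    // Sign-extends the big-endian bytes into a 128-bit integer
    ARROW_RETURN_NOT_OK(
        arrow::Decimal128::FromBigEndian(value.ptr, type_length, &decimal));
    *out = decimal.ToString(scale);  // inserts the decimal point per 'scale'
    return arrow::Status::OK();
  }

For 50-digit values (or for the writing direction), Decimal128 won't do it
-- you would need your own handling of the big-endian two's-complement
bytes, which is part of why contributions here would be welcome.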
- Wes

On Tue, Dec 11, 2018 at 3:53 PM Korry Douglas <[email protected]> wrote:
>
> I’ve managed to create some sample data files using a DataFrame and
> pyarrow - thanks for the hints.
>
> I’m afraid I’m still stuck on DECIMALs stored in an FLBA.
>
> According to
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal,
> I think I should be able to store values of arbitrary (but fixed) length.
> The only examples I can find are for Decimal128 - but a Decimal128 can
> store no more than 34 decimal digits (according to
> https://en.wikipedia.org/wiki/Decimal128_floating-point_format). That’s
> not an arbitrary length.
>
> I’m working in C++, so Java examples don’t get me very far.
>
> My parquet reader uses a parquet::FixedLenByteArrayReader to fetch a value:
>
>   parquet::FixedLenByteArray val;
>
>   result = valReader->ReadBatch(1, &def_values, &rep_values, &val, &rowsRead);
>
> After this, I can look at the bytes pointed to by val.ptr. I can make a
> bit of sense out of those bytes, but I would rather not reverse engineer
> the storage format.
>
> I suppose for now that I should convert the FixedLenByteArray to a string
> and then from string form into the PostgreSQL NUMERIC format (when reading
> the FLBA from a file), and then do the opposite while writing.
>
> Is there a class/function that will convert a DECIMAL FLBA value to a
> string (and another function to convert back again)? Keep in mind that a
> Decimal128 is probably not the right way to go - assume that I need to
> support 50-digit values.
>
> Thanks in advance.
>
> — Korry
>
>
> On Dec 3, 2018, at 10:19 PM, Wes McKinney <[email protected]> wrote:
>
> > hi Korry,
> >
> > On Mon, Dec 3, 2018 at 3:56 PM Korry Douglas <[email protected]> wrote:
> > >
> > > I’ve been working on this project for a few weeks now and it’s going
> > > well (at least, I think it is).
> > >
> > > I’m using the Parquet cpp API. As I mentioned earlier, I have used
> > > AWS Glue to build some sample files - I can read those files now and
> > > even make sense of them :-)
> > >
> > > Now I’m working on writing large batches to a parquet file. I can
> > > read/write a few data types (strings, UUIDs, fixed-length strings,
> > > booleans), but I’m having trouble with DECIMALs. If I understand
> > > correctly, I can store a DECIMAL as an INT32, an INT64, or an FLBA
> > > (source:
> > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).
> > >
> > > So a few questions:
> > >
> > > 1) Is the decimal position (scale) fixed for a given column? Or can I
> > > mix scales within the same column? If I can mix them, how do I store
> > > the actual scale with each value?
> >
> > Yes, it's fixed.
> >
> > > 2) Can anyone point me to an example of how to build a DECIMAL value
> > > based on an FLBA? Are there any classes that would help me build (and
> > > then deconstruct) such a value?
> >
> > Have a look at the Arrow write paths for decimals under
> > src/parquet/arrow. If using Arrow directly is not an option then you
> > could reuse the ideas from this code.
> >
> > > Thanks in advance.
> > >
> > > — Korry
> > >
> > > On Nov 15, 2018, at 12:56 PM, Korry Douglas <[email protected]> wrote:
> > > >
> > > > Hi all, I’m exploring the idea of adding a foreign data wrapper
> > > > (FDW) that will let PostgreSQL read Parquet-format files.
> > > >
> > > > I have just a few questions for now:
> > > >
> > > > 1) I have created a few sample Parquet data files using AWS Glue.
> > > > Glue split my CSV input into many (48) smaller xxx.snappy.parquet
> > > > files, each about 30MB.
> > > > When I open one of these files using
> > > > gparquet_arrow_file_reader_new_path(), I can then call
> > > > gparquet_arrow_file_reader_read_table() (and then access the content
> > > > of the table). However, …_read_table() seems to read the entire file
> > > > into memory all at once (I say that based on the amount of time it
> > > > takes for gparquet_arrow_file_reader_read_table() to return). That’s
> > > > not the behavior I need.
> > > >
> > > > I have tried to use garrow_memory_mapped_input_stream_new() to open
> > > > the file, followed by garrow_record_batch_stream_reader_new(). The
> > > > call to garrow_record_batch_stream_reader_new() fails with the
> > > > message:
> > > >
> > > >   [record-batch-stream-reader][open]: Invalid: Expected to read
> > > >   827474256 metadata bytes, but only read 30284162
> > > >
> > > > Does this error occur because Glue split the input data? Or because
> > > > Glue compressed the data using snappy? Do I need to uncompress the
> > > > data before I can read/open the file? Do I need to merge the files
> > > > before I can open/read the data?
> > > >
> > > > 2) If I use garrow_record_batch_stream_reader_new() instead of
> > > > gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> > > > reading the entire file into memory before I fetch the first row?
> > > >
> > > > Thanks in advance for help and any advice.
> > > >
> > > > — Korry
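
On the older reading questions quoted above: the
garrow_record_batch_stream_reader_new() failure is expected regardless of
Snappy or how Glue split the data -- that reader is for the Arrow IPC
streaming format, not for Parquet files. Each xxx.snappy.parquet file is a
complete, independently readable Parquet file, and Snappy compression is
applied to pages inside the file and undone transparently by the reader, so
no decompressing or merging is needed beforehand. To avoid materializing a
whole file, you can read one row group at a time. A rough sketch against
the C++ API (the GLib functions wrap this same reader; exact signatures may
vary between releases):

  #include <arrow/api.h>
  #include <arrow/io/file.h>
  #include <parquet/arrow/reader.h>

  // Read a Parquet file one row group at a time instead of all at once.
  arrow::Status ReadByRowGroup(const std::string& path) {
    std::shared_ptr<arrow::io::ReadableFile> file;
    ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &file));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
        file, arrow::default_memory_pool(), &reader));

    for (int i = 0; i < reader->num_row_groups(); ++i) {
      std::shared_ptr<arrow::Table> table;
      // Only the pages belonging to row group 'i' are read and decompressed
      ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
      // ... hand the rows in 'table' to PostgreSQL here ...
    }
    return arrow::Status::OK();
  }

Memory use then scales with the row group size rather than the file size.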
