I’ve managed to create some sample data files using a DataFrame and pyarrow - thanks for the hints.
I’m afraid I’m still stuck on DECIMALs stored in an FLBA.

According to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal, I think I should be able to store values of arbitrary (but fixed) length. The only examples I can find are for Decimal128 - but a Decimal128 can store no more than 34 decimal digits (according to https://en.wikipedia.org/wiki/Decimal128_floating-point_format). That’s not an arbitrary length.

I’m working in C++ so Java examples don’t get me very far.

My parquet reader uses a parquet::FixedLenByteArrayReader to fetch a value:

    parquet::FixedLenByteArray val;

    result = valReader->ReadBatch(1, &def_values, &rep_values, &val, &rowsRead);

After this, I can look at the bytes pointed to by val.ptr. I can make a bit of sense out of those bytes, but I would rather not reverse engineer the storage format.

I suppose for now that I should convert the FixedLenByteArray to a string and then from string form into the PostgreSQL NUMERIC format (when reading the FLBA from a file), and then do the opposite while writing.

Is there a class/function that will convert a DECIMAL FLBA value to a string (and another function to convert back again)? Keep in mind that a Decimal128 is probably not the right way to go - assume that I need to support 50-digit values.

Thanks in advance.

— Korry

> On Dec 3, 2018, at 10:19 PM, Wes McKinney <[email protected]> wrote:
>
> hi Korry,
>
> On Mon, Dec 3, 2018 at 3:56 PM Korry Douglas <[email protected]> wrote:
>>
>> I’ve been working on this project for a few weeks now and it’s going well
>> (at least, I think it is).
>>
>> I’m using the Parquet cpp API. As I mentioned earlier, I have used AWS Glue
>> to build some sample files - I can read those files now and even make sense
>> of them :-)
>>
>> Now I’m working on writing large batches to a parquet file. I can
>> read/write a few data types (strings, UUIDs, fixed-length strings,
>> booleans), but I’m having trouble with DECIMALs. If I understand correctly,
>> I can store a DECIMAL as an INT32, an INT64, or an FLBA (source:
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).
>>
>> So a few questions:
>>
>> 1) Is the decimal position (scale) fixed for a given column? Or can I mix
>> scales within the same column? If I can mix them, how do I store the actual
>> scale with each value?
>
> Yes, it's fixed
>
>>
>> 2) Can anyone point me to an example of how to build a DECIMAL value based
>> on an FLBA? Are there any classes that would help me build (and then
>> deconstruct) such a value?
>
> Have a look at the Arrow write paths for decimals under
> src/parquet/arrow. If using Arrow directly is not an option then you
> could reuse the ideas from this code
>
>>
>> Thanks in advance.
>>
>>
>> — Korry
>>
>>
>>
>> On Nov 15, 2018, at 12:56 PM, Korry Douglas <[email protected]> wrote:
>>
>> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) that
>> will let PostgreSQL read Parquet-format files.
>>
>> I have just a few questions for now:
>>
>> 1) I have created a few sample Parquet data files using AWS Glue. Glue
>> split my CSV input into many (48) smaller xxx.snappy.parquet files, each
>> about 30MB. When I open one of these files using
>> gparquet_arrow_file_reader_new_path(), I can then call
>> gparquet_arrow_file_reader_read_table() (and then access the content of the
>> table). However, …_read_table() seems to read the entire file into memory
>> all at once (I say that based on the amount of time it takes for
>> gparquet_arrow_file_reader_read_table() to return). That’s not the
>> behavior I need.
>>
>> I have tried to use garrow_memory_mapped_input_stream_new() to open the
>> file, followed by garrow_record_batch_stream_reader_new(). The call to
>> garrow_record_batch_stream_reader_new() fails with the message:
>>
>> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256
>> metadata bytes, but only read 30284162
>>
>> Does this error occur because Glue split the input data? Or because Glue
>> compressed the data using snappy? Do I need to uncompress before I can
>> read/open the file? Do I need to merge the files before I can open/read the
>> data?
>>
>>
>>
>> 2) If I use garrow_record_batch_stream_reader_new() instead of
>> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of reading
>> the entire file into memory before I fetch the first row?
>>
>>
>> Thanks in advance for help and any advice.
>>
>>
>> — Korry
>>
>>
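
For reference, here is a minimal sketch of the FLBA-to-string direction of the conversion asked about at the top of this message. It assumes the layout described in the DECIMAL section of LogicalTypes.md: the FLBA holds the unscaled value as a big-endian two's-complement integer, and the column's scale (assumed non-negative) gives the number of digits after the decimal point. The name FlbaDecimalToString is hypothetical and is not part of parquet-cpp or Arrow; the byte width is not limited to 16, so 50-digit values should fit as long as the column's type length is large enough.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not a parquet-cpp API. Converts `len` bytes holding a
// big-endian two's-complement unscaled value into a decimal string with
// `scale` digits after the decimal point (scale is assumed to be >= 0).
std::string FlbaDecimalToString(const uint8_t* data, int len, int scale) {
  if (data == nullptr || len <= 0) return "0";

  // Work on a copy so the caller's buffer is left untouched.
  std::vector<uint8_t> mag(data, data + len);

  // The sign lives in the top bit of the first (most significant) byte.
  const bool negative = (mag[0] & 0x80) != 0;
  if (negative) {
    // Two's-complement negate (invert, then add one) to get the magnitude.
    int carry = 1;
    for (int i = len - 1; i >= 0; --i) {
      const int v = static_cast<uint8_t>(~mag[i]) + carry;
      mag[i] = static_cast<uint8_t>(v & 0xFF);
      carry = v >> 8;
    }
  }

  // Repeatedly divide the base-256 magnitude by 10; each remainder is the
  // next decimal digit, least significant first.
  std::string digits;
  std::size_t start = 0;
  while (start < mag.size()) {
    int rem = 0;
    for (std::size_t i = start; i < mag.size(); ++i) {
      const int cur = rem * 256 + mag[i];
      mag[i] = static_cast<uint8_t>(cur / 10);
      rem = cur % 10;
    }
    digits.push_back(static_cast<char>('0' + rem));
    // Skip bytes that have become zero so the loop ends at quotient zero.
    while (start < mag.size() && mag[start] == 0) ++start;
  }

  // Pad so there is at least one digit in front of the decimal point.
  while (scale > 0 && digits.size() <= static_cast<std::size_t>(scale)) {
    digits.push_back('0');
  }
  std::reverse(digits.begin(), digits.end());
  if (scale > 0) digits.insert(digits.size() - scale, 1, '.');

  return negative ? "-" + digits : digits;
}
```

With the reader shown earlier, the call would look something like FlbaDecimalToString(val.ptr, type_length, scale), where type_length and scale are taken from the column's descriptor. The reverse direction (string to FLBA) would parse the digits, multiply-and-add into a byte buffer of the column's fixed length, and negate for negative values; it is left out here to keep the sketch short.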
