Not off-hand, but you can produce them by using pyarrow -- you can
create decimal arrays from Python's built-in decimal.Decimal objects.
On Tue, Dec 4, 2018 at 8:04 AM Korry Douglas wrote:
>
> Thanks Wes, I should have asked one more question: can you point me to any
> sample data files that I can try to read?
>
> — Korry
>
> > On Dec 3, 2018, at 10:19 PM, Wes McKinney wrote:
> >
> > hi Korry,
> >
> > On Mon, Dec 3, 2018 at 3:56 PM Korry Douglas wrote:
> >>
> >> I’ve been working on this project for a few weeks now and it’s going well
> >> (at least, I think it is).
> >>
> >> I’m using the Parquet cpp API. As I mentioned earlier, I have used AWS
> >> Glue to build some sample files - I can read those files now and even make
> >> sense of them :-)
> >>
> >> Now I’m working on writing large batches to a parquet file. I can
> >> read/write a few data types (strings, UUIDs, fixed-length strings,
> >> booleans), but I’m having trouble with DECIMALs. If I understand
> >> correctly, I can store a DECIMAL as an INT32, an INT64, or an FLBA
> >> (source:
> >> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).
> >>
> >> So a few questions:
> >>
> >> 1) Is the decimal position (scale) fixed for a given column? Or can I mix
> >> scales within the same column? If I can mix them, how do I store the
> >> actual scale with each value?
> >
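> > Yes, the scale is fixed for a given column; it is part of the
> > column's schema, so you cannot mix scales within one column.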
> >
> >>
> >> 2) Can anyone point me to an example of how to build a DECIMAL value based
> >> on an FLBA? Are there any classes that would help me build (and then
> >> deconstruct) such a value?
> >
> > Have a look at the Arrow write paths for decimals under
> > src/parquet/arrow. If using Arrow directly is not an option, then you
> > could reuse the ideas from this code.
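
To illustrate the idea independently of that code: per the DECIMAL
section of LogicalTypes.md, a fixed_len_byte_array decimal holds the
unscaled integer as big-endian two's complement, with the scale taken
from the column schema. Below is a minimal sketch in plain Python, not
the Arrow implementation; the function names are made up for this
example.

```python
from decimal import Decimal


def min_flba_length(precision):
    """Smallest byte length whose signed range holds 10**precision - 1."""
    n = 1
    while 2 ** (8 * n - 1) - 1 < 10 ** precision - 1:
        n += 1
    return n


def decimal_to_flba(value, scale, length):
    """Encode a Decimal as big-endian two's complement of its unscaled value."""
    if not value.is_finite():
        raise ValueError("NaN and infinity cannot be stored as DECIMAL")
    sign, digits, exponent = value.as_tuple()
    coefficient = int("".join(str(d) for d in digits))
    shift = exponent + scale
    if shift >= 0:
        unscaled = coefficient * 10 ** shift
    else:
        # The value has more fractional digits than the column scale allows
        # unless the extra digits are all zero.
        unscaled, remainder = divmod(coefficient, 10 ** -shift)
        if remainder:
            raise ValueError("value does not fit the column scale exactly")
    if sign:
        unscaled = -unscaled
    # Raises OverflowError if the value needs more than `length` bytes.
    return unscaled.to_bytes(length, byteorder="big", signed=True)


def flba_to_decimal(data, scale):
    """Decode big-endian two's complement bytes using the column's scale."""
    unscaled = int.from_bytes(data, byteorder="big", signed=True)
    digits = tuple(int(d) for d in str(abs(unscaled)))
    return Decimal((1 if unscaled < 0 else 0, digits, -scale))


if __name__ == "__main__":
    encoded = decimal_to_flba(Decimal("123.45"), scale=2, length=5)
    print(encoded.hex(), flba_to_decimal(encoded, scale=2))
```

Because the scale lives in the schema rather than in each value, every
value in the column is encoded and decoded with the same scale.
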
> >
> >>
> >> Thanks in advance.
> >>
> >>
> >> — Korry
> >>
> >>
> >>
> >> On Nov 15, 2018, at 12:56 PM, Korry Douglas wrote:
> >>
> >> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW) that
> >> will let PostgreSQL read Parquet-format files.
> >>
> >> I have just a few questions for now:
> >>
> >> 1) I have created a few sample Parquet data files using AWS Glue. Glue
> >> split my CSV input into many (48) smaller xxx.snappy.parquet files, each
> >> about 30MB. When I open one of these files using
> >> gparquet_arrow_file_reader_new_path(), I can then call
> >> gparquet_arrow_file_reader_read_table() (and then access the content of
> >> the table). However, …_read_table() seems to read the entire file into
> >> memory all at once (I say that based on the amount of time it takes for
> >> gparquet_arrow_file_reader_read_table() to return). That’s not the
> >> behavior I need.
> >>
> >> I have tried to use garrow_memory_mapped_input_stream_new() to open the
> >> file, followed by garrow_record_batch_stream_reader_new(). The call to
> >> garrow_record_batch_stream_reader_new() fails with the message:
> >>
> >> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256
> >> metadata bytes, but only read 30284162
> >>
> >> Does this error occur because Glue split the input data? Or because Glue
> >> compressed the data using snappy? Do I need to uncompress before I can
> >> read/open the file? Do I need to merge the files before I can open/read
> >> the data?
> >>
> >>
> >>
> >> 2) If I use garrow_record_batch_stream_reader_new() instead of
> >> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> >> reading the entire file into memory before I fetch the first row?
> >>
> >>
> >> Thanks in advance for help and any advice.
> >>
> >>
> >> — Korry
> >>
> >>
>