FYI, Dremio (github.com/dremio/dremio-oss, dremio.com/download) supports doing a join between Parquet and Postgres using Arrow. It's a product, not a library.
Note, I founded Dremio. If you're wanting to build your own connector/setup, I'm not trying to disrupt that.

On Tue, Dec 4, 2018 at 8:51 PM Wes McKinney <[email protected]> wrote:
> Not off-hand, but you can produce them by using pyarrow -- you can
> create decimal arrays from Python's built-in decimal objects
>
> On Tue, Dec 4, 2018 at 8:04 AM Korry Douglas <[email protected]> wrote:
> >
> > Thanks Wes, I should have asked one more question: can you point me to
> > any sample data files that I can try to read?
> >
> > — Korry
> >
> > > On Dec 3, 2018, at 10:19 PM, Wes McKinney <[email protected]> wrote:
> > >
> > > hi Korry,
> > >
> > > On Mon, Dec 3, 2018 at 3:56 PM Korry Douglas <[email protected]> wrote:
> > >>
> > >> I've been working on this project for a few weeks now and it's going
> > >> well (at least, I think it is).
> > >>
> > >> I'm using the Parquet C++ API. As I mentioned earlier, I have used
> > >> AWS Glue to build some sample files - I can read those files now and
> > >> even make sense of them :-)
> > >>
> > >> Now I'm working on writing large batches to a Parquet file. I can
> > >> read/write a few data types (strings, UUIDs, fixed-length strings,
> > >> booleans), but I'm having trouble with DECIMALs. If I understand
> > >> correctly, I can store a DECIMAL as an INT32, an INT64, or an FLBA
> > >> (source: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).
> > >>
> > >> So a few questions:
> > >>
> > >> 1) Is the decimal position (scale) fixed for a given column? Or can
> > >> I mix scales within the same column? If I can mix them, how do I
> > >> store the actual scale with each value?
> > >
> > > Yes, it's fixed.
> > >
> > >>
> > >> 2) Can anyone point me to an example of how to build a DECIMAL value
> > >> based on an FLBA? Are there any classes that would help me build
> > >> (and then deconstruct) such a value?
> > >
> > > Have a look at the Arrow write paths for decimals under
> > > src/parquet/arrow.
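(For question 2 above: per the parquet-format LogicalTypes spec linked in the thread, a DECIMAL stored in an FLBA is the unscaled integer value written as big-endian two's complement, sign-extended to the column's fixed byte width; the scale is column metadata, not stored per value. A minimal sketch in Python — the function names are mine, not part of any Arrow or Parquet API:)

```python
def encode_decimal_flba(unscaled: int, byte_width: int) -> bytes:
    """Encode the unscaled decimal value as big-endian two's complement,
    sign-extended/padded to the column's fixed byte width."""
    return unscaled.to_bytes(byte_width, byteorder="big", signed=True)

def decode_decimal_flba(raw: bytes) -> int:
    """Recover the unscaled integer from an FLBA payload."""
    return int.from_bytes(raw, byteorder="big", signed=True)

# With scale=2, the value 123.45 has unscaled representation 12345;
# in a 4-byte FLBA that is the big-endian bytes 00 00 30 39.
payload = encode_decimal_flba(12345, 4)
assert decode_decimal_flba(payload) == 12345
```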
> > > If using Arrow directly is not an option, then you could reuse the
> > > ideas from this code.
> > >
> > >>
> > >> Thanks in advance.
> > >>
> > >> — Korry
> > >>
> > >> On Nov 15, 2018, at 12:56 PM, Korry Douglas <[email protected]> wrote:
> > >>
> > >> Hi all, I'm exploring the idea of adding a foreign data wrapper (FDW)
> > >> that will let PostgreSQL read Parquet-format files.
> > >>
> > >> I have just a few questions for now:
> > >>
> > >> 1) I have created a few sample Parquet data files using AWS Glue.
> > >> Glue split my CSV input into many (48) smaller xxx.snappy.parquet
> > >> files, each about 30MB. When I open one of these files using
> > >> gparquet_arrow_file_reader_new_path(), I can then call
> > >> gparquet_arrow_file_reader_read_table() (and then access the content
> > >> of the table). However, ..._read_table() seems to read the entire
> > >> file into memory all at once (I say that based on the amount of time
> > >> it takes for gparquet_arrow_file_reader_read_table() to return).
> > >> That's not the behavior I need.
> > >>
> > >> I have tried to use garrow_memory_mapped_input_stream_new() to open
> > >> the file, followed by garrow_record_batch_stream_reader_new(). The
> > >> call to garrow_record_batch_stream_reader_new() fails with the message:
> > >>
> > >> [record-batch-stream-reader][open]: Invalid: Expected to read
> > >> 827474256 metadata bytes, but only read 30284162
> > >>
> > >> Does this error occur because Glue split the input data? Or because
> > >> Glue compressed the data using snappy? Do I need to uncompress the
> > >> data before I can read/open the file? Do I need to merge the files
> > >> before I can open/read the data?
> > >>
> > >> 2) If I use garrow_record_batch_stream_reader_new() instead of
> > >> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> > >> reading the entire file into memory before I fetch the first row?
> > >>
> > >> Thanks in advance for help and any advice.
> > >>
> > >> — Korry
