FYI, Dremio (github.com/dremio/dremio-oss, dremio.com/download) supports doing a join between Parquet and Postgres using Arrow. It's a product, not a library.
Note, I founded Dremio. If you're wanting to build your own connector/setup, I'm not trying to disrupt that.

On Tue, Dec 4, 2018 at 8:51 PM Wes McKinney <[email protected]> wrote:
> Not off-hand, but you can produce them by using pyarrow -- you can
> create decimal arrays from Python's built-in decimal objects
>
> On Tue, Dec 4, 2018 at 8:04 AM Korry Douglas <[email protected]> wrote:
> >
> > Thanks Wes, I should have asked one more question: can you point me to
> > any sample data files that I can try to read?
> >
> > — Korry
> >
> > > On Dec 3, 2018, at 10:19 PM, Wes McKinney <[email protected]> wrote:
> > >
> > > hi Korry,
> > >
> > > On Mon, Dec 3, 2018 at 3:56 PM Korry Douglas <[email protected]> wrote:
> > >>
> > >> I've been working on this project for a few weeks now and it's going
> > >> well (at least, I think it is).
> > >>
> > >> I'm using the Parquet C++ API. As I mentioned earlier, I have used
> > >> AWS Glue to build some sample files - I can read those files now and
> > >> even make sense of them :-)
> > >>
> > >> Now I'm working on writing large batches to a Parquet file. I can
> > >> read/write a few data types (strings, UUIDs, fixed-length strings,
> > >> booleans), but I'm having trouble with DECIMALs. If I understand
> > >> correctly, I can store a DECIMAL as an INT32, an INT64, or an FLBA
> > >> (source: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).
> > >>
> > >> So a few questions:
> > >>
> > >> 1) Is the decimal position (scale) fixed for a given column? Or can
> > >> I mix scales within the same column? If I can mix them, how do I
> > >> store the actual scale with each value?
> > >
> > > Yes, it's fixed.
> > >
> > >>
> > >> 2) Can anyone point me to an example of how to build a DECIMAL value
> > >> based on an FLBA? Are there any classes that would help me build
> > >> (and then deconstruct) such a value?
> > >
> > > Have a look at the Arrow write paths for decimals under
> > > src/parquet/arrow.
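(For question 2 above: per the parquet-format LogicalTypes spec linked in the thread, a DECIMAL stored in an FLBA is the unscaled integer value written as big-endian two's complement, sign-extended to the column's fixed byte width; the scale is column metadata, not stored per value. A minimal sketch in Python — the function names are mine, not part of any Arrow or Parquet API:)

```python
def encode_decimal_flba(unscaled: int, byte_width: int) -> bytes:
    """Encode the unscaled decimal value as big-endian two's complement,
    sign-extended/padded to the column's fixed byte width."""
    return unscaled.to_bytes(byte_width, byteorder="big", signed=True)

def decode_decimal_flba(raw: bytes) -> int:
    """Recover the unscaled integer from an FLBA payload."""
    return int.from_bytes(raw, byteorder="big", signed=True)

# With scale=2, the value 123.45 has unscaled representation 12345;
# in a 4-byte FLBA that is the big-endian bytes 00 00 30 39.
payload = encode_decimal_flba(12345, 4)
assert decode_decimal_flba(payload) == 12345
```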
> > > If using Arrow directly is not an option, then you could reuse the
> > > ideas from this code.
> > >
> > >>
> > >> Thanks in advance.
> > >>
> > >> — Korry
> > >>
> > >> On Nov 15, 2018, at 12:56 PM, Korry Douglas <[email protected]> wrote:
> > >>
> > >> Hi all, I'm exploring the idea of adding a foreign data wrapper (FDW)
> > >> that will let PostgreSQL read Parquet-format files.
> > >>
> > >> I have just a few questions for now:
> > >>
> > >> 1) I have created a few sample Parquet data files using AWS Glue.
> > >> Glue split my CSV input into many (48) smaller xxx.snappy.parquet
> > >> files, each about 30MB. When I open one of these files using
> > >> gparquet_arrow_file_reader_new_path(), I can then call
> > >> gparquet_arrow_file_reader_read_table() (and then access the content
> > >> of the table). However, ..._read_table() seems to read the entire
> > >> file into memory all at once (I say that based on the amount of time
> > >> it takes for gparquet_arrow_file_reader_read_table() to return).
> > >> That's not the behavior I need.
> > >>
> > >> I have tried to use garrow_memory_mapped_input_stream_new() to open
> > >> the file, followed by garrow_record_batch_stream_reader_new(). The
> > >> call to garrow_record_batch_stream_reader_new() fails with the message:
> > >>
> > >> [record-batch-stream-reader][open]: Invalid: Expected to read
> > >> 827474256 metadata bytes, but only read 30284162
> > >>
> > >> Does this error occur because Glue split the input data? Or because
> > >> Glue compressed the data using snappy? Do I need to uncompress the
> > >> data before I can read/open the file? Do I need to merge the files
> > >> before I can open/read the data?
> > >>
> > >> 2) If I use garrow_record_batch_stream_reader_new() instead of
> > >> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> > >> reading the entire file into memory before I fetch the first row?
> > >>
> > >> Thanks in advance for help and any advice.
> > >>
> > >> — Korry
