Re: New Copy Formats - avro/orc/parquet

2018-02-12 Thread Tom Lane
Magnus Hagander  writes:
> +1. And bonus points if an API can also be defined so such an extension
> parsing also becomes useful to file_fdw automatically (or at least
> optionally).

Hm, well, file_fdw already goes through COPY FROM, so it seems like it'd
almost just work.  "Almost" because there'd need to be a way for it to
support the appropriate options.  So this means that whatever the
mechanism is for extension-specific options, it has to be introspectable
enough for file_fdw.c to understand what to expose as FDW options.

regards, tom lane



Re: New Copy Formats - avro/orc/parquet

2018-02-12 Thread Magnus Hagander
On Sun, Feb 11, 2018 at 11:48 PM, Tom Lane  wrote:

> Andres Freund  writes:
> > So, I think making COPY extensible would be quite beneficial. I'm
> > however quite doubtful that we want to add core code to handle all of
> > the above. I think we should make the COPY input/output formatting
> > extensible by extensions.
>
> +1.  I can't see carrying code for these formats in-core, but I've
> no objection to making it possible for someone else to maintain them.
>

+1. And bonus points if an API can also be defined so such an extension
parsing also becomes useful to file_fdw automatically (or at least
optionally).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Tom Lane
Andres Freund  writes:
> On February 11, 2018 2:48:13 PM PST, Tom Lane  wrote:
>> (Any such patch should manage
>> to turn COPY-CSV into an extension, at least so far as copy.c is
>> concerned, even if we don't package it as one.)

> Yea, I was thinking we should move all three (default, csv, binary)
> supported formats to using such a facility.

Hm, yeah, probably.  That opens up a different dimension of variation:
is the raw data text or binary?  I'm thinking we shouldn't make the format
handlers reinvent e.g. encoding conversion if they don't have to, so it's
likely that some of the infrastructure for text data would remain in core
copy.c.

regards, tom lane



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Andres Freund


On February 11, 2018 2:48:13 PM PST, Tom Lane  wrote:
> (Any such patch should manage
>to turn COPY-CSV into an extension, at least so far as copy.c is
>concerned, even if we don't package it as one.)

Yea, I was thinking we should move all three (default, csv, binary) supported 
formats to using such a facility. I guess we'd just install the respective 
format handlers by default in the respective catalogs, rather than doing so in 
an extension. The handler function should easily be able to return pointers to 
functions in the main binary, not just additional shlibs.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Tom Lane
Andres Freund  writes:
> So, I think making COPY extensible would be quite beneficial. I'm
> however quite doubtful that we want to add core code to handle all of
> the above. I think we should make the COPY input/output formatting
> extensible by extensions.

+1.  I can't see carrying code for these formats in-core, but I've
no objection to making it possible for someone else to maintain them.

> I imagine we'd have callbacks for
> - start copy in / out
> - output row, with a an array of values/nulls
> - parse row, with a input buffer as argument, returning values / nulls arrays
> - finish copy in / out

Also something to allow absorbing format-specific options, if the
precedent of CSV is anything to go by.  (Any such patch should manage
to turn COPY-CSV into an extension, at least so far as copy.c is
concerned, even if we don't package it as one.)

regards, tom lane



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Nicolas Paris
Le 11 févr. 2018 à 22:19, Adrian Klaver écrivait :
> On 02/11/2018 12:57 PM, Nicolas Paris wrote:
> > Le 11 févr. 2018 à 21:53, Andres Freund écrivait :
> > > On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
> > > > I have also the storage and network transfers overhead in mind:
> > > > All those new formats are compressed; this is not true for current
> > > > postgres BINARY format and obviously text based format. By experience,
> > > > the binary format is 10 to 30% larger than the text one. On the
> > > > contrary, an ORC file can be up to 10 times smaller than a text base
> > > > format.
> > > 
> > > That seems largely irrelevant when arguing about using PROGRAM though,
> > > right?
> > > 
> > 
> > Indeed those storage and network transfers are only considered versus
> > CSV/BINARY format. No link with PROGRAM aspect.
> > 
> 
> Just wondering what your time frame is on this? Asking because this would be
> considered a new feature and so would need to be added to a major release of
> Postgres. Currently work is going on for Postgres version 11 to be
> released(just a guess) late Fall 2018/early Winter 2019. The
> CommitFest(https://commitfest.postgresql.org/) for this release is currently
> approximately 3/4 of the way through. Not sure that new code could make it
> in at this point. This means it would be bumped to version 12 for 2019/2020.
> 

Right now, exporting (bilions rows * hundred columns) from postgres to
distributed tools such spark is feasible while beeing based on parsing,
transfers, tooling and workaround overhead.

Waiting until 2020 to get the oportunity to write COPY extensions would
mean using this feature around 2022. I mean, writing the ORC COPY
extension, extending the postgres JDBC driver, extending the spark jdbc
connector, all from different communities: this will be a long process.

But again, posgres would be the most advanced RDBMS because AFAIK not
any DB deal with those distributed format for the moment. Having in mind
that such feature will be released one day, make think the place of
postgres in a datawarehouse architecture accordingly.



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Adrian Klaver

On 02/11/2018 12:57 PM, Nicolas Paris wrote:

Le 11 févr. 2018 à 21:53, Andres Freund écrivait :

On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:

I have also the storage and network transfers overhead in mind:
All those new formats are compressed; this is not true for current
postgres BINARY format and obviously text based format. By experience,
the binary format is 10 to 30% larger than the text one. On the
contrary, an ORC file can be up to 10 times smaller than a text base
format.


That seems largely irrelevant when arguing about using PROGRAM though,
right?



Indeed those storage and network transfers are only considered versus
CSV/BINARY format. No link with PROGRAM aspect.



Just wondering what your time frame is on this? Asking because this 
would be considered a new feature and so would need to be added to a 
major release of Postgres. Currently work is going on for Postgres 
version 11 to be released(just a guess) late Fall 2018/early Winter 
2019. The CommitFest(https://commitfest.postgresql.org/) for this 
release is currently approximately 3/4 of the way through. Not sure that 
new code could make it in at this point. This means it would be bumped 
to version 12 for 2019/2020.



--
Adrian Klaver
adrian.kla...@aklaver.com



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Nicolas Paris
Le 11 févr. 2018 à 21:53, Andres Freund écrivait :
> On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
> > I have also the storage and network transfers overhead in mind:
> > All those new formats are compressed; this is not true for current
> > postgres BINARY format and obviously text based format. By experience,
> > the binary format is 10 to 30% larger than the text one. On the
> > contrary, an ORC file can be up to 10 times smaller than a text base
> > format.
> 
> That seems largely irrelevant when arguing about using PROGRAM though,
> right?
> 

Indeed those storage and network transfers are only considered versus
CSV/BINARY format. No link with PROGRAM aspect.



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Andres Freund
On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
> I have also the storage and network transfers overhead in mind:
> All those new formats are compressed; this is not true for current
> postgres BINARY format and obviously text based format. By experience,
> the binary format is 10 to 30% larger than the text one. On the
> contrary, an ORC file can be up to 10 times smaller than a text base
> format.

That seems largely irrelevant when arguing about using PROGRAM though,
right?

Greetings,

Andres Freund



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Nicolas Paris
Le 11 févr. 2018 à 21:03, Andres Freund écrivait :
> 
> 
> On February 11, 2018 12:00:12 PM PST, Nicolas Paris  
> wrote:
> >> > That is true, but the question is how significant the overhead is.
> >If
> >> > it's 50% then reducing it would make perfect sense. If it's 1% then
> >no
> >> > one if going to be bothered by it.
> >> 
> >> I think it's pretty clear that it's going to be way way much more
> >than
> >> 1%. 
> >
> >Good news but not sure to anderstand why.
> 
> I think you might have misunderstood my reply? I'm saying that going through 
> PROGRAM will have significant overhead. I can't quite make sense of the rest 
> of your reply otherwise?

True, I misunderstood. Then I agree the computation overhead should be
non-negligible.

I have also the storage and network transfers overhead in mind:
All those new formats are compressed; this is not true for current
postgres BINARY format and obviously text based format. By experience,
the binary format is 10 to 30% larger than the text one. On the
contrary, an ORC file can be up to 10 times smaller than a text base
format.



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Andres Freund


On February 11, 2018 12:00:12 PM PST, Nicolas Paris  wrote:
>> > That is true, but the question is how significant the overhead is.
>If
>> > it's 50% then reducing it would make perfect sense. If it's 1% then
>no
>> > one if going to be bothered by it.
>> 
>> I think it's pretty clear that it's going to be way way much more
>than
>> 1%. 
>
>Good news but not sure to anderstand why.

I think you might have misunderstood my reply? I'm saying that going through 
PROGRAM will have significant overhead. I can't quite make sense of the rest of 
your reply otherwise?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Nicolas Paris
> > That is true, but the question is how significant the overhead is. If
> > it's 50% then reducing it would make perfect sense. If it's 1% then no
> > one if going to be bothered by it.
> 
> I think it's pretty clear that it's going to be way way much more than
> 1%. 

Good news but not sure to anderstand why.

> It's trivial to construct cases where input parsing / output
> formatting takes the majority of the time. 

Binary -> ORC
^
|
   PROGRAM parsing/output formating on the fly

> And a lot of that you're going to be able to avoid with binary formats.

Still the above diagram shows both parsing/formating step, isn't it ?






Re: New Copy Formats - avro/orc/parquet

2018-02-11 Thread Andres Freund
Hi,

On 2018-02-10 18:21:37 +0100, Tomas Vondra wrote:
> That is true, but the question is how significant the overhead is. If
> it's 50% then reducing it would make perfect sense. If it's 1% then no
> one if going to be bothered by it.

I think it's pretty clear that it's going to be way way much more than
1%. It's trivial to construct cases where input parsing / output
formatting takes the majority of the time. And a lot of that you're
going to be able to avoid with binary formats.

Greetings,

Andres Freund



Re: New Copy Formats - avro/orc/parquet

2018-02-10 Thread Tomas Vondra


On 02/10/2018 04:30 PM, Nicolas Paris wrote:
>>> I d'found useful to be able to import/export from postgres to those modern 
>>> data
>>> formats:
>>> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
>>> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
>>> - orc (all writers=https://github.com/apache/orc)
>>>
>>> Something like :
>>> COPY table TO STDOUT ORC;
>>>
>>> Would be lovely.
>>>
>>> This would greatly enhance how postgres integrates in big-data ecosystem.
>>>
>>> Any thought ?
>>
>> https://www.postgresql.org/docs/10/static/sql-copy.html
>>
>> "PROGRAM
>>
>> A command to execute. In COPY FROM, the input is read from standard
>> output of the command, and in COPY TO, the output is written to the standard
>> input of the command.
>>
>> Note that the command is invoked by the shell, so if you need to pass
>> any arguments to shell command that come from an untrusted source, you must
>> be careful to strip or escape any special characters that might have a
>> special meaning for the shell. For security reasons, it is best to use a
>> fixed command string, or at least avoid passing any user input in it.
>> "
>>
> 
> PROGRAM would involve overhead of transforming data from CSV or
> BINARY to AVRO for example.
> 
> Here, I am talking about native format exports/imports for
> performance considerations.
> 

That is true, but the question is how significant the overhead is. If
it's 50% then reducing it would make perfect sense. If it's 1% then no
one if going to be bothered by it.

Without these numbers it's hard to make any judgments.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: New Copy Formats - avro/orc/parquet

2018-02-10 Thread Tomas Vondra

On 02/10/2018 04:38 PM, David G. Johnston wrote:
> On Saturday, February 10, 2018, Nicolas Paris  > wrote:
> 
> Hello
> 
> I d'found useful to be able to import/export from postgres to those
> modern data
> formats:
> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html
> )
> - parquet (c++ writer=https://github.com/apache/parquet-cpp
> )
> - orc (all writers=https://github.com/apache/orc
> )
> 
> Something like :
> COPY table TO STDOUT ORC;
> 
> Would be lovely.
> 
> This would greatly enhance how postgres integrates in big-data
> ecosystem.
> 
>  
> It would be written "... with (format 'orc')" and your best bet would be
> to create an extension.  I don't think that having such code in core (or
> contrib) is desirable.
> 

I don't think make this extensible by an extension (i.e. the formats
supported by COPY are hard-coded in core). But I agree that if we are to
add multiple new formats, it'd be nice to allow doing that in extension.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: New Copy Formats - avro/orc/parquet

2018-02-10 Thread David G. Johnston
On Saturday, February 10, 2018, Nicolas Paris  wrote:

> Hello
>
> I d'found useful to be able to import/export from postgres to those modern
> data
> formats:
> - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
> - parquet (c++ writer=https://github.com/apache/parquet-cpp)
> - orc (all writers=https://github.com/apache/orc)
>
> Something like :
> COPY table TO STDOUT ORC;
>
> Would be lovely.
>
> This would greatly enhance how postgres integrates in big-data ecosystem.
>
>
It would be written "... with (format 'orc')" and your best bet would be to
create an extension.  I don't think that having such code in core (or
contrib) is desirable.

David J.


Re: New Copy Formats - avro/orc/parquet

2018-02-10 Thread Nicolas Paris
> > I d'found useful to be able to import/export from postgres to those modern 
> > data
> > formats:
> > - avro (c writer=https://avro.apache.org/docs/1.8.2/api/c/index.html)
> > - parquet (c++ writer=https://github.com/apache/parquet-cpp)
> > - orc (all writers=https://github.com/apache/orc)
> > 
> > Something like :
> > COPY table TO STDOUT ORC;
> > 
> > Would be lovely.
> > 
> > This would greatly enhance how postgres integrates in big-data ecosystem.
> > 
> > Any thought ?
> 
> https://www.postgresql.org/docs/10/static/sql-copy.html
> 
> "PROGRAM
> 
> A command to execute. In COPY FROM, the input is read from standard
> output of the command, and in COPY TO, the output is written to the standard
> input of the command.
> 
> Note that the command is invoked by the shell, so if you need to pass
> any arguments to shell command that come from an untrusted source, you must
> be careful to strip or escape any special characters that might have a
> special meaning for the shell. For security reasons, it is best to use a
> fixed command string, or at least avoid passing any user input in it.
> "
>

PROGRAM would involve overhead of transforming data from CSV or BINARY
to AVRO for example. 

Here, I am talking about native format exports/imports for performance
considerations.