Re: [HACKERS] COPY Transform support

2008-04-08 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Is there a big demand for multiple datasets on the wire in a situation 
> like this? How about if we allow multiple COPY targets but at most one 
> from STDIN, at least for one go round?

That's exactly what I was saying (or at least trying to imply) as the
fallback position.  But you still need a way to enforce that.

regards, tom lane



Re: [HACKERS] COPY Transform support

2008-04-08 Thread Andrew Dunstan



Tom Lane wrote:
> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> Could we make each COPY target behave like an SRF, stashing its data in
>> a tuplestore?
>
> The first question is what is the wire-protocol definition.  In
> particular, how would the client know what order to send the COPY
> datasets in, if a single query might include multiple COPY FROM STDIN
> segments?
>
> Another point is that we surely don't want the implementation to force
> use of a tuplestore all the time, so I'm not sure I buy that we can
> prevent interleaving of multiple datasets on the wire that way.

Is there a big demand for multiple datasets on the wire in a situation 
like this? How about if we allow multiple COPY targets but at most one 
from STDIN, at least for one go round?


cheers

andrew



Re: [HACKERS] COPY Transform support

2008-04-08 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> (One of the issues that'd have to be addressed to allow a table source
>> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
>> single query.  If so, how does it work; if not, how do we prevent it?)

> I don't see why it shouldn't work. I see that copy.c now looks like it's 
> reentrant, unlike the bad days of old. Could we make each COPY target 
> behave like an SRF, stashing its data in a tuplestore?

The first question is what is the wire-protocol definition.  In
particular, how would the client know what order to send the COPY
datasets in, if a single query might include multiple COPY FROM STDIN
segments?

Another point is that we surely don't want the implementation to force
use of a tuplestore all the time, so I'm not sure I buy that we can
prevent interleaving of multiple datasets on the wire that way.

regards, tom lane



Re: [HACKERS] COPY Transform support

2008-04-08 Thread Andrew Dunstan



Tom Lane wrote:
> Dimitri Fontaine <[EMAIL PROTECTED]> writes:
>> On Tuesday 08 April 2008, Tom Lane wrote:
>>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>>> table source within SELECT, no?
>>
>> Well, yes, the table source has text as datatypes and the select
>> expression on the column will call whatever function/cast etc to do the
>> work. But that opens the door to second class citizen storage solution
>> for PostgreSQL, by letting the user CREATE VIEW atop of it:
>
> [ shrug... ]  I don't actually have a problem with that.  If we did want
> to prohibit that, we'd have to somehow prohibit SRFs from reading files,
> because you can do it today with any untrusted PL.
>
> I note also that presumably COPY FROM 'file' would still be restricted
> to superusers, and only COPY FROM STDIN would be available to those
> without a license to shoot themselves in the foot.  So the opportunity
> to do anything view-like would be restricted to adults(?) anyhow.

Yeah, maybe. I will suspend my doubts for now.

> (One of the issues that'd have to be addressed to allow a table source
> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
> single query.  If so, how does it work; if not, how do we prevent it?)

I don't see why it shouldn't work. I see that copy.c now looks like it's
reentrant, unlike the bad days of old. Could we make each COPY target
behave like an SRF, stashing its data in a tuplestore?


cheers

andrew



Re: [HACKERS] COPY Transform support

2008-04-08 Thread Tom Lane
Dimitri Fontaine <[EMAIL PROTECTED]> writes:
> On Tuesday 08 April 2008, Tom Lane wrote:
>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>> table source within SELECT, no?

> Well, yes, the table source has text as datatypes and the select
> expression on the column will call whatever function/cast etc to do the
> work. But that opens the door to second class citizen storage solution
> for PostgreSQL, by letting the user CREATE VIEW atop of it:

[ shrug... ]  I don't actually have a problem with that.  If we did want
to prohibit that, we'd have to somehow prohibit SRFs from reading files,
because you can do it today with any untrusted PL.

I note also that presumably COPY FROM 'file' would still be restricted
to superusers, and only COPY FROM STDIN would be available to those
without a license to shoot themselves in the foot.  So the opportunity
to do anything view-like would be restricted to adults(?) anyhow.

(One of the issues that'd have to be addressed to allow a table source
syntax is whether it's sane to allow multiple COPY FROM STDIN in a
single query.  If so, how does it work; if not, how do we prevent it?)

regards, tom lane



Re: [HACKERS] COPY Transform support

2008-04-08 Thread Dimitri Fontaine
On Tuesday 08 April 2008, Tom Lane wrote:
> Dimitri Fontaine <[EMAIL PROTECTED]> writes:
> > And my main concern would still be left as-is, COPY wouldn't have any
> > facility to cope with data representation not matching what datatype
> > input functions want to read.
>
> That's sufficiently covered by the proposal to allow a COPY FROM as a
> table source within SELECT, no?

Well, yes, the table source has text as datatypes and the select expression on 
the column will call whatever function/cast etc to do the work. But that 
opens the door to second class citizen storage solution for PostgreSQL, by 
letting the user CREATE VIEW atop of it:

  CREATE VIEW csv_storage AS
SELECT a, myfunc(b)::timestamp, c::int+3
  FROM (COPY ... FROM '/tmp/file.csv' ...) AS x(a, b, c)
 WHERE c ~ '^[0-9]+$';

What happens to the view when /tmp/file.csv is changed (new lines appended, or 
complete rewrite by another application, etc)?

Andrew's comment is clear about it: he does not want PostgreSQL to offer this
kind of support. I suppose it would be possible to stop CREATE VIEW from
accepting just any form of SELECT, but I was hoping my idea would regain some
attractiveness at this point :)

At least I tried ;)
-- 
dim




Re: [HACKERS] COPY Transform support

2008-04-07 Thread Tom Lane
Dimitri Fontaine <[EMAIL PROTECTED]> writes:
> And my main concern would still be left as-is, COPY wouldn't have any
> facility to cope with data representation not matching what datatype
> input functions want to read.

That's sufficiently covered by the proposal to allow a COPY FROM as a
table source within SELECT, no?

regards, tom lane



Re: [HACKERS] COPY Transform support

2008-04-07 Thread Dimitri Fontaine
On Monday 07 April 2008 21:04:26, Andrew Dunstan wrote:
> Quite apart from any other reason why not, this would be a horrid hack
> and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is
> designed as a bulk load/unload facility. It's fragile enough in that role.

And my main concern would still be left as-is: COPY wouldn't have any facility
to cope with data representations not matching what datatype input functions
want to read. More often than not, you get this kind of data from another
system, produced in a deterministic way, and you want a chance to transform
its representation just before PostgreSQL parses it.

May I insist once more on my second syntax proposal:

  COPY destination_table(col1, col2, col3) 
 USING (field1, field2 || field3, myfunc(field4, field5))
  FROM 'file.txt'
  WITH ...

This takes care of the case I'm concerned with, AND the other cases pointed
out by several posters on this thread too: N input fields, M stored columns,
any line-to-row transformation (after the same column splitting as of now),
and any column pre-parsing through SQL-callable functions --- myfunc is
called before col3's datatype input function is fed, for example.
And no support (that I see) for an optional second-storage-system back door.
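
To make it concrete (illustrative only: the USING clause is the proposed
syntax, the table and field names are made up, to_timestamp() is the existing
built-in), the mysqldump timestamp case would read:

  COPY mytable(id, created_at)
     -- field2 arrives as e.g. '20041002152952' in the dump file
     USING (field1, to_timestamp(field2, 'YYYYMMDDHH24MISS'))
  FROM 'mysqldump.txt'
  WITH ...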

Comments?
-- 
dim




Re: [HACKERS] COPY Transform support

2008-04-07 Thread Andrew Dunstan



Decibel! wrote:
> On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
>> Several years ago Bruce and I discussed the then theoretical use of a
>> SELECT query as the source for COPY TO, and we agreed that the sane
>> analog would be to have an INSERT query as the target of COPY FROM.
>>
>> This idea seems to take that rather further. If doable I think it
>> would be cool, as long as people don't try using it as an alternative
>> storage engine. I can just imagine people creating views over such
>> SELECT statements ...
>
> Why not? There's certainly cases where doing just that could be very
> valuable. Storing older information that you're less likely to query
> comes to mind... in those cases you're going to be seqscanning anyway,
> so being able to read off a compact on-disk form is likely to be a win
> performance-wise. It could certainly be a win storage-wise.
>
> If someone wants to look at syntax options, I'm pretty certain that
> Oracle supports this. IIRC you actually create what appears to the
> database to be a real table, except for restrictions on what you can
> actually do with it (for example, IIRC it's read-only).

You're serious aren't you?

Quite apart from any other reason why not, this would be a horrid hack 
and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is 
designed as a bulk load/unload facility. It's fragile enough in that role.


If we really want to support an alternative storage engine then we 
should tackle that front on and not via a back door like this.


cheers

andrew



Re: [HACKERS] COPY Transform support

2008-04-07 Thread Decibel!

On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
> Several years ago Bruce and I discussed the then theoretical use of
> a SELECT query as the source for COPY TO, and we agreed that the
> sane analog would be to have an INSERT query as the target of COPY
> FROM.
>
> This idea seems to take that rather further. If doable I think it
> would be cool, as long as people don't try using it as an
> alternative storage engine. I can just imagine people creating
> views over such SELECT statements ...



Why not? There's certainly cases where doing just that could be very  
valuable. Storing older information that you're less likely to query  
comes to mind... in those cases you're going to be seqscanning  
anyway, so being able to read off a compact on-disk form is likely to  
be a win performance-wise. It could certainly be a win storage-wise.


If someone wants to look at syntax options, I'm pretty certain that  
Oracle supports this. IIRC you actually create what appears to the  
database to be a real table, except for restrictions on what you can  
actually do with it (for example, IIRC it's read-only).

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [HACKERS] COPY Transform support

2008-04-04 Thread Sam Mason
On Thu, Apr 03, 2008 at 09:38:42PM -0400, Tom Lane wrote:
> Sam Mason <[EMAIL PROTECTED]> writes:
> > On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
> >> I liked the idea of allowing COPY FROM to act as a table source in a
> >> larger SELECT or INSERT...SELECT.  Not at all sure what would be
> >> involved to implement that, but it seems a lot more flexible than
> >> any other approach.
> 
> > I'm not sure why new syntax is needed, what's wrong with having a simple
> > set of procedures like:
> >  readtsv(filename TEXT) AS SETOF RECORD
> 
> Yeah, I was thinking about that too.  The main stumbling block is that
> you need to somehow expose all of COPY's options for parsing an input
> line (CSV vs default mode, quote and delimiter characters, etc).

Guess why I chose a nice simple example!

> It's surely doable but it might be pretty ugly compared to bespoke
> syntax.

Yes, that's an easy way to get it looking pretty.

As an alternative solution, how about having some datatype that stores
these parameters.  E.g:

  CREATE TYPE copyoptions (
    delimiter TEXT CHECK (delimiter <> ''),
    nullstr   TEXT,
    hasheader BOOLEAN,
    quote     TEXT,
    escape    TEXT
  );

And have the input_function understand the current PG syntax for COPY
options.  You'd then be able to do:

  copyfrom('dummy.csv',$$ DELIMITER ';' CSV HEADER $$)

And the procedure would be able to pull out what it wanted from the
options.
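
To sketch its use (copyfrom() being the hypothetical procedure above; the
table and column names are made up for the example):

  -- pull two text fields out of the file, casting and filtering in SQL
  INSERT INTO target (i, j)
SELECT i::integer, j
  FROM copyfrom('dummy.csv', $$ DELIMITER ';' CSV HEADER $$)
   AS t(i text, j text)
 WHERE j <> '';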

> Another thing is that nodeFunctionScan.c is not really designed for
> enormous function result sets --- it dumps the results into a tuplestore
> whether that's needed or not.  This is a performance bug that we ought
> to address anyway, but we'd really have to fix it if we want to approach
> the COPY problem this way.  Just sayin'.

So you'd end up with something resembling a coroutine?  When would it
be good to actually dump everything into a tuplestore as it does at the
moment?

It'll be fun to see how much code breaks because it relies on the
current behaviour of a SRF running to completion without other activity
happening between!


  Sam



Re: [HACKERS] COPY Transform support

2008-04-03 Thread Tom Lane
Sam Mason <[EMAIL PROTECTED]> writes:
> On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
>> I liked the idea of allowing COPY FROM to act as a table source in a
>> larger SELECT or INSERT...SELECT.  Not at all sure what would be
>> involved to implement that, but it seems a lot more flexible than
>> any other approach.

> I'm not sure why new syntax is needed, what's wrong with having a simple
> set of procedures like:
>  readtsv(filename TEXT) AS SETOF RECORD

Yeah, I was thinking about that too.  The main stumbling block is that
you need to somehow expose all of COPY's options for parsing an input
line (CSV vs default mode, quote and delimiter characters, etc).
It's surely doable but it might be pretty ugly compared to bespoke
syntax.

Another thing is that nodeFunctionScan.c is not really designed for
enormous function result sets --- it dumps the results into a tuplestore
whether that's needed or not.  This is a performance bug that we ought
to address anyway, but we'd really have to fix it if we want to approach
the COPY problem this way.  Just sayin'.

regards, tom lane



Re: [HACKERS] COPY Transform support

2008-04-03 Thread Sam Mason
On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
> Gregory Stark <[EMAIL PROTECTED]> writes:
> > AFAIK the state of the art is actually to load the data into a table which
> > closely matches the source material, sometimes just columns of text. Then
> > copy it all to another table doing transformations. Not impressed.
> 
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT.  Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.

I'm not sure why new syntax is needed, what's wrong with having a simple
set of procedures like:

 readtsv(filename TEXT) AS SETOF RECORD

You'd then be free to do whatever "transformations" you wanted:

  INSERT INTO table (i,j)
SELECT i, MIN(j::INTEGER)
FROM readtsv('file.dat') x(i INTEGER, j TEXT)
WHERE j ~ '^[0-9]+$'
GROUP BY i;

You could even have a readlines(filename) procedure that just gives you
back a SETOF TEXT and you can do the parsing yourself.  An associated
regexp split to RECORD would be nice then.
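
A rough sketch of that shape (readlines() being the hypothetical procedure,
regexp_split_to_array() the existing 8.3 built-in; table and column names are
invented):

  -- split each raw line on tabs, then cast/filter the pieces in SQL
  INSERT INTO mytable (i, j)
SELECT f[1]::INTEGER, f[2]
  FROM (SELECT regexp_split_to_array(line, E'\t') AS f
  FROM readlines('file.dat') AS l(line)) AS x
 WHERE f[2] ~ '^[0-9]+$';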


  Sam



Re: [HACKERS] COPY Transform support

2008-04-03 Thread Andrew Dunstan



Tom Lane wrote:
> Gregory Stark <[EMAIL PROTECTED]> writes:
>> AFAIK the state of the art is actually to load the data into a table which
>> closely matches the source material, sometimes just columns of text. Then
>> copy it all to another table doing transformations. Not impressed.
>
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT.  Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.


Several years ago Bruce and I discussed the then theoretical use of a 
SELECT query as the source for COPY TO, and we agreed that the sane 
analog would be to have an INSERT query as the target of COPY FROM.


This idea seems to take that rather further. If doable I think it would 
be cool, as long as people don't try using it as an alternative storage 
engine. I can just imagine people creating views over such SELECT 
statements ...


cheers

andrew



Re: [HACKERS] COPY Transform support

2008-04-03 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> AFAIK the state of the art is actually to load the data into a table which
> closely matches the source material, sometimes just columns of text. Then copy
> it all to another table doing transformations. Not impressed.

I liked the idea of allowing COPY FROM to act as a table source in a
larger SELECT or INSERT...SELECT.  Not at all sure what would be
involved to implement that, but it seems a lot more flexible than
any other approach.

regards, tom lane



Re: [HACKERS] COPY Transform support

2008-04-03 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes:

> Dimitri Fontaine <[EMAIL PROTECTED]> writes:
>> Here's a proposal for COPY to support the T part of an ETL, that is
>> adding the capability for COPY FROM to Transform the data it gets.
>
>> The idea is quite simple: adding to COPY FROM the option to run a
>> function on the data before to call datatype_in functions.
>
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.
>
> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything.  The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc.  What
> would make sense in my mind is a single function taking and returning
> text, which is invoked once on each complete input line before it is
> broken into fields.

I think not having to deal with separating fields is actually one of the few
reasons to do this within COPY. If you can separate them out yourself or need to do
something more clever than COPY is capable of to split the columns then you're
better off preprocessing it with perl or something anyways.

To that end all the other use cases you describe could be handled with his
plan. There's nothing stopping you from doing

CREATE READER foo (a integer, b integer)
INSERT INTO b (SELECT a+b FROM foo);
or
INSERT INTO b (SELECT 1, a, b, greatest(a,b) FROM foo)

However I'm not sure we even need new syntax for CREATE READER. I would think
something like this would make more sense:

CREATE FUNCTION transform(integer, integer) RETURNS SETOF b;

COPY b FROM 'foo' USING transform(integer,integer);

> So the whole thing seems just marginally attractive to me.

Everything about ETL is only marginally attractive, but it's something people
spend a lot of time doing. Nobody's come up with any particularly clean
solutions I think.

AFAIK the state of the art is actually to load the data into a table which
closely matches the source material, sometimes just columns of text. Then copy
it all to another table doing transformations. Not impressed.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [HACKERS] COPY Transform support

2008-04-03 Thread Dimitri Fontaine
On Thursday 03 April 2008, Tom Lane wrote:
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.

Well, when COLUMN x CONVERT USING (or whatever syntax we choose) is not used,
we default to the current code path, that is we do not mess with the data
content at all before considering it valid input syntax for the target
table's datatypes.

And the syntax check is done only once, before beginning to read the data 
lines from the file.

> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything.  The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc.

Yes, this is another need, but one actually better solved, in my opinion, by
loading data into a (temp) loadtable and then processing it with SQL:
  INSERT INTO destination_table SELECT whatever FROM loadtable;

The problem I'm trying to solve is not this one; I'm trying to have COPY able
to load data into a table when the representation we have in the file does
not match what the datatype input functions expect.

An example might help us talking about the same thing. mysqldump CSV outputs 
timestamp sometimes (depending on server version) as '20041002152952' when 
PostgreSQL expects '2004-10-02 15:29:52'. I'd like COPY to be able to cope 
with this situation.
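
(The conversion itself is easy once the value reaches the SQL level --- an
illustrative helper, hypothetical name, built on the existing to_timestamp():

  -- '20041002152952' -> timestamp '2004-10-02 15:29:52'
  CREATE FUNCTION mysql_ts(text) RETURNS timestamp AS $$
SELECT to_timestamp($1, 'YYYYMMDDHH24MISS')::timestamp;
  $$ LANGUAGE sql STABLE STRICT;

The missing piece is a way to have COPY call something like it before the
timestamp input function runs.)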

Now, another syntax proposal could solve both needs. We basically need to be
able to transform input fields and process them into input columns, in such a
way that N input fields (found in the data file) will get us M input columns:

  COPY destination_table(col1, col2, col3, col4) 
 USING (field1, field2 || field3, myfunc(field4, field5))
  FROM 'file.txt'
  WITH ...

This could be better than preprocessing, then COPY, then INSERT INTO ...
SELECT, because we don't need a temp table (no need to care about its name
being unique, nor to mess with temp_buffers), etc.
You're the one able to tell why it'll be better to have one COPY command
instead of a two-step load through a staging table, I'm just guessing ;)

And if it's better for the user to preprocess in perl and then COPY, he still
has the option.
-- 
dim




Re: [HACKERS] COPY Transform support

2008-04-03 Thread PFC

>> INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
>> '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the borken
>> records);
>
> What do we gain against current way of doing it, which is:
>   COPY loadtable FROM 'dump.txt' WITH ...
>   INSERT INTO destination_table(...) SELECT ... FROM loadtable;


	You read and write the data only once instead of twice (faster) if you  
want to import all of it.
	If you just want to compute some aggregates and store the results in a  
table, you just read the data once and don't write it at all.


	The advantages are the same as with your proposed transformations to COPY,
except I feel this way of doing it opens more options (like, you can
combine columns, check FKs at load, do queries on data without loading it,
don't necessarily have to insert the data in a table, don't have to invent
a new syntax to express the transformations, etc).









Re: [HACKERS] COPY Transform support

2008-04-03 Thread Csaba Nagy
On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
> CREATE FLATFILE READER mydump (
>   id  INTEGER,
>   date    TEXT,
>   ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc)
> ;

Very cool idea, but why would you need to create a reader object
first? You should be able to use COPY directly with the target table
being omitted, meaning the copy will not pump its result into the target
but be equivalent to a select... and use it in any place where a select
can be used. This would have absolutely no new syntax, just the rules
changed...

Now that I had a second look, you actually need the field definitions to
meaningfully interpret the file, but then why not use a record
specification instead of the table in the normal COPY command? I'm not
sure if there's any existing syntax for that, but I would guess yes...

In any case, such a feature would help a lot in processing input files
based also on other existing data in the DB.

Cheers,
Csaba.





Re: [HACKERS] COPY Transform support

2008-04-03 Thread PFC

On Thu, 03 Apr 2008 16:57:53 +0200, Csaba Nagy <[EMAIL PROTECTED]> wrote:
> On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
>> CREATE FLATFILE READER mydump (
>>     id      INTEGER,
>>     date    TEXT,
>>     ...
>> ) FROM file 'dump.txt'
>> (followed by delimiter specification syntax identical to COPY, etc)
>> ;
>
> Very cool idea, but why would you need to create a reader object
> first? You should be able to use COPY directly with the target table
> being omitted, meaning the copy will not pump its result into the target
> but be equivalent to a select... and use it in any place where a select
> can be used. This would have absolutely no new syntax, just the rules
> changed...
>
> Now that I had a second look you actually need the field definitions to
> meaningfully interpret the file,

Yeah, you need to tell Postgres the field names, types, and NULLness
before it can parse them... or else it's just a plain flat text file which
makes no sense...

> but then why not use a record
> specification instead of the table in the normal COPY command? I'm not
> sure if there's any existing syntax for that but I would guess yes...

Hm, yeah, that's even simpler, just create a type for the row (or just
use table%ROWTYPE if you have a table that fits the description), and tell
COPY to parse according to the row type definition... smart...

Like :

CREATE TYPE import_rowtype AS (id INTEGER, date TEXT);
INSERT INTO mytable (id, date, ...)
  SELECT id, NULLIF( date, '0000-00-00' )::DATE
  FROM (COPY AS import_rowtype FROM 'mysql_trash.txt') AS foo
  WHERE (FKs check and drop the borken records);

Looks clean...

Obviously, in this case (and also in my proposal's case) you must use
COPY and not \copy since it is the database server which will be reading
the file.
This could probably be hacked so the client sends the file via the \copy
interface, too...

> In any case, such a feature would help a lot in processing input files
> based also on other existing data in the DB.

Yeah, it would be cool.
Also, since COPY TO can use a SELECT as a data source, you could use
postgres to read from a file/pipe, process data, and write to a file/pipe
(kinda better than sed, lol)
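
(For reference, the TO direction already works like that since 8.2: COPY
accepts a query as its source, e.g.

  -- accesses is just a placeholder table for the example
  COPY (SELECT ip, count(*) FROM accesses GROUP BY ip) TO STDOUT WITH CSV;

so the read-from-a-file half is the missing piece.)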




Re: [HACKERS] COPY Transform support

2008-04-03 Thread Dimitri Fontaine
On Thursday 03 April 2008, PFC wrote:
> CREATE FLATFILE READER mydump (
>     id      INTEGER,
>     date    TEXT,
>     ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc)
> ;
[...]
> INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
> '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the borken
> records);

What do we gain against current way of doing it, which is:
  COPY loadtable FROM 'dump.txt' WITH ...
  INSERT INTO destination_table(...) SELECT ... FROM loadtable; 

-- 
dim




Re: [HACKERS] COPY Transform support

2008-04-03 Thread Tom Lane
Dimitri Fontaine <[EMAIL PROTECTED]> writes:
> Here's a proposal for COPY to support the T part of an ETL, that is
> adding the capability for COPY FROM to Transform the data it gets.

> The idea is quite simple: adding to COPY FROM the option to run a function on 
> the data before to call datatype_in functions.

The major concern I have about this is to ensure that no detectable
overhead is added to COPY when the feature isn't being used.

I am not actually convinced that the column-by-column design you seem to
have in mind is worth anything.  The examples that I remember seeing
often involve removing columns, generating one column from multiple ones
or vice versa, dealing with nonstandard column delimiters, etc.  What
would make sense in my mind is a single function taking and returning
text, which is invoked once on each complete input line before it is
broken into fields.
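
(Purely to illustrate the shape --- not an existing hook --- such a transform
would be an ordinary text-to-text function applied to each raw line, for
example turning a nonstandard delimiter into a tab:

  -- hypothetical per-line fixup, called before the line is split
  CREATE FUNCTION fix_line(text) RETURNS text AS $$
SELECT replace($1, ';', E'\t');
  $$ LANGUAGE sql IMMUTABLE STRICT;

invoked once per input line before COPY splits it into fields.)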

This is, of course, just a substitute for running a sed or perl or
similar script over the data before feeding it to COPY --- and probably
not an amazingly good substitute at that.  For instance, assuming you
like perl for text-wrangling, I'd fully expect the function approach
to be slower than an external script because of the large overhead of
getting into and out of libperl for each line.

In situations where it's actually useful to apply SQL functions rather
than text-mangling operations to the data, you always have the option to
COPY into a temp table and then do INSERT/SELECT from there.

So the whole thing seems just marginally attractive to me.

regards, tom lane



Re: [HACKERS] COPY Transform support

2008-04-03 Thread PFC


> Data transformation while doing a data load is a requirement now and
> then. Considering that users will have to do mass updates *after* the
> load completes to mend the data to their liking should be reason enough
> to do this while the loading is happening. I think to go about it the
> right way we should support the following:
>
> * The ability to provide per-column transformation expressions
> * The ability to use any kind of expressions while doing the
>   transformation. The transformation expression should be any expression
>   (basically ExecEvalExpr) that can be evaluated to give a resulting
>   value and obviously a corresponding is_null value too. It should and
>   could be system in-built functions (e.g. UPPER, TRIM, TO_CHAR,
>   TO_NUMBER etc.) or user defined functions too
> * The transformation expression can refer to other columns involved in
>   the load. So that when the current row is extracted from the input
>   file, the current values should be used to generate the new resultant
>   values before doing a heap_form_tuple. E.g.
>   (col1 transform "col1 + 10", col2 transform "col1 * col2", col3
>   transform "UPPER(col1 || col3)",...)
>
> I have spent some thoughts on how to do this and will be happy to share
> the same if the list is interested. Personally, I think data
> transformation using such expressions is a pretty powerful and important
> activity while doing the data load itself.



	Well, since COPY is about as fast as INSERT INTO ... SELECT plus the  
parsing overhead, I suggest adding a special SELECT form that can read from
a file instead of a table, which returns tuples, and which therefore can  
be used and abused to the user's liking. This is a much more powerful  
feature because :


- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged

EXAMPLE :

	Suppose I want to import a MySQL dump file (gasp !) which obviously  
contains lots of crap like 0000-00-00 dates, '' instead of NULL, borken
foreign keys, etc.


Let's have a new command :

CREATE FLATFILE READER mydump (
id  INTEGER,
date    TEXT,
...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc)
;

	This command would create a set-returning function which is basically a  
wrapper around the existing parser in COPY.
	Column definition gives a name and type to the fields in the text file,  
and tells the parser what to expect and what to return.
	It looks like a table definition, and this is actually pretty normal : it  
is, after all, very close to a table.


	INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date, '0000-00-00'
), ... FROM mydump WHERE (FKs check and drop the borken records);


	Now I can import data and transform it at will using a simple SELECT. The  
advantage is that everybody will know what to do without learning a new  
command, no awkward syntax (transform...), you can combine columns in  
expressions, JOIN to check FKs, use ORDER to get a clustered table,
anything you want, without any extension to the Postgres engine besides  
the creation of this file-parsing set-returning function, which should be  
pretty simple.


	Or, if I have a few gigabytes of logs, but I am absolutely not interested  
in inserting them into a table, instead I want to make some statistics, or  
perhaps I want to insert into my table some aggregate computation from  
this data, I would just :


CREATE FLATFILE READER accesses_dump (
date    TEXT,
ip  INET,
...
) FROM file 'web_server_log.txt';

And I can do some stats without even loading the data :

	SELECT ip, count(*) FROM accesses_dump GROUP BY ip HAVING count(*) > 1000
ORDER BY count(*);


	Much better than having to load those gigabytes just to make a query on  
them...


Re: [HACKERS] COPY Transform support

2008-04-03 Thread NikhilS
Hi,


On Thu, Apr 3, 2008 at 6:47 PM, Dimitri Fontaine <[EMAIL PROTECTED]>
wrote:

> Here's a proposal for COPY to support the T part of an ETL, that is adding
> the capability for COPY FROM to Transform the data it gets.
>
> The idea is quite simple: adding to COPY FROM the option to run a function
> on the data before to call datatype_in functions. This needs some syntax
> addition to be worked out at the COPY side, then the COPY code will have to
> run the given function on the read data and consider giving the output of
> it to current COPY code (datatype input function).
>
> The function could either get the data as text or bytea, and would have to
> return either text or bytea. bytea seems the more sensible choice, as long
> as we don't lose encoding information there, which I'm not sure about.
>
> The syntax could be something like:
>  COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc;
>
> I tried to only add keywords already present in [1], while getting
> something meaningful... and x is intended to be the column number,
> counting from 1.
>  [1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html
>
> Comments?
> --
> dim
>

+1

Data transformation while doing a data load is a requirement now and then.
That users will have to do mass updates *after* the load completes to mend
the data to their liking should be reason enough to do this while the
loading is happening. I think to go about it the right way we should
support the following:


* The ability to provide per-column transformation expressions

COPY mytable (col1 transform to "col1 + 10", col2 transform to "'Post' ||
'greSQL'", col3...) FROM ..

* The ability to use any kind of expressions while doing the transformation
The transformation expression should be any expression (basically
ExecEvalExpr) that can be evaluated to give a resulting value and obviously
a corresponding is_null value too. It should and could be system in-built
functions (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or user defined
functions too

* The transformation expression can refer to other columns involved in the
load. So that when the current row is extracted from the input file, the
current values should be used to generate the new resultant values before
doing a heap_form_tuple. E.g.

(col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
"UPPER(col1 || col3)",...)

I have spent some thoughts on how to do this and will be happy to share the
same if the list is interested. Personally, I think data transformation
using such expressions is a pretty powerful and important activity while
doing the data load itself.

Regards,
Nikhils
-- 
EnterpriseDB http://www.enterprisedb.com