Re: [HACKERS] COPY enhancements

2009-10-20 Thread Emmanuel Cecchet

Tom,

 Emmanuel Cecchet m...@asterdata.com writes:
  Tom Lane wrote:
   There aren't any.  You can *not* put a try/catch around arbitrary code
   without a subtransaction.  Don't even think about it.

  Well then, why are the tests provided with the patch working?

 Because they carefully exercise only a tiny fraction of the system.
 The key word in my sentence above is arbitrary.  You don't know what
 a datatype input function might try to do, let alone triggers or other
 functions that COPY might have to invoke.  They might do things that
 need to be cleaned up after, and subtransaction rollback is the only
 mechanism we have for that.

Is it possible to use the try/catch mechanism to detect errors, log
them, and only abort the command at the end of processing?
This would have the advantage of being able to log all errors
without the overhead of subtransactions, and even of adding some
additional information on where the error happened (parsing, constraints,
trigger, ...) for further automated treatment. The result would still be
clean since we would abort the COPY command in case of an error, as we do
currently (but we would not stop on the first error).
I guess that it would make more sense to log to a file than to a table
in that case.


Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com




Re: [HACKERS] COPY enhancements

2009-10-20 Thread Tom Lane
Emmanuel Cecchet m...@asterdata.com writes:
 Tom Lane wrote:
  The key word in my sentence above is arbitrary.  You don't know what
  a datatype input function might try to do, let alone triggers or other
  functions that COPY might have to invoke.  They might do things that
  need to be cleaned up after, and subtransaction rollback is the only
  mechanism we have for that.

 Is it possible to use the try/catch mechanism to detect errors, log
 them, and only abort the command at the end of processing?

No, I wouldn't trust that.  The point here is that various backend
subsystems (eg, locks, buffers) might need to be cleaned up after an
error before they can be used further.  It might look like it works,
until you stumble across the cases where it doesn't.  An example of the
sort of situation I'm concerned about is somebody throwing an elog
while holding a buffer lock.  If you try to keep processing and clean
up later, it will work fine ... until the day comes that the subsequent
processing chances to try to lock that same buffer again, and then the
backend will freeze up.

We've got twenty years worth of code development that assumes that
transaction abort will clean up anything that was left hanging when
an error was thrown.  It was difficult enough getting that to work
for subtransaction abort as well as main transaction abort.  It's
not happening for anything less than a subtransaction abort.

This approach is a good tradeoff most of the time: the code is simpler,
much more reliable, and faster in the normal case than it would be if
we tried to do post-error cleanup differently.  Error-tolerant COPY
is going to pay for all that, but we're not revisiting the decision.

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-19 Thread Alvaro Herrera
Gokulakannan Somasundaram wrote:

 Actually, this problem is present even in today's transaction-id scenario,
 and the only way we avoid it is by freezing. Can we use a similar approach?
 This freezing would mean that we freeze the sub-transaction in order to
 avoid sub-transaction wrap-around failure.

This would mean we would have to go over the data inserted by the
subtransaction and mark it as subxact frozen.  Some sort of sub-vacuum
if you will (because it obviously needs to work inside a transaction).
Doesn't sound real workable to me.

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] COPY enhancements

2009-10-19 Thread Robert Haas
On Mon, Oct 19, 2009 at 11:21 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Gokulakannan Somasundaram wrote:

  Actually, this problem is present even in today's transaction-id scenario,
  and the only way we avoid it is by freezing. Can we use a similar approach?
  This freezing would mean that we freeze the sub-transaction in order to
  avoid sub-transaction wrap-around failure.

 This would mean we would have to go over the data inserted by the
 subtransaction and mark it as subxact frozen.  Some sort of sub-vacuum
 if you will (because it obviously needs to work inside a transaction).
 Doesn't sound real workable to me.

Especially because the XID consumed by the sub-transaction would still
be consumed, advancing the global XID counter.  Reclaiming the XIDs
after the fact doesn't fix anything as far as I can see.

...Robert



Re: [HACKERS] COPY enhancements

2009-10-18 Thread Gokulakannan Somasundaram
Actually, I thought of a solution for the wrap-around some time back. Let
me try to put my initial thoughts into it; maybe it will get refined over
the conversation.

Transaction wrap-around failure

Actually, this problem is present even in today's transaction-id scenario,
and the only way we avoid it is by freezing. Can we use a similar approach?
This freezing would mean that we freeze the sub-transaction in order to
avoid sub-transaction wrap-around failure. I think when we use a
sub-transaction, the tuple would have the sub-transaction id as its xmin
(correct me if I am wrong). If the tuple insertion succeeds, we will make
it equal to the parent transaction id. This way, we get a chance to re-use
the previously used sub-transaction id. This would mean accessing the
tuple again after the sub-transaction completes.

On the recovery front, the freezing should get logged into redo. With this
approach, we need only one sub-transaction id for the entire copy. If the
insert succeeds for a copied row, we go to the tuple again and sub-freeze
the tuple. If it fails, we roll back the sub-transaction. But for every
unsuccessful transaction, we need a transaction id. Maybe in order to
avoid this, we can have one transaction id to mark the failures and
freeze all the failed rows with that transaction id.

I am just throwing out an idea for consideration.

Thanks,
Gokul.


Re: [HACKERS] COPY enhancements

2009-10-13 Thread Emmanuel Cecchet

Tom Lane wrote:
 Ultimately, there's always going to be a tradeoff between speed and
 flexibility.  It may be that we should just say if you want to import
 dirty data, it's gonna cost ya and not worry about the speed penalty
 of subtransaction-per-row.  But that still leaves us with the 2^32
 limit.  I wonder whether we could break down COPY into sub-sub
 transactions to work around that...

Regarding that tradeoff between speed and flexibility, I think we could
propose multiple options:

- maximum speed: the current implementation, which fails on the first error
- speed with error logging: the COPY command fails if there is an error but
continues to log all errors
- speed with error logging, best effort: no use of sub-transactions, but
errors that can safely be trapped with pg_try/catch (no index violation,
no before insert trigger, etc.) are logged and the command can complete
- pre-loading (2-phase copy): phase 1 copies good tuples into a [temp]
table and bad tuples into an error table; phase 2 pushes good tuples to
the destination table (see the sketch below). Note that if phase 2 fails,
it could be retried, since the temp table would be dropped only on success
of phase 2.
- slow but flexible: run every row in a sub-transaction - are there any
real benefits compared to pg_loader?
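
A minimal sketch of how the 2-phase copy could be approximated today,
assuming illustrative table and file names (routing bad tuples to an
error table in phase 1 is the very feature being proposed, so this
sketch simply stages everything as text):

  -- Phase 1: an all-text [temp] staging table accepts almost any input.
  CREATE TEMP TABLE staging (a text, b text, c text);
  COPY staging FROM '/tmp/data.csv' WITH CSV;

  -- Phase 2: push good tuples to the destination table; if this fails,
  -- it can be fixed and retried, since the staging table is dropped
  -- only after it succeeds.
  INSERT INTO destination (a, b, c)
  SELECT a::int, b::timestamp, c
    FROM staging;

  DROP TABLE staging;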


Tom was also suggesting 'refactoring COPY into a series of steps that
the user can control'. What would these steps be? Would that be per row,
and would it allow discarding a bad tuple?


Emmanuel

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting

--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet




Re: [HACKERS] COPY enhancements

2009-10-13 Thread Tom Lane
Emmanuel Cecchet m...@frogthinker.org writes:
 - speed with error logging, best effort: no use of sub-transactions, but
 errors that can safely be trapped with pg_try/catch (no index violation,

There aren't any.  You can *not* put a try/catch around arbitrary code
without a subtransaction.  Don't even think about it.

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-13 Thread Emmanuel Cecchet

Tom Lane wrote:
 Emmanuel Cecchet m...@frogthinker.org writes:
  - speed with error logging, best effort: no use of sub-transactions, but
  errors that can safely be trapped with pg_try/catch (no index violation,

 There aren't any.  You can *not* put a try/catch around arbitrary code
 without a subtransaction.  Don't even think about it.

Well then, why are the tests provided with the patch working?
I hear you when you say that it is not a generally applicable idea, but
it seems that at least a couple of errors can be trapped with this
mechanism.


Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com




Re: [HACKERS] COPY enhancements

2009-10-13 Thread Dimitri Fontaine
Emmanuel Cecchet m...@frogthinker.org writes:
 Tom was also suggesting 'refactoring COPY into a series of steps that the
 user can control'. What would these steps be? Would that be per row and
 allow to discard a bad tuple?

The idea is to have COPY usable from a general SELECT query so that the
user controls what happens. Think of an SRF returning bytea[] or some
variation on the theme.

Maybe WITH to the rescue:

  WITH csv AS (
    -- no error here as the destination table is an in-memory tuple store,
    -- assuming we have adunstan's patch to ignore rows with too few or
    -- too many columns
    COPY csv(a, b, c, d) FROM STDIN WITH CSV HEADER -- and said options
  )
  INSERT INTO destination
  SELECT a, b, f(a + b - d), strange_timestamp_reader(c)
    FROM csv
   WHERE validity_check_passes(a, b, c, d);

That offers complete control to the user over the stages that transform
the data. In a previous thread some ideas (I forget the details) offered
users some more control, but I don't have the time right now to search
the archives.

Regards,
-- 
Dimitri Fontaine
PostgreSQL DBA, Architecte



Re: [HACKERS] COPY enhancements

2009-10-13 Thread Tom Lane
Emmanuel Cecchet m...@asterdata.com writes:
 Tom Lane wrote:
  There aren't any.  You can *not* put a try/catch around arbitrary code
  without a subtransaction.  Don't even think about it.

 Well then, why are the tests provided with the patch working?

Because they carefully exercise only a tiny fraction of the system.
The key word in my sentence above is arbitrary.  You don't know what
a datatype input function might try to do, let alone triggers or other
functions that COPY might have to invoke.  They might do things that
need to be cleaned up after, and subtransaction rollback is the only
mechanism we have for that.

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-12 Thread Simon Riggs
On Thu, 2009-10-08 at 11:01 -0400, Tom Lane wrote:

 So as far as I can see, the only form of COPY error handling that
 wouldn't be a cruel joke is to run a separate subtransaction for each
 row, and roll back the subtransaction on error.  Of course the
 problems
 with that are (a) speed, (b) the 2^32 limit on command counter IDs
 would mean a max of 2^32 rows per COPY, which is uncomfortably small
 these days.  Previous discussions of the problem have mentioned trying
 to batch multiple rows per subtransaction to alleviate both issues.
 Not easy of course, but that's why it's not been done yet.  With a
 patch like this you'd also have (c) how to avoid rolling back the
 insertions into the logging table.

(d) using too many xids will force the system to begin immediate
wraparound-avoidance vacuuming to freeze rows. 

Dimitri's pgloader is looking even more attractive, not least because it
exists and it works. (And is the reason I personally stopped considering
the COPY-error-logging feature as important).

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] COPY enhancements

2009-10-09 Thread Simon Riggs
On Fri, 2009-10-09 at 00:15 +0100, Simon Riggs wrote:
 On Thu, 2009-10-08 at 12:21 -0400, Tom Lane wrote:
  
  You'd eat a sub-sub-transaction per row, and start a new sub-transaction
  every 2^32 rows.
  
  However, on second thought this really doesn't get us anywhere, it just
  moves the 2^32 restriction somewhere else.  Once the outer transaction
  gets to be more than 2^31 XIDs old, the database is going to stop
  because of XID wraparound.
  
  So really we have to find some way to only expend one XID per failure,
  not one per row.
 
 I discovered a few days back that ~550 subtransactions is sufficient to
 blow max_stack_depth. 1 subtransaction per error doesn't allow many
 errors.

Not meaning to come up with problems, nor direct them at Tom, this is
just a convenient place to put in a few thoughts.

Another thing that has occurred to me is that RI checks are currently
resolved at end of statement and could end up rejecting any/all rows
loaded. If we break down the load into subtransaction pieces we would
really want the RI checks on the rows to be performed during the
subtransaction that makes them. The good thing about that is that it
would lend itself to holding successful checks in a hash table to allow
a fast path optimization of continual re-checks of same values.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] COPY enhancements

2009-10-09 Thread Hannu Krosing
On Thu, 2009-10-08 at 11:32 -0400, Robert Haas wrote:

 Another possible approach, which isn't perfect either, is the idea of
 allowing COPY to generate a single column of output of type text[].
 That greatly reduces the number of possible error cases, 

maybe make it bytea[] to further reduce error cases caused by charset
incompatibilities?

 and at least
 gets the data into the DB where you can hack on it.  But it's still
 going to be painful for some use cases.
 
 ...Robert

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training





Re: [HACKERS] COPY enhancements

2009-10-09 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Thu, 2009-10-08 at 12:21 -0400, Tom Lane wrote:
 So really we have to find some way to only expend one XID per failure,
 not one per row.

 I discovered a few days back that ~550 subtransactions is sufficient to
 blow max_stack_depth. 1 subtransaction per error doesn't allow many
 errors.

I assume you are talking about 550 nested levels, not 550 sequential
subtransactions, so that doesn't seem particularly relevant.

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-09 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Another thing that has occurred to me is that RI checks are currently
 resolved at end of statement and could end up rejecting any/all rows
 loaded. If we break down the load into subtransaction pieces we would
 really want the RI checks on the rows to be performed during the
 subtransaction that makes them. The good thing about that is that it
 would lend itself to holding successful checks in a hash table to allow
 a fast path optimization of continual re-checks of same values.

If we did that it would guarantee that cases like self-referential RI
would fail.  (Think parent-child links in a tree, for example.)
I see the point about wishing that such checks would be part of the
per-row error handling, but we can't just unconditionally do things
that way.

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-09 Thread Tom Lane
Hannu Krosing ha...@2ndquadrant.com writes:
 On Thu, 2009-10-08 at 11:32 -0400, Robert Haas wrote:
 Another possible approach, which isn't perfect either, is the idea of
 allowing COPY to generate a single column of output of type text[].
 That greatly reduces the number of possible error cases, 

 maybe make it bytea[] to further reduce error cases caused by charset
 incompatibilities ?

That seems likely to be considerably less convenient and more error
prone, as it now puts it on the user to do the correct conversion.

It does bring up an interesting point for error handling though, which
is what do we do with rows that fail encoding conversion?  For logging
to a file we could/should just decree that we write out the original,
allegedly-in-the-client-encoding data.  I'm not sure what we do about
logging to a table though.  The idea of storing bytea is pretty
unpleasant but there might be little choice.

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-09 Thread Greg Smith

On Fri, 9 Oct 2009, Tom Lane wrote:

 what do we do with rows that fail encoding conversion?  For logging to a
 file we could/should just decree that we write out the original,
 allegedly-in-the-client-encoding data.  I'm not sure what we do about
 logging to a table though.  The idea of storing bytea is pretty
 unpleasant but there might be little choice.


I think this detail can get punted as documented and the error logged, but 
not actually handled perfectly.  In most use cases I've seen here, saving 
the rows to the reject file/table is a convenience rather than a hard 
requirement anyway.  You can always dig them back out of the original 
again if you see an encoding error in the logs, and it's rare you can 
completely automate that anyway.


The main purpose of the reject file/table is to accumulate things you
might fix by hand or by systematic update (i.e. add ,\N for a missing
column when warranted) before trying a re-import for review.  I suspect
the users of this feature would be OK with knowing that it can't be 100%
accurate in the face of encoding errors.  It's more important that in the
usual cases, things like bad delimiters and missing columns, you can
easily manipulate the rejects as simple text.  Making that harder just for
this edge case wouldn't match the priorities of the users of this feature
I've encountered.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 What's really bad about this is that a flag called error_logging is
 actually changing the behavior of the command in a way that is far
 more dramatic than (and doesn't actually have much to do with) error
 logging.  It's actually making a COPY command succeed that would
 otherwise have failed.  That's not just a logging change.

That's what has been asked for, a COPY that is able to load the good
rows in the presence of bad rows. I'd rather change the option name than
the behavior. Pretty please.

Regards,
-- 
dim



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 4:42 AM, Dimitri Fontaine dfonta...@hi-media.com wrote:
 Robert Haas robertmh...@gmail.com writes:
 What's really bad about this is that a flag called error_logging is
 actually changing the behavior of the command in a way that is far
 more dramatic than (and doesn't actually have much to do with) error
 logging.  It's actually making a COPY command succeed that would
 otherwise have failed.  That's not just a logging change.

 That's what has been asked for, a COPY that is able to load the good
 rows in the presence of bad rows. I'd rather change the option name than
 the behavior. Pretty please.

I'm a little mystified by this response since I spent several
paragraphs following the one that you have quoted here explaining how
I think we should approach the problem of providing the features that
are currently all encapsulated under the mantle of error logging.  I
don't think changing the name is going to be sufficient by itself,
but, well, go back and read what I suggested (and comment on it if you
have any thoughts or just want to say +/-1).

Lest there be any unclarity, I am NOT trying to shoot down this
feature with my laser-powered bazooka.  What I AM trying to do is
point out - as early as possible - things that I believe that a
hypothetical committer would likely also object to.  That's the major
point of having non-committer reviewers, at least AIUI.  I am not
opposed to the underlying features except to the extent that they have
unfixable design problems.  I believe that they CURRENTLY have design
problems, but I'm hopeful that, with some discussion, we can agree on
a way forward.  I think, though, that we should try to keep the
discussion technical (how well does this work?) rather than a
referendum on the underlying feature (which someone might object to,
but the sheer level of interest in this patch is a fairly compelling
argument that there is support for at least some of what it is trying
to do).

...Robert



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Simon Riggs

On Wed, 2009-10-07 at 22:30 -0400, Robert Haas wrote:
 On Fri, Sep 25, 2009 at 10:01 AM, Emmanuel Cecchet m...@asterdata.com wrote:
  Robert,
 
  Here is the new version of the patch that applies to CVS HEAD as of this
  morning.
 
  Emmanuel
 
 I took a look at this patch tonight and, having now read through some
 of it, I have some more detailed comments.
 
 It seems quite odd to me that when COPY succeeds but there are errors,
 the transaction commits.  The only indication that some of my data
 didn't end up in the table is that the output says COPY n where n is
 less than the total number of rows I attempted to copy.  On the other
 hand, it would be equally bad if the transaction aborted, because then
 my error logging table would not get populated - I note that that's
 the behavior we do get if the max errors threshold is exceeded.  I'm
 not sure what the right answer is here, but it needs some thought and
 discussion.  I think at a minimum the caller needs some indication of
 the number of FAILED rows, not just the number of successful ones.
 
 What's really bad about this is that a flag called error_logging is
 actually changing the behavior of the command in a way that is far
 more dramatic than (and doesn't actually have much to do with) error
 logging.  It's actually making a COPY command succeed that would
 otherwise have failed.  That's not just a logging change.
 
 I think the underlying problem is that there is more than one feature
 here, and they're kind of mixed together.  The fact that the
 partitioning code needs to be separated from the error logging stuff
 has already been discussed, but I think it actually needs to be broken
 down even further.  I think there are three separate features in the
 error logging code:
 
 A. Ability to ignore a certain number of errors (of certain types?)
 and still get the other tuples into the table anyway.
 B. Ability to return error information in a structured format rather
 than as an exception message.
 C. Ability to direct the structured error messages to a table.
 
 Each of those features deserves a separate discussion to decide
 whether we want it and how best to implement it.  Personally, I think
 we should skip (C), at least as a starting point.  Instead of logging
 to a table, I think we should consider making COPY return the tuples
 indicating the error as a query result, the same way EXPLAIN returns
 the query plan.  It looks like this would eliminate the need for at
 least three of the new COPY options without losing much functionality.
 
 I also think that, if possible, (A) and (B) should be implemented as
 separate features, so that each one can be used separately or both can
 be used in combination.

I don't see any logical issues with what has been proposed. Current COPY
allows k=0 cumulative errors before it aborts. This patch allows us to
provide a parameter to vary k. If we get nerrors > k then the COPY
should abort. If nerrors <= k then the COPY can commit. Exactly what we
need. Perhaps the parameter should simply be called max_errors rather
than error_logging, so the meaning is clear.

We must have detailed error reporting for each row individually,
otherwise we will not be able to correct the errors. "125346 rows loaded,
257 errors" is not really useful information, so I wouldn't bother trying
to change the return info to supply the number of errors. If you defined
an error file, you know you need to check it. If it doesn't exist, no
errors. Simple.

For me, A and B are inseparable. Dimitri's idea to have an error file
and an error-reason file seems very good to me.

C may be important because of security concerns regarding writing error
files but it isn't critical in first commit.

(Not really in reply to Robert: Most database loaders define
max_num_bad_rows as a percentage of the size of the file. Otherwise you
need to read the file to see how big it is before you decide an
appropriate setting, which if it is very big is a bad idea).

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] COPY enhancements

2009-10-08 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 I'm a little mystified by this response since I spent several
 paragraphs following the one that you have quoted here explaining how
 I think we should approach the problem of providing the features that
 are currently all encapsulated under the mantle of error logging.

Yeah, sorry, I was too quick to answer. Basically, having the bad rows
returned to the client the same way EXPLAIN does feels strange. Not very
scalable, and it sounds awkward to manage client-side...

So it feels to me like when you talk about postponing the bad-lines
rejection handling you're in fact postponing it all, as that's the thing
we want this patch for. I couldn't really tell whether your proposal
is a step towards better rejection handling, though.

Hope this clarifies it; sorry again for the bad attempt at being terse,

Regards,
-- 
Dimitri Fontaine
PostgreSQL DBA, Architecte



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 8:34 AM, Dimitri Fontaine dfonta...@hi-media.com wrote:
 Robert Haas robertmh...@gmail.com writes:
 I'm a little mystified by this response since I spent several
 paragraphs following the one that you have quoted here explaining how
 I think we should approach the problem of providing the features that
 are currently all encapsulated under the mantle of error logging.

 Yeah sorry I was to quick to answer. Basically having the bad rows
 returned to the client the same way EXPLAIN does feels strange. Not very
 scalable and sounds uneasy to manage client side...

I was thinking of returning the error messages, rather than the rows,
same as what is getting logged to the side table by the current patch.

As for table vs. result-set, let me just say that a patch that returns a
result-set will be easier and more likely to be committed, and a
follow-on patch can add a feature to redirect it to a table (or maybe
we'll come up with another solution to that part, like WITH
copy_results AS (COPY ...) INSERT INTO ... SELECT ... FROM
copy_results, which would actually be far more powerful and with many
fewer definitional challenges).

 So it feels to me like when you talk about postponing the bad lines
 rejection handling you're in fact postponing it all, as that's the thing
 we're wanting this patch for. I couldn't really parse if your proposal
 is a step towards having a better rejection handling, though.

 Hope this clarifies it, sorry again for bad try at being terse,

Well, right now we are postponing EVERYTHING, because there is a week
left in the CommitFest and this patch isn't close to being
committable.  The next time someone takes a crack at this, IMHO, it
should be broken into significantly smaller pieces.  Exactly which of
those pieces gets done first is up to the patch author, of course, but
I don't see why someone couldn't have a workable patch with an
interesting subset of this functionality done in time for next CF,
especially given this code to hack on for a starting point.

...Robert



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: 
 
 It seems quite odd to me that when COPY succeeds but there are
 errors, the transaction commits.  The only indication that some of
 my data didn't end up in the table is that the output says COPY n
 where n is less than the total number of rows I attempted to copy. 
 On the other hand, it would be equally bad if the transaction
 aborted, because then my error logging table would not get populated
 - I note that that's the behavior we do get if the max errors
 threshold is exceeded.  I'm not sure what the right answer is here,
 but it needs some thought and discussion.  I think at a minimum the
 caller needs some indication of the number of FAILED rows, not just
 the number of successful ones.
 
When the COPY fails due to a high error count, we should be able to
determine:
 
(1)  How many rows were read.
(2)  How many of the rows read had errors.
(3)  Images of failed rows with errors found on each.
 
On success, we need the above, plus the failed rows in a format usable
for editing and re-applying as needed.
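
For illustration, a reject-logging structure satisfying (1)-(3) might
look like this (a hypothetical sketch; these names are not the patch's
actual schema):

  CREATE TABLE copy_errors (
      copy_id  text,    -- identifies the COPY run
      lineno   bigint,  -- position of the failed row in the input
      errmsg   text,    -- the error reported for that row
      rawline  bytea    -- image of the original line; bytea so that even
                        -- encoding-conversion failures can be captured
  );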
 
 Instead of logging to a table, I think we should consider making
 COPY return the tuples indicating the error as a query result, the
 same way EXPLAIN returns the query plan.
 
This seems attractive, particularly if it can be fed to an INSERT
statement (like INSERT ... SELECT does).  The only problem I see with
it is that PostgreSQL seems to not want to return rows from statements
which fail.  (I know this is generally considered a feature, but it
sometimes has unfortunate consequences.)  Is there a way to satisfy
(3) above on a failed COPY statement if we go this route?
 
-Kevin



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Lest there be any unclarity, I am NOT trying to shoot down this
 feature with my laser-powered bazooka.

Well, if you need somebody to do that --- I took a quick look through
this patch, and it is NOT going to get committed.  Not in anything
approximately like its current form.  The questions about how the
logging should act don't come anywhere near addressing the fundamental
problem with the patch, which is that IT DOESN'T WORK.  You can *not*
suppose that you can just put a PG_TRY block around some processing
and catch any random error and keep going.  I see that the patch tries
to avoid this by only catching certain major errcode categories, which
merely makes it useless while still being untrustworthy.

As an example of a case that anyone would expect to work that cannot
work with this approach, I submit unique-index violations.  When the
index throws the error, the bad row has already been inserted in the
table (and maybe some other indexes too).  The *only* way to clean up
is to abort the transaction/subtransaction so that the row will not be
considered good.

More generally, since we are calling user-defined BEFORE INSERT triggers
in there, we have to assume that absolutely anything at all could have
been done by the triggers.  PG_CATCH doesn't even pretend to cope with
that.

So as far as I can see, the only form of COPY error handling that
wouldn't be a cruel joke is to run a separate subtransaction for each
row, and roll back the subtransaction on error.  Of course the problems
with that are (a) speed, (b) the 2^32 limit on command counter IDs
would mean a max of 2^32 rows per COPY, which is uncomfortably small
these days.  Previous discussions of the problem have mentioned trying
to batch multiple rows per subtransaction to alleviate both issues.
Not easy of course, but that's why it's not been done yet.  With a
patch like this you'd also have (c) how to avoid rolling back the
insertions into the logging table.
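
For illustration, subtransaction-per-row is essentially what one can
already hand-roll in PL/pgSQL, where every BEGIN ... EXCEPTION block runs
in its own subtransaction (a sketch with illustrative table names; it
pays exactly the per-row costs described above):

  CREATE OR REPLACE FUNCTION load_row_by_row() RETURNS void AS $$
  DECLARE
      r record;
  BEGIN
      FOR r IN SELECT lineno, a, b FROM staging ORDER BY lineno LOOP
          BEGIN
              INSERT INTO dest (a, b) VALUES (r.a::int, r.b::date);
          EXCEPTION WHEN OTHERS THEN
              -- the implicit subtransaction rolled back; log and go on
              INSERT INTO copy_errors (lineno, errmsg)
                  VALUES (r.lineno, SQLERRM);
          END;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;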

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Alvaro Herrera
Robert Haas wrote:

 Some defective part of my brain enjoys seeing things run smoothly more
 than it enjoys being lazy.

Strangely, that seems to say you'd make a bad Perl programmer, per Larry
Wall's three virtues.

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 11:01 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Lest there be any unclarity, I am NOT trying to shoot down this
 feature with my laser-powered bazooka.

 Well, if you need somebody to do that

Well, I'm trying not to demoralize people who have put in hard work,
however much it may not be usable.  Still, your points are well taken.
 I did raise some of them (with a lot less technical detail) in my
review of last night.

 So as far as I can see, the only form of COPY error handling that
 wouldn't be a cruel joke is to run a separate subtransaction for each
 row, and roll back the subtransaction on error.  Of course the problems
 with that are (a) speed, (b) the 2^32 limit on command counter IDs
 would mean a max of 2^32 rows per COPY, which is uncomfortably small
 these days.  Previous discussions of the problem have mentioned trying
 to batch multiple rows per subtransaction to alleviate both issues.
 Not easy of course, but that's why it's not been done yet.  With a
 patch like this you'd also have (c) how to avoid rolling back the
 insertions into the logging table.

Yeah.  I think it's going to be hard to make this work without having
standalone transactions.  One idea would be to start a subtransaction,
insert tuples until one fails, then roll back the subtransaction and
start a new one, and continue on until the error limit is reached.  At
the end, if the number of rollbacks is > 0, then roll back the final
subtransaction also.  This wouldn't have the property of getting the
unerrorred data into the table, but at least it would let you report
all the errors in a single pass, hopefully without being gratingly
slow.  Subcommitting every single row is going to be really painful,
especially after Hot Standby goes in and we have to issue a WAL record
after every 64 subtransactions (AIUI).

Another possible approach, which isn't perfect either, is the idea of
allowing COPY to generate a single column of output of type text[].
That greatly reduces the number of possible error cases, and at least
gets the data into the DB where you can hack on it.  But it's still
going to be painful for some use cases.
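
For illustration, a rough approximation of that available today is to
land each raw line in a single text column and split it afterwards
(hypothetical names, and it assumes a delimiter byte that never occurs
in the data; COPY's text format still interprets backslashes, so this
is only approximate):

  CREATE TABLE raw_lines (line text);
  COPY raw_lines (line) FROM '/tmp/dirty.dat' WITH DELIMITER E'\x01';

  -- the data is now in the DB and can be hacked on with SQL:
  SELECT (string_to_array(line, E'\t'))[1] AS a,
         (string_to_array(line, E'\t'))[2] AS b
    FROM raw_lines;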

...Robert



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 11:29 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Robert Haas wrote:

 Some defective part of my brain enjoys seeing things run smoothly more
 than it enjoys being lazy.

 Strangely, that seems to say you'd make a bad Perl programmer, per Larry
 Wall's three virtues.

Don't worry.  I have a strong preference for them running smoothly
without me whenever possible.  It's only that my hubris and impatience
are capable of overriding my laziness from time to time.

...Robert



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Subcommitting every single row is going to be really painful,
 especially after Hot Standby goes in and we have to issue a WAL record
 after every 64 subtransactions (AIUI).

Yikes ... I had not been following that discussion, but that sure sounds
like a deal-breaker.  For HS, not this.  But back to topic:

 Another possible approach, which isn't perfect either, is the idea of
 allowing COPY to generate a single column of output of type text[].
 That greatly reduces the number of possible error cases, and at least
 gets the data into the DB where you can hack on it.  But it's still
 going to be painful for some use cases.

Yeah, that connects to the previous discussion about refactoring COPY
into a series of steps that the user can control.

Ultimately, there's always going to be a tradeoff between speed and
flexibility.  It may be that we should just say if you want to import
dirty data, it's gonna cost ya and not worry about the speed penalty
of subtransaction-per-row.  But that still leaves us with the 2^32
limit.  I wonder whether we could break down COPY into sub-sub
transactions to work around that...

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Another possible approach, which isn't perfect either, is the idea of
 allowing COPY to generate a single column of output of type text[].
 That greatly reduces the number of possible error cases, and at least
 gets the data into the DB where you can hack on it.  But it's still
 going to be painful for some use cases.

 Yeah, that connects to the previous discussion about refactoring COPY
 into a series of steps that the user can control.

 Ultimately, there's always going to be a tradeoff between speed and
 flexibility.  It may be that we should just say if you want to import
 dirty data, it's gonna cost ya and not worry about the speed penalty
 of subtransaction-per-row.  But that still leaves us with the 2^32
 limit.  I wonder whether we could break down COPY into sub-sub
 transactions to work around that...

How would that work?  Don't you still need to increment the command counter?

...Robert



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Joshua D. Drake
On Thu, 2009-10-08 at 11:59 -0400, Robert Haas wrote:
 On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Another possible approach, which isn't perfect either, is the idea of
  allowing COPY to generate a single column of output of type text[].
  That greatly reduces the number of possible error cases, and at least
  gets the data into the DB where you can hack on it.  But it's still
  going to be painful for some use cases.
 
  Yeah, that connects to the previous discussion about refactoring COPY
  into a series of steps that the user can control.
 
  Ultimately, there's always going to be a tradeoff between speed and
  flexibility.  It may be that we should just say if you want to import
  dirty data, it's gonna cost ya and not worry about the speed penalty
  of subtransaction-per-row.  But that still leaves us with the 2^32
  limit.  I wonder whether we could break down COPY into sub-sub
  transactions to work around that...
 
 How would that work?  Don't you still need to increment the command counter?

Couldn't you just commit each range of subtransactions based on some
threshold?

COPY foo FROM '/tmp/bar/' COMMIT_THRESHOLD 1000000;

It counts to 1 million, commits, and starts a new transaction. Yes, there
would be 1 million subtransactions, but once it hits those cleanly, it
commits.

?

Joshua D. Drake


 
 
-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - 
Salamander




Re: [HACKERS] COPY enhancements

2009-10-08 Thread Rod Taylor
 Yeah.  I think it's going to be hard to make this work without having
 standalone transactions.  One idea would be to start a subtransaction,
 insert tuples until one fails, then roll back the subtransaction and
 start a new one, and continue on until the error limit is reached.


I've found performance is reasonable for data with low numbers of errors
(say 1 per 100,000 records or less) by doing the following:

SAVEPOINT bulk;
Insert 1000 records using COPY.

If there is an error, roll back to bulk, and step through each line
individually, each within its own subtransaction. All good lines are
kept and bad lines are logged; client-side control makes logging trivial.

The next set of 1000 records is done in bulk again.

1000 records per savepoint seems to be a good point for my data without too
much time lost to overhead or too many records to retry due to a failing
record. Of course, it is controlled by the client side rather than server
side so reporting back broken records is trivial.
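
The SQL the client issues for that flow looks roughly like this (the
table name is illustrative, and the branching is of course driven by
the client):

  BEGIN;
  SAVEPOINT bulk;
  COPY foo FROM STDIN;            -- client streams the next 1000 lines
  -- on success: RELEASE SAVEPOINT bulk; start the next batch
  -- on error:
  ROLLBACK TO SAVEPOINT bulk;
  -- ... then re-send the same 1000 lines one at a time, each inside its
  -- own savepoint, logging any line whose single-row COPY fails:
  SAVEPOINT one;
  COPY foo FROM STDIN;            -- one line
  RELEASE SAVEPOINT one;          -- or ROLLBACK TO SAVEPOINT one on error
  COMMIT;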


It may be possible to boost performance by:

1) Having copy remember which specific line caused the error. So it can
replace lines 1 through 487 in a subtransaction since it knows those are
successful. Run 488 in its own subtransaction. Run 489 through ... in a new
subtransaction.
2) Increasing the number of records per subtransaction if data is clean. It
wouldn't take long until you were inserting millions of records per
subtransaction for a large data set. This should make the subtransaction
overhead minimal. Small imports would still run slower but very large
imports of clean data should be essentially the same speed in the end.


Re: [HACKERS] COPY enhancements

2009-10-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether we could break down COPY into sub-sub
 transactions to work around that...

 How would that work?  Don't you still need to increment the command counter?

Actually, command counter doesn't help because incrementing the CC
doesn't give you a rollback boundary between rows inserted before it
and afterwards.  What I was vaguely imagining was

-- outer transaction for whole COPY

-- sub-transactions that are children of outer transaction

-- sub-sub-transactions that are children of sub-transactions

You'd eat a sub-sub-transaction per row, and start a new sub-transaction
every 2^32 rows.

However, on second thought this really doesn't get us anywhere, it just
moves the 2^32 restriction somewhere else.  Once the outer transaction
gets to be more than 2^31 XIDs old, the database is going to stop
because of XID wraparound.

So really we have to find some way to only expend one XID per failure,
not one per row.

Another approach that was discussed earlier was to divvy the rows into
batches.  Say every thousand rows you sub-commit and start a new
subtransaction.  Up to that point you save aside the good rows somewhere
(maybe a tuplestore).  If you get a failure partway through a batch,
you start a new subtransaction and re-insert the batch's rows up to the
bad row.  This could be pretty awful in the worst case, but most of the
time it'd probably perform well.  You could imagine dynamically adapting
the batch size depending on how often errors occur ...

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Tom Lane
Joshua D. Drake j...@commandprompt.com writes:
 Couldn't you just commit each range of subtransactions based on some
 threshold?

 COPY foo from '/tmp/bar/' COMMIT_THRESHOLD 100;

 It counts to 1mil, commits starts a new transaction. Yes there would be
 1million sub transactions but once it hits those clean, it commits.

Hmm, if we were willing to break COPY into multiple *top level*
transactions, that would avoid my concern about XID wraparound.
The issue here is that if the COPY does eventually fail (and there
will always be failure conditions, eg out of disk space), then some
of the previously entered rows would still be there; but possibly
not all of them, depending on whether we batch rows.  The latter
property actually bothers me more than the former, because it would
expose an implementation detail to the user.  Thoughts?

Also, this does not work if you want the copy to be part of a bigger
transaction, viz
        BEGIN;
        do something;
        COPY ...;
        do something else;
        COMMIT;

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 12:37 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Joshua D. Drake j...@commandprompt.com writes:
 Couldn't you just commit each range of subtransactions based on some
 threshold?

 COPY foo from '/tmp/bar/' COMMIT_THRESHOLD 100;

 It counts to 1mil, commits starts a new transaction. Yes there would be
 1million sub transactions but once it hits those clean, it commits.

 Hmm, if we were willing to break COPY into multiple *top level*
 transactions, that would avoid my concern about XID wraparound.
 The issue here is that if the COPY does eventually fail (and there
 will always be failure conditions, eg out of disk space), then some
 of the previously entered rows would still be there; but possibly
 not all of them, depending on whether we batch rows.

Yeah, I don't feel good about that.

 Also, this does not work if you want the copy to be part of a bigger
 transaction, viz
        BEGIN;
        do something;
        COPY ...;
        do something else;
        COMMIT;

Or that.

...Robert



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 Hmm, if we were willing to break COPY into multiple *top level*
 transactions, that would avoid my concern about XID wraparound.
 The issue here is that if the COPY does eventually fail (and there
 will always be failure conditions, eg out of disk space), then some
 of the previously entered rows would still be there; but possibly
 not all of them, depending on whether we batch rows.  The latter
 property actually bothers me more than the former, because it would
 expose an implementation detail to the user.  Thoughts?
 
 Also, this does not work if you want the copy to be part of a bigger
 transaction, viz
   BEGIN;
   do something;
   COPY ...;
   do something else;
   COMMIT;
 
I was considering suggesting multiple top-level transactions, partly
based on the fact that other bulk-load utilities that I've used
support that.  Then I realized that those were *client* side
applications, and didn't have to worry about being part of an
enclosing transaction.  It seems wrong to break transactional
semantics in a single statement execution.  Multiple top-level
transactions could be fine in a bulk-load *application*; but not, in
my view, in the COPY *statement*.
 
-Kevin



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether we could break down COPY into sub-sub
 transactions to work around that...

 How would that work?  Don't you still need to increment the command counter?

 Actually, command counter doesn't help because incrementing the CC
 doesn't give you a rollback boundary between rows inserted before it
 and afterwards.  What I was vaguely imagining was

Oh, right.

 So really we have to find some way to only expend one XID per failure,
 not one per row.

Agreed.

 Another approach that was discussed earlier was to divvy the rows into
 batches.  Say every thousand rows you sub-commit and start a new
 subtransaction.  Up to that point you save aside the good rows somewhere
 (maybe a tuplestore).  If you get a failure partway through a batch,
 you start a new subtransaction and re-insert the batch's rows up to the
 bad row.  This could be pretty awful in the worst case, but most of the
 time it'd probably perform well.  You could imagine dynamically adapting
 the batch size depending on how often errors occur ...

Yeah, I think that's promising.  There is of course the possibility
that a row which previously succeeded could fail the next time around,
but most of the time that shouldn't happen, and it should be possible
to code it so that it still behaves somewhat sanely if it does.

...Robert



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Greg Smith

On Thu, 8 Oct 2009, Tom Lane wrote:

 It may be that we should just say if you want to import dirty data,
 it's gonna cost ya and not worry about the speed penalty of
 subtransaction-per-row.


This goes along with the response I gave on objections to adding other
bits of overhead into COPY.  If the performance only suffers when you're
targeting unclean data, the users this feature targets will gladly accept
that trade-off.  You're still way ahead of the other options here at the
finish line.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Greg Smith

On Thu, 8 Oct 2009, Rod Taylor wrote:

 1) Having copy remember which specific line caused the error. So it can
 replace lines 1 through 487 in a subtransaction since it knows those are
 successful. Run 488 in its own subtransaction. Run 489 through ... in a
 new subtransaction.


This is the standard technique used in other bulk loaders I'm aware of.

 2) Increasing the number of records per subtransaction if data is clean.
 It wouldn't take long until you were inserting millions of records per
 subtransaction for a large data set.


You can make it adaptive in both directions with some boundaries.  If you 
double the batch size every time there's a clean commit, and halve it 
every time there's an error, start batching at 1024 and bound to the range 
[1,1048576].  That's close to optimal behavior here if combined with the 
targeted retry described in (1).
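
A sketch of that control logic in PL/pgSQL terms rather than inside COPY
itself, assuming a staging table with a line-number column (the names
are illustrative; each BEGIN ... EXCEPTION block is a subtransaction):

  CREATE OR REPLACE FUNCTION load_adaptive() RETURNS void AS $$
  DECLARE
      batch    int := 1024;       -- starting batch size
      lo       int := 1;
      max_line int;
  BEGIN
      SELECT max(lineno) INTO max_line FROM staging;
      WHILE lo <= max_line LOOP
          BEGIN
              INSERT INTO dest (a, b)
                  SELECT a::int, b::date FROM staging
                   WHERE lineno BETWEEN lo AND lo + batch - 1;
              lo := lo + batch;
              batch := least(batch * 2, 1048576);   -- clean: double
          EXCEPTION WHEN OTHERS THEN
              IF batch = 1 THEN
                  RAISE NOTICE 'rejecting line %', lo;
                  lo := lo + 1;                     -- skip the bad row
              ELSE
                  batch := greatest(batch / 2, 1);  -- error: halve, retry
              END IF;
          END;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;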


The retry scheduling and batch size parts are the trivial and well 
understood parts here.  Actually getting all this to play nicely with 
transactions and commit failures (rather than just bad data failures) is 
what's difficult.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Another approach that was discussed earlier was to divvy the rows into
 batches.  Say every thousand rows you sub-commit and start a new
 subtransaction.  Up to that point you save aside the good rows somewhere
 (maybe a tuplestore).  If you get a failure partway through a batch,
 you start a new subtransaction and re-insert the batch's rows up to the
 bad row.  This could be pretty awful in the worst case, but most of the
 time it'd probably perform well.  You could imagine dynamically adapting
 the batch size depending on how often errors occur ...

 Yeah, I think that's promising.  There is of course the possibility
 that a row which previously succeeded could fail the next time around,
 but most of the time that shouldn't happen, and it should be possible
 to code it so that it still behaves somewhat sanely if it does.

Actually, my thought was that failure to reinsert a previously good
tuple should cause us to abort the COPY altogether.  This is a
cheap-and-easy way of avoiding sorcerer's apprentice syndrome.
Suppose the failures are coming from something like out of disk space,
transaction timeout, whatever ... a COPY that keeps on grinding no
matter what is *not* ideal.

regards, tom lane



Re: [HACKERS] COPY enhancements

2009-10-08 Thread Robert Haas
On Thu, Oct 8, 2009 at 1:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Another approach that was discussed earlier was to divvy the rows into
 batches.  Say every thousand rows you sub-commit and start a new
 subtransaction.  Up to that point you save aside the good rows somewhere
 (maybe a tuplestore).  If you get a failure partway through a batch,
 you start a new subtransaction and re-insert the batch's rows up to the
 bad row.  This could be pretty awful in the worst case, but most of the
 time it'd probably perform well.  You could imagine dynamically adapting
 the batch size depending on how often errors occur ...

 Yeah, I think that's promising.  There is of course the possibility
 that a row which previously succeeded could fail the next time around,
 but most of the time that shouldn't happen, and it should be possible
 to code it so that it still behaves somewhat sanely if it does.

 Actually, my thought was that failure to reinsert a previously good
 tuple should cause us to abort the COPY altogether.  This is a
 cheap-and-easy way of avoiding sorcerer's apprentice syndrome.
 Suppose the failures are coming from something like out of disk space,
 transaction timeout, whatever ... a COPY that keeps on grinding no
 matter what is *not* ideal.

I think you handle that by putting a cap on the total number of errors
you're willing to accept (and in any event you'll always skip the row
that failed, so forward progress can't cease altogether).  For out of
disk space or transaction timeout, sure, but you might also have
things like a serialization error that occurs on the reinsert that
didn't occur on the original.  You don't want that to kill the whole
bulk load, I would think.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-08 Thread Bruce Momjian
Dimitri Fontaine wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  It will be best to have the ability to have a specific rejection reason
  for each row rejected. That way we will be able to tell the difference
  between uniqueness violation errors, invalid date format on col7, value
  fails check constraint on col22 etc.. 
 
 In case that helps, what pgloader does is log into two files, named
 after the table name (not scalable to a server-side solution):
   table.rej --- lines it could not load, straight from the source file
   table.rej.log --- errors as given by the server, plus a pgloader comment
 
 The pgloader comment is necessary for associating each log line with the
 source file line: since pgloader operates by dichotomy, the server always
 reports the error on line 1.
 
 The idea of having two error files could be kept, though; the aim is to
 be able to fix the setup and then COPY the table.rej file again when it
 turns out the errors are not in the file content. Or to load it into
 another table, with all columns as text or bytea, and then clean the data
 from a procedure.

What would be _cool_ would be to add the ability to have comments in the
COPY files, like \#, and then the copy data lines and errors could be
adjacent.  (Because of the way we control COPY escaping, adding \# would
not be a problem.  We have \N for null, for example.)
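
Purely as a hypothetical illustration (\# is the proposal, not existing 
COPY syntax), an annotated reject file in text format could then keep each 
error right next to the line it describes, and still be fed back in 
directly:

    COPY foo FROM stdin;
    1	one
    \# line 2 rejected: invalid input syntax for type integer: "twwo"
    twwo	two
    3	three
    \.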

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-08 Thread Andrew Dunstan



Bruce Momjian wrote:

What would be _cool_ would be to add the ability to have comments in the
COPY files, like \#, and then the copy data lines and errors could be
adjacent.  (Because of the way we control COPY escaping, adding \# would
not be a problem.  We have \N for null, for example.)

  


That wouldn't be of any use with CSV files, which in my experience are 
far more likely to be sources of data errors than input files in 
Postgres Text format.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-08 Thread Bruce Momjian
Robert Haas wrote:
 Each of those features deserves a separate discussion to decide
 whether we want it and how best to implement it.  Personally, I think
 we should skip (C), at least as a starting point.  Instead of logging
 to a table, I think we should consider making COPY return the tuples
 indicating the error as a query result, the same way EXPLAIN returns
 the query plan.  It looks like this would eliminate the need for at
 least three of the new COPY options without losing much functionality.

Can we capture EXPLAIN output into a table?  If so, can the COPY error
output be sent to a table so we don't need to add this functionality to
COPY specifically?
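
For what it's worth, EXPLAIN output can already be captured from PL/pgSQL, 
which is roughly the precedent in question; a minimal sketch (the table 
and function names are made up for illustration):

    CREATE TABLE plan_lines (line text);

    CREATE OR REPLACE FUNCTION capture_plan(q text) RETURNS void AS $$
    DECLARE
        r record;
    BEGIN
        -- EXPLAIN returns ordinary text rows, so we can loop over them
        FOR r IN EXECUTE 'EXPLAIN ' || q LOOP
            INSERT INTO plan_lines VALUES (r."QUERY PLAN");
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- SELECT capture_plan('SELECT * FROM pg_class');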

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-08 Thread Bruce Momjian
Robert Haas wrote:
  That was a compliment on your project management skills.  Keeping the CF
  work moving forward steadily is both unglamorous and extremely valuable, and
  I don't think anyone else even understands why you've volunteered to handle
  so much of it.  But I know I appreciate it.
 
 Thanks.  I'm not sure I completely understand it, either.  I'm sort of
 hoping someone else will step up to the plate, but in the absence of
 such a person I kind of hate to leave it hanging.  Some defective part
 of my brain enjoys seeing things run smoothly more than it enjoys
 being lazy.

I am the same.

Also, I do get asked the "Where did this Robert Haas come from?" question
occasionally.  ;-)

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-08 Thread Simon Riggs
On Thu, 2009-10-08 at 18:23 -0400, Bruce Momjian wrote:
 Dimitri Fontaine wrote:
  Simon Riggs si...@2ndquadrant.com writes:
   It will be best to have the ability to have a specific rejection reason
   for each row rejected. That way we will be able to tell the difference
   between uniqueness violation errors, invalid date format on col7, value
   fails check constraint on col22 etc.. 
  
  In case that helps, what pgloader does is log into two files, named
  after the table name (not scalable to a server-side solution):
    table.rej --- lines it could not load, straight from the source file
    table.rej.log --- errors as given by the server, plus a pgloader comment
  
  The pgloader comment is necessary for associating each log line with the
  source file line: since pgloader operates by dichotomy, the server always
  reports the error on line 1.
  
  The idea of having two error files could be kept, though; the aim is to
  be able to fix the setup and then COPY the table.rej file again when it
  turns out the errors are not in the file content. Or to load it into
  another table, with all columns as text or bytea, and then clean the data
  from a procedure.
 
 What would be _cool_ would be to add the ability to have comments in the
 COPY files, like \#, and then the copy data lines and errors could be
 adjacent.  (Because of the way we control COPY escaping, adding \# would
 not be a problem.  We have \N for null, for example.)

That was my idea also until I heard Dimitri's two file approach.

Having a pristine data file and a matching error file means you can
potentially just resubmit the error file again. Often you need to do
things like trap RI errors and then resubmit them at a later time once
the master rows have entered the system.
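
The round trip is then about as simple as it sounds (file and table names 
here follow pgloader's convention and are illustrative):

    -- first pass: the loader writes rejected lines to orders.rej
    -- ... load the master rows, or otherwise fix the cause ...
    -- second pass: replay only the rejected lines
    \copy orders FROM 'orders.rej'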

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-08 Thread Simon Riggs
On Thu, 2009-10-08 at 12:21 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I wonder whether we could break down COPY into sub-sub
  transactions to work around that...
 
  How would that work?  Don't you still need to increment the command counter?
 
 Actually, command counter doesn't help because incrementing the CC
 doesn't give you a rollback boundary between rows inserted before it
 and afterwards.  What I was vaguely imagining was
 
   -- outer transaction for whole COPY
 
   -- sub-transactions that are children of outer transaction
 
   -- sub-sub-transactions that are children of sub-transactions
 
 You'd eat a sub-sub-transaction per row, and start a new sub-transaction
 every 2^32 rows.
 
 However, on second thought this really doesn't get us anywhere, it just
 moves the 2^32 restriction somewhere else.  Once the outer transaction
 gets to be more than 2^31 XIDs old, the database is going to stop
 because of XID wraparound.
 
 So really we have to find some way to only expend one XID per failure,
 not one per row.

I discovered a few days back that ~550 subtransactions is sufficient to
blow max_stack_depth. 1 subtransaction per error doesn't allow many
errors.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Greg Smith

On Mon, 5 Oct 2009, Josh Berkus wrote:


I think that this was the original idea but we should probably roll back
the error logging if the command has been rolled back. It might be more
consistent to use the same hi_options as the copy command. Any idea what
would be best?


Well, if we're logging to a file, you wouldn't be *able* to roll them
back.  Also, presumably, if you abort a COPY because of errors, you
probably want to keep the errors around for later analysis.  No?


Absolutely, that's the whole point of logging to a file in the first 
place.  What needs to happen here is that when one is aborted, you need to 
make sure that fact is logged, and with enough information (the pid?) to 
tie it to the COPY that failed.  Then someone can crawl the logs to figure 
out what happened and what data did and didn't get loaded.  Ideally you'd 
want to have that as database table information instead, to allow 
automated recovery and re-commit in the cases where the error wasn't 
inherent in the data but instead some other type of database failure.


I know this patch is attracting more reviewers lately; is anyone tracking 
the general architecture of the code yet?  Emmanuel's work is tough to 
review just because there are so many things mixed together, and there 
are other inputs I think should be considered at the same time while we're all 
testing in there (such as the COPY patch Andrew Dunstan put together).


What I'd like to see is for everything to get broken more into component 
chunks that can get committed and provide something useful one at a time, 
because I doubt taskmaster Robert is going to let this one linger around 
with scope creep for too long before being pushed out to the next 
CommitFest.  It would be nice to have a clear game plan that involves the 
obvious 3 smaller subpatches to focus the work on, each committed as it's 
ready, so that something might be polished enough for 
this CF.  And, yes, I'm suggesting work only because I now have some time 
to help with that if the idea seems reasonable to pursue.  I'd be glad to 
set up a public git repo or something to serve as an integration point for 
dependency merge management and testing that resists bit-rot while 
splitting things up functionally.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Simon Riggs

On Wed, 2009-10-07 at 03:17 -0400, Greg Smith wrote:
 On Mon, 5 Oct 2009, Josh Berkus wrote:
 
  Also, presumably, if you abort a COPY because of errors, you
  probably want to keep the errors around for later analysis.  No?
 
 Absolutely, that's the whole point of logging to a file in the first 
 place. 

Yes, essential.

(I've not read the patch, so some later comments may misunderstand where
we're at.)

  What needs to happen here is that when one is aborted, you need to 
 make sure that fact is logged, and with enough information (the pid?) to 
 tie it to the COPY that failed.  Then someone can crawl the logs to figure 
 out what happened and what data did and didn't get loaded.  Ideally you'd 
 want to have that as database table information instead, to allow 
 automated recovery and re-commit in the cases where the error wasn't 
 inherent in the data but instead some other type of database failure.

The logs are the wrong place for that, IMHO. If we do
write a log message it should be a NOTICE, not a LOG.

If the user specifies an error file then it should be straightforward
for them to look directly in that error file afterwards to see if there
are rejects. It's typical to only create the error file if rejects
exist, to allow a simple if-file-exists test after the COPY. 

I don't recommend having the server automatically generate an error file
name because that will encourage conflicts between multiple users on
production systems. If the user wants an error file they should specify
it; that way they will know the name.

It will be best to have the ability to have a specific rejection reason
for each row rejected. That way we will be able to tell the difference
between uniqueness violation errors, invalid date format on col7, value
fails check constraint on col22 etc.. 

An implicit "I got an error on this record" isn't enough information and
can lead to chaos, since you may be getting more than one error on a
record and we need to be able to fix them one at a time. If we can't
tell what the error message is then we might think our first valid fix
actually failed and start looking for other solutions.

There are security implications of writing stuff to an error file, so
will the error file option still be available when we do \copy or COPY
FROM STDIN, and if so how will it work? With what restrictions?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Robert Haas
On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith gsm...@gregsmith.com wrote:
 I know this patch is attracting more reviewers lately; is anyone tracking
 the general architecture of the code yet?  Emmanuel's work is tough to
 review just because there are so many things mixed together, and there are other
 inputs I think should be considered at the same time while we're all testing
 in there (such as the COPY patch Andrew Dunstan put together).

I hadn't realized this was an issue, but I think it's a good point: a
patch that does one thing well is much more likely to get accepted
than a patch that does two things well, let alone two things poorly.
It's just much easier to review and verify.  Or maybe the name of the
patch should have tipped me off: "COPY enhancements" vs. "make
COPY have feature X".

 What I'd like to see is for everything to get broken more into component
 chunks that can get committed and provide something useful one at a time,
 because I doubt taskmaster Robert is going to let this one linger around
 with scope creep for too long before being pushed out to the next
 CommitFest.

I can't decide whether to feel good or bad about that appellation, so
I'm going with both.  But in all seriousness, if this patch needs
substantial reworking (which it sounds like it does), we should
postpone it to the next CF; we are quickly running out of days, and
it's not fair to reviewers or committers to ask for new reviews of
substantially revised code with only a week to go.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Emmanuel Cecchet

Hi all,

I think there is a misunderstanding about what the current patch is about.
The patch includes 2 things:
- error logging in a table for bad tuples in a COPY operation (see 
http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; 
the error message, command and so on are automatically logged)
- auto-partitioning in a hierarchy of child tables if the COPY targets a 
parent table.


The patch does NOT include:
- logging errors into a file (a feature we can add later on (next commit 
fest?))


I can put the auto-partitioning in a separate patch if that helps but 
this patch relies on error logging to log possible errors in the routing 
of tuples.


I think that the way to move forward is first to have a basic error 
logging facility that logs into a database table (like the current patch 
does) and then add enhancements. I don't think we should try to do the 
file logging at the same time. If you prefer to postpone the 
auto-partitioning to the next commit fest, I can strip it from the 
current patch and re-submit it for the next fest (but it's just 2 
isolated methods really easy to review).


Emmanuel


Robert Haas wrote:

On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith gsm...@gregsmith.com wrote:
  

I know this patch is attracting more reviewers lately; is anyone tracking
the general architecture of the code yet?  Emmanuel's work is tough to
review just because there are so many things mixed together, and there are other
inputs I think should be considered at the same time while we're all testing
in there (such as the COPY patch Andrew Dunstan put together).



I hadn't realized this was an issue, but I think it's a good point: a
patch that does one thing well is much more likely to get accepted
than a patch that does two things well, let alone two things poorly.
It's just much easier to review and verify.  Or maybe the name of the
patch should have tipped me off: "COPY enhancements" vs. "make
COPY have feature X".

  

What I'd like to see is for everything to get broken more into component
chunks that can get committed and provide something useful one at a time,
because I doubt taskmaster Robert is going to let this one linger around
with scope creep for too long before being pushed out to the next
CommitFest.



I can't decide whether to feel good or bad about that appellation, so
I'm going with both.  But in all seriousness, if this patch needs
substantial reworking (which it sounds like it does), we should
postpone it to the next CF; we are quickly running out of days, and
it's not fair to reviewers or committers to ask for new reviews of
substantially revised code with only a week to go.

...Robert
  



--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Andrew Dunstan



Emmanuel Cecchet wrote:
If you prefer to postpone the auto-partitioning to the next commit 
fest, I can strip it from the current patch and re-submit it for the 
next fest (but it's just 2 isolated methods really easy to review).





I certainly think this should be separated out. In general it is not a 
good idea to roll distinct features together. It complicates both the 
reviewing process and the discussion.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Dimitri Fontaine
Simon Riggs si...@2ndquadrant.com writes:
 It will be best to have the ability to have a specific rejection reason
 for each row rejected. That way we will be able to tell the difference
 between uniqueness violation errors, invalid date format on col7, value
 fails check constraint on col22 etc.. 

In case that helps, what pgloader does is log into two files, named
after the table name (not scalable to a server-side solution):
  table.rej --- lines it could not load, straight from the source file
  table.rej.log --- errors as given by the server, plus a pgloader comment

The pgloader comment is necessary for associating each log line with the
source file line: since pgloader operates by dichotomy, the server always
reports the error on line 1.

The idea of having two error files could be kept, though; the aim is to
be able to fix the setup and then COPY the table.rej file again when it
turns out the errors are not in the file content. Or to load it into
another table, with all columns as text or bytea, and then clean the data
from a procedure.

Regards,
-- 
dim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Simon Riggs

On Wed, 2009-10-07 at 15:33 +0200, Dimitri Fontaine wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  It will be best to have the ability to have a specific rejection reason
  for each row rejected. That way we will be able to tell the difference
  between uniqueness violation errors, invalid date format on col7, value
  fails check constraint on col22 etc.. 
 
 In case that helps, what pgloader does is log into two files, named
 after the table name (not scalable to a server-side solution):
   table.rej --- lines it could not load, straight from the source file
   table.rej.log --- errors as given by the server, plus a pgloader comment
 
 The pgloader comment is necessary for associating each log line with the
 source file line: since pgloader operates by dichotomy, the server always
 reports the error on line 1.
 
 The idea of having two error files could be kept, though; the aim is to
 be able to fix the setup and then COPY the table.rej file again when it
 turns out the errors are not in the file content. Or to load it into
 another table, with all columns as text or bytea, and then clean the data
 from a procedure.

I like it.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Robert Haas
On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet m...@asterdata.com wrote:
 Hi all,

 I think there is a misunderstanding about what the current patch is about.
 The patch includes 2 things:
 - error logging in a table for bad tuples in a COPY operation (see
 http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
 error message, command and so on are automatically logged)
 - auto-partitioning in a hierarchy of child tables if the COPY targets a
 parent table.
 The patch does NOT include:
 - logging errors into a file (a feature we can add later on (next commit
 fest?))

My failure to have read the patch is showing here, but it seems to me
that error logging to a table could be problematic: if the transaction
aborts, we'll lose the log.  If this is in fact a problem, we should
be implementing logging to a file (or stdout) FIRST.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 Emmanuel Cecchet wrote:
 If you prefer to postpone the auto-partitioning to the next commit 
 fest, I can strip it from the current patch and re-submit it for the 
 next fest (but it's just 2 isolated methods really easy to review).

 I certainly think this should be separated out. In general it is not a 
 good idea to roll distinct features together. It complicates both the 
 reviewing process and the discussion.

I think though that Greg was suggesting that we need some more thought
about the overall road map.  Agglomerating independent features onto
COPY one at a time is going to lead to a mess, unless they fit into an
overall design plan.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:
  

Emmanuel Cecchet wrote:

If you prefer to postpone the auto-partitioning to the next commit 
fest, I can strip it from the current patch and re-submit it for the 
next fest (but it's just 2 isolated methods really easy to review).
  


  
I certainly think this should be separated out. In general it is not a 
good idea to roll distinct features together. It complicates both the 
reviewing process and the discussion.



I think though that Greg was suggesting that we need some more thought
about the overall road map.  Agglomerating independent features onto
COPY one at a time is going to lead to a mess, unless they fit into an
overall design plan.


  


I don't disagree with that. But even if we get a roadmap of some set of 
features we want to implement, rolling them all together isn't a good 
way to go.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Emmanuel Cecchet

Robert Haas wrote:

On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet m...@asterdata.com wrote:
  

Hi all,

I think there is a misunderstanding about what the current patch is about.
The patch includes 2 things:
- error logging in a table for bad tuples in a COPY operation (see
http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
error message, command and so on are automatically logged)
- auto-partitioning in a hierarchy of child tables if the COPY targets a
parent table.
The patch does NOT include:
- logging errors into a file (a feature we can add later on (next commit
fest?))



My failure to have read the patch is showing here, but it seems to me
that error logging to a table could be problematic: if the transaction
aborts, we'll lose the log.  If this is in fact a problem, we should
be implementing logging to a file (or stdout) FIRST.
  
I don't think this is really a problem. You can always do a SELECT on 
the error table after the COPY operation if you want to diagnose what 
happened before the transaction rolls back.  I think that using a file to 
process the bad tuples is actually less practical than a table if you 
want to re-insert these tuples in the database.
But anyway, the current implementation captures the tuple with all the 
needed information for logging and delegates the logging to a specific 
method. If we want to log to a file (or stdout), we can just provide 
another method to log the already captured info in a file.
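
For concreteness, the table-driven repair loop described above could look 
like this (the error-table name and columns are illustrative, not the 
patch's actual schema):

    -- see what was rejected and why
    SELECT key, error_message, bad_tuple FROM copy_errors;

    -- after fixing the root cause, replay the saved tuples into the target;
    -- this assumes bad_tuple was stored in a form castable to foo's row type
    INSERT INTO foo
        SELECT (bad_tuple::foo).* FROM copy_errors;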


I think that the file approach is a separate feature but can be easily 
integrated in the current design. There is just extra work to make sure 
concurrent COPY commands writing to the same error file are using proper 
locking.


Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Robert Haas
On Wed, Oct 7, 2009 at 11:39 AM, Emmanuel Cecchet m...@asterdata.com wrote:
 Robert Haas wrote:

 On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet m...@asterdata.com
 wrote:


 Hi all,

 I think there is a misunderstanding about what the current patch is
 about.
 The patch includes 2 things:
 - error logging in a table for bad tuples in a COPY operation (see
 http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
 error message, command and so on are automatically logged)
 - auto-partitioning in a hierarchy of child tables if the COPY targets a
 parent table.
 The patch does NOT include:
 - logging errors into a file (a feature we can add later on (next commit
 fest?))


 My failure to have read the patch is showing here, but it seems to me
 that error logging to a table could be problematic: if the transaction
 aborts, we'll lose the log.  If this is in fact a problem, we should
 be implementing logging to a file (or stdout) FIRST.


 I don't think this is really a problem. You can always do a SELECT on the
 error table after the COPY operation if you want to diagnose what happened
 before the transaction rolls back.

Uhhh if the transaction has aborted, no new commands (including
SELECT) will be accepted until you roll back.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Emmanuel Cecchet
The roadmap I would propose for the current list of enhancements to COPY 
is as follows:

1. new syntax for COPY options (already committed)
2. error logging in a table
3. auto-partitioning (just relies on basic error logging, so can be 
scheduled anytime after 2)

4. error logging in a file

manu

Andrew Dunstan wrote:

Tom Lane wrote:
  

Andrew Dunstan and...@dunslane.net writes:
  


Emmanuel Cecchet wrote:

  
If you prefer to postpone the auto-partitioning to the next commit 
fest, I can strip it from the current patch and re-submit it for the 
next fest (but it's just 2 isolated methods really easy to review).
  

  

I certainly think this should be separated out. In general it is not a 
good idea to roll distinct features together. It complicates both the 
reviewing process and the discussion.

  

I think though that Greg was suggesting that we need some more thought
about the overall road map.  Agglomerating independent features onto
COPY one at a time is going to lead to a mess, unless they fit into an
overall design plan.


  



I don't disagree with that. But even if we get a roadmap of some set of 
features we want to implement, rolling them all together isn't a good 
way to go.


cheers

andrew


  



--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Emmanuel Cecchet

Robert Haas wrote:

On Wed, Oct 7, 2009 at 11:39 AM, Emmanuel Cecchet m...@asterdata.com wrote:
  

Robert Haas wrote:


On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet m...@asterdata.com
wrote:

  

Hi all,

I think there is a misunderstanding about what the current patch is
about.
The patch includes 2 things:
- error logging in a table for bad tuples in a COPY operation (see
http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
error message, command and so on are automatically logged)
- auto-partitioning in a hierarchy of child tables if the COPY targets a
parent table.
The patch does NOT include:
- logging errors into a file (a feature we can add later on (next commit
fest?))



My failure to have read the patch is showing here, but it seems to me
that error logging to a table could be problematic: if the transaction
aborts, we'll lose the log.  If this is in fact a problem, we should
be implementing logging to a file (or stdout) FIRST.

  

I don't think this is really a problem. You can always do a SELECT on the
error table after the COPY operation if you want to diagnose what happened
before the transaction rolls back.



Uhhh if the transaction has aborted, no new commands (including
SELECT) will be accepted until you roll back.
  
You are suggesting then that it is the COPY command that aborts the 
transaction. That would only happen if you had set a limit on the number 
of errors that you want to accept in a COPY command (in which case you 
know that there is something really wrong with your input file and you 
don't want the server to process that file).
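
Concretely, with the options this patch defines (ERROR_LOGGING and 
ERROR_LOGGING_SKIP_BAD_ROWS appear in its regression tests, 
ERROR_LOGGING_MAX_ERRORS in its change list; the cap of 100 and the exact 
argument form are illustrative):

    COPY foo FROM '/path/to/data.txt'
        (ERROR_LOGGING, ERROR_LOGGING_SKIP_BAD_ROWS,
         ERROR_LOGGING_MAX_ERRORS 100);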


manu

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Emmanuel Cecchet

Greg Smith wrote:
Absolutely, that's the whole point of logging to a file in the first 
place.  What needs to happen here is that when one is aborted, you 
need to make sure that fact is logged, and with enough information 
(the pid?) to tie it to the COPY that failed.  Then someone can crawl 
the logs to figure out what happened and what data did and didn't get 
loaded.  Ideally you'd want to have that as database table information 
instead, to allow automated recovery and re-commit in the cases where 
the error wasn't inherent in the data but instead some other type of 
database failure.
If one is aborted, nothing gets loaded anyway. Now if you want a 
separate log of the successful commands, I don't think that should apply 
just to COPY but to any operation (insert, update, DDL, ...) if you want 
to build some automatic retry mechanism. But this seems independent of 
COPY to me.


manu

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Robert Haas
On Wed, Oct 7, 2009 at 11:45 AM, Emmanuel Cecchet m...@asterdata.com wrote:
 You are suggesting then that it is the COPY command that aborts the
 transaction. That would only happen if you had set a limit on the number of
 errors that you want to accept in a COPY command (in which case you know
 that there is something really wrong with your input file and you don't want
 the server to process that file).

But I still want my log even if the COPY bombs out.  I want to report
the first group of errors back to my user...

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Greg Smith

On Wed, 7 Oct 2009, Emmanuel Cecchet wrote:

I think there is a misunderstanding about what the current patch is 
about...the patch does NOT include logging errors into a file (a feature 
we can add later on (next commit fest?))


I understand that (as one of the few people who has read the patch, it 
seems), but as you might guess from the feedback here, that's not the 
way I expect your patch is actually going to be handled.  Something that 
logs only to a table, without the interface for the file output at least 
specified, isn't the sort of thing that this group tends to go along with 
very well.  Since as it is we haven't even nailed down how the file output 
is going to look or work yet, the table logging isn't very likely to 
go in unless it's at least clear that the future plans there will cleanly 
stack on top.  That's what people are alluding to when mentioning the whole 
roadmap concept for all this.


I get the impression you were hoping to get another chunk of this committed 
on this round.  I would guess that realistically it's going to take at 
least a few more weeks for the rest of this to get nailed down, and that a 
really solid feature should be ready by the next CF instead.  You should 
actually be proud of the progress you spurred on the COPY options stuff 
that's in there now; that idea has been floating around for a while now 
but until your patch there wasn't any momentum behind doing something 
about it.  The problem you're facing now is that so many people have been 
thinking about this particular area for so long without any code to talk 
about that you're now stuck with all that pent-up uncoded design to clear.


I can put the auto-partitioning in a separate patch if that helps but this 
patch relies on error logging to log possible errors in the routing of 
tuples.


Understood.  I know you've gotten some other coding feedback here, and 
you've been very good about taking that and producing updated versions 
which is appreciated.  I would recommend that the next time you resubmit 
this, if you could break the logging and partitioning patches into a pair 
that depend on one another, that would make life easier for everyone else 
(albeit probably harder for you).  At that point we should be set to 
have some others take over some of that tweaking, with myself, Selena, and 
Jeff all interested and capable of helping out here.


If you prefer to postpone the auto-partitioning to the next commit fest, 
I can strip it from the current patch and re-submit it for the next fest 
(but it's just 2 isolated methods really easy to review).


That was one of the points I was trying to make--that it would be easier to 
review by itself, but as it is people interested in it can't consider it 
without also staring at the logging stuff.  And people who are focusing on 
the logging bits find it distracting, so nobody is really happy with the 
current combined patch.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Greg Smith

On Wed, 7 Oct 2009, Robert Haas wrote:


On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith gsm...@gregsmith.com wrote:

I doubt taskmaster Robert is going to let this one linger around with 
scope creep for too long before being pushed out to the next 
CommitFest.


I can't decide whether to feel good or bad about that appellation, so
I'm going with both.


That was a compliment on your project management skills.  Keeping the CF 
work moving forward steadily is both unglamorous and extremely valuable, 
and I don't think anyone else even understands why you've volunteered to 
handle so much of it.  But I know I appreciate it.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Robert Haas
On Wed, Oct 7, 2009 at 7:52 PM, Greg Smith gsm...@gregsmith.com wrote:
 On Wed, 7 Oct 2009, Robert Haas wrote:

 On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith gsm...@gregsmith.com wrote:

 I doubt taskmaster Robert is going to let this one linger around with
 scope creep for too long before being pushed out to the next CommitFest.

 I can't decide whether to feel good or bad about that appellation, so
 I'm going with both.

 That was a compliment on your project management skills.  Keeping the CF
 work moving forward steadily is both unglamorous and extremely valuable, and
 I don't think anyone else even understands why you've volunteered to handle
 so much of it.  But I know I appreciate it.

Thanks.  I'm not sure I completely understand it, either.  I'm sort of
hoping someone else will step up to the plate, but in the absence of
such a person I kind of hate to leave it hanging.  Some defective part
of my brain enjoys seeing things run smoothly more than it enjoys
being lazy.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-07 Thread Robert Haas
On Fri, Sep 25, 2009 at 10:01 AM, Emmanuel Cecchet m...@asterdata.com wrote:
 Robert,

 Here is the new version of the patch that applies to CVS HEAD as of this
 morning.

 Emmanuel

I took a look at this patch tonight and, having now read through some
of it, I have some more detailed comments.

It seems quite odd to me that when COPY succeeds but there are errors,
the transaction commits.  The only indication that some of my data
didn't end up in the table is that the output says COPY n where n is
less than the total number of rows I attempted to copy.  On the other
hand, it would be equally bad if the transaction aborted, because then
my error logging table would not get populated - I note that that's
the behavior we do get if the max errors threshold is exceeded.  I'm
not sure what the right answer is here, but it needs some thought and
discussion.  I think at a minimum the caller needs some indication of
the number of FAILED rows, not just the number of successful ones.

What's really bad about this is that a flag called error_logging is
actually changing the behavior of the command in a way that is far
more dramatic than (and doesn't actually have much to do with) error
logging.  It's actually making a COPY command succeed that would
otherwise have failed.  That's not just a logging change.

I think the underlying problem is that there is more than one feature
here, and they're kind of mixed together.  The fact that the
partitioning code needs to be separated from the error logging stuff
has already been discussed, but I think it actually needs to be broken
down even further.  I think there are three separate features in the
error logging code:

A. Ability to ignore a certain number of errors (of certain types?)
and still get the other tuples into the table anyway.
B. Ability to return error information in a structured format rather
than as an exception message.
C. Ability to direct the structured error messages to a table.

Each of those features deserves a separate discussion to decide
whether we want it and how best to implement it.  Personally, I think
we should skip (C), at least as a starting point.  Instead of logging
to a table, I think we should consider making COPY return the tuples
indicating the error as a query result, the same way EXPLAIN returns
the query plan.  It looks like this would eliminate the need for at
least three of the new COPY options without losing much functionality.
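
No such interface exists yet; purely as illustration, the EXPLAIN-style 
variant being described might look like this (the RETURN_ERRORS option 
name and the result columns are invented):

    COPY foo FROM '/path/to/data.txt' (RETURN_ERRORS);
      line |                     error_message
     ------+-------------------------------------------------------
         2 | invalid input syntax for type integer: "twwo"
        17 | null value in column "id" violates not-null constraint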

I also think that, if possible, (A) and (B) should be implemented as
separate features, so that each one can be used separately or both can
be used in combination.

Some other notes:

1. The copy_errorlogging.out regression test fails, at least on my
machine.  I guess this problem was discussed previously, but perhaps
you should change the setup of the test in some way so that you can
generate a patch that doesn't require manual fiddling to get the
regression tests to pass.

2. The error logging doesn't seem to work in all cases.

rhaas=# copy foo from '/home/rhaas/x' (error_logging,
error_logging_table_name 'errlogging');
ERROR:  duplicate key value violates unique constraint "foo_pkey"
DETAIL:  Key (id)=(1) already exists.
CONTEXT:  COPY foo, line 1: "1  Robert"

I assume that this fails because the code is trained to catch some
specific category of errors and redirect those to the error logging
table while letting everything else take its natural course.  I don't feel
that that's a terribly useful behavior, and on the other hand, I'm not
sure it's going to be possible to fix it without killing performance.

3. The option checking for the error_logging options is insufficient.
For example, if ERROR_LOGGING_TABLE_NAME is specified but
ERROR_LOGGING is not specified, COPY still works (and just ignores the
ERROR_LOGGING_TABLE_NAME). Note that this is quite different from what
HEADER does, for example.
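
For reference, the model behavior being pointed to is that an option which 
cannot apply is rejected outright rather than silently ignored:

    COPY foo FROM '/tmp/foo.txt' (HEADER);
    ERROR:  COPY HEADER available only in CSV mode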

4. Specifying an error_logging_table_name with spaces or other funny
characters in it fails.

COPY foo FROM STDIN (ERROR_LOGGING, ERROR_LOGGING_TABLE_NAME 'foo bar baz');

5. The assumption that the public schema is an appropriate default
does not seem right to me.  I would think that we would want to
default to the first schema in the user's search path.

6. errorlogging.c is poorly named, because it has specifically to do
with logging errors in COPY, rather than with error logging in
general; there's also no reason for it to live in utils/misc.  Also,
it does not include the same headers as the other files in the same
directory.

7. There's no need for the OidLinkedList structure; we already have a
system for handling lists of OIDs.  See pg_list.h.

8. The OID cache is interesting; have you benchmarked this to see how
much it improves performance?  If so, you should describe your
methodology and performance results.  It might be worthwhile to leave
this out of the first version of the partitioning patch for simplicity
and add submit as a follow-on performance patch.  Do we really need a
GUC to size this cache, or 

Re: [HACKERS] COPY enhancements

2009-10-06 Thread Emmanuel Cecchet

I just realized that I forgot to CC the list when I answered Josh... 
resending!

Josh,



 I think that this was the original idea but we should probably roll back
 the error logging if the command has been rolled back. It might be more
 consistent to use the same hi_options as the copy command. Any idea what
 would be best?
 



 Well, if we're logging to a file, you wouldn't be *able* to roll them
 back.  Also, presumably, if you abort a COPY because of errors, you
 probably want to keep the errors around for later analysis.  No?
   
  
You are right about the file (though this functionality is not 
implemented yet).
I don't have any strong opinion about the right behavior if COPY aborts. 
The error log table would contain tuples that were not inserted anyway. 
Now knowing whether the bad tuples come from a successful COPY command 
or not will be left to the user.


Emmanuel
--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com



Re: [HACKERS] COPY enhancements

2009-10-05 Thread Emmanuel Cecchet

Hi Selena,


This is my first pass at the error logging portion of this patch. I'm
going to take a break and try to go through the partitioning logic as
well later this afternoon.

caveat: I'm not familiar with most of the code paths that are being
touched by this patch.

Overall:
* I noticed '\see' included in the comments in your code. These should
be removed.
  

Should we avoid any doxygen tags in general in the Postgres code?

* The regression is failing, as Jeff indicated, and I didn't figure
out why yet either. Hopefully will have a look closer this afternoon.
  

Let me know if the response I sent to Jeff works for you.

Comments:
* copy.c: Better formatting, maybe rewording needed for comment
starting on line 1990.
  
Some of the bad formatting has been left on purpose to minimize the size 
of the patch; otherwise there would have been many tab/whitespace 
changes due to the indentation in the PG_TRY blocks. I suggest that 
whoever is going to commit the code runs pgindent, or I can fix the 
formatting once the reviewing is done.

** Maybe say: Check that errorData->sqlerrcode only logged tuples
that are malformed. This ensures that we let other errors pass
through.
  

Ok.

* copy.c: line: 2668 - need to fix that comment :) (/* Regular code
*/) -- this needs to be more descriptive of what is actually
happening. We fell through the partitioning logic, right, and back to
the default behavior?
  

Yes.

* copy.c: line 2881, maybe say instead: Mark that we have not read a
line yet. This is necessary so that we can perform error logging on
complete lines only.
  

Ok.

Formatting:
* copy.c: whitespace is maybe a little off at line: 2004-2009
* NEW FILES: src/backend/utils/misc/errorlogging.c and errorlogging.h need headers
  
I was not sure what is auto-generated by SVN commit scripts. I'd be 
happy to add headers. Is there a specification somewhere or should I 
just copy/paste from another file?

Code:
* copy.c: line 1990 - cur_lineno_str[11] and related: why is this
conversion necessary? (sorry if that is a dumb question)
  
This conversion is necessary if you log, in the error table, the index of 
the row that is causing the error (this is what is logged by default in 
ERROR_LOGGING_KEY).

* copy.c: line 2660 - what if we are error logging for copy? Does
this get logged properly?
  

Yes, this is in a try/catch block that performs error logging.

* errorlogging.c: Noticed the comment at 503 - 'note that we always
enable wal logging'.. Thinking this through - the reasoning is that
you won't be rolling back the error logging no matter what, right?
  
I think that this was the original idea but we should probably roll back 
the error logging if the command has been rolled back. It might be more 
consistent to use the same hi_options as the copy command. Any idea what 
would be best?

* src/include/commands/copy.c and copy.h: struct CopyStateData was
moved from copy.c to copy.h; this made sense to me, just noting. That
move made it a little tricky to find the changes that were made. There
were 10 items added.
  
Yes, patch can get a little bit lost when you move a big data structure 
like this one.

** super nit pick: 'readlineOk' uses camel-case instead of underscores
like the rest of the new variables
  

Yes, queryDesc also has camel-case. I will fix readlineOk.

* errorlogging.c: could move currentCommand pg_stat call in Init
routine into MalformedTupleIs, or maybe move the setting of that
parameter into the Init routine for consistency's sake.
  
The advantage of calling pg_stat in InitializeErrorLogging is that it 
is evaluated only once for the entire copy command (and it's not going 
to change during the execution of the command). I am not sure I 
understand what your second suggestion is, since currentCommand is set 
and initialized in Init.

Documentation:
* doc/src/sgml/ref/copy.sgml: line 313: 'failif' needs a space
  

ok

** Also: The error table log examples have relations shown in a
different order than the actual CREATE TABLE declaration in the code.
  
The order of the columns does not really matter, but for consistency's sake 
we can use the same order.

* src/test/regress/sql/copyselect.sql: Format of options changed...
added parentheses. Is this change documented elsewhere? (sorry if I
just missed this in the rest of the thread/patch)
  
Yes this has already been committed by Tom. The new format of options 
has been introduced just before this patch.


I am attaching a revised version of the patch.
Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com



aster-copy-newsyntax-patch-8.5v6context.txt.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-05 Thread Josh Berkus
Emmanuel,

 I think that this was the original idea but we should probably roll back
 the error logging if the command has been rolled back. It might be more
 consistent to use the same hi_options as the copy command. Any idea what
 would be best?

Well, if we're logging to a file, you wouldn't be *able* to roll them
back.  Also, presumably, if you abort a COPY because of errors, you
probably want to keep the errors around for later analysis.  No?

--Josh Berkus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-04 Thread Jeff Davis
On Fri, 2009-09-25 at 10:01 -0400, Emmanuel Cecchet wrote:
 Robert,
 
 Here is the new version of the patch that applies to CVS HEAD as of this 
 morning.

I just started looking at this now. It seems to fail make check, diffs
attached. I haven't looked into the cause of the failure yet.

Regards,
Jeff Davis
*** /home/jdavis/wd/git/postgresql/src/test/regress/expected/copy_errorlogging.out	2009-10-04 10:24:15.0 -0700
--- /home/jdavis/wd/git/postgresql/src/test/regress/results/copy_errorlogging.out	2009-10-04 10:24:32.0 -0700
***
*** 46,58 
  
  -- both \n and \r\n present
  COPY foo FROM '/home/jdavis/wd/git/postgresql/src/test/regress/data/foo_malformed_terminator.data';
- ERROR:  literal carriage return found in data
- HINT:  Use \r to represent carriage return.
- CONTEXT:  COPY foo, line 2: 
  SELECT count(*) from foo;
   count 
   ---
!   0
   (1 row)
  
  -- create error logging table
--- 46,55 
  
  -- both \n and \r\n present
  COPY foo FROM '/home/jdavis/wd/git/postgresql/src/test/regress/data/foo_malformed_terminator.data';
  SELECT count(*) from foo;
   count 
  ---
!  5
  (1 row)
  
  -- create error logging table
***
*** 75,81 
  SELECT count(*) from foo;
   count 
   ---
!   4
   (1 row)
  
  DELETE FROM foo;
--- 72,78 
  SELECT count(*) from foo;
   count 
  ---
!  9
  (1 row)
  
  DELETE FROM foo;
***
*** 101,113 
  DELETE FROM foo;
  -- error logging skipping tuples: both \n and \r\n present
  COPY foo FROM '/home/jdavis/wd/git/postgresql/src/test/regress/data/foo_malformed_terminator.data' (ERROR_LOGGING, ERROR_LOGGING_SKIP_BAD_ROWS);
- ERROR:  literal carriage return found in data
- HINT:  Use \r to represent carriage return.
- CONTEXT:  COPY foo, line 2: 
  SELECT count(*) from foo;
   count 
   ---
!   0
   (1 row)
  
  DELETE FROM foo;
--- 98,107 
  DELETE FROM foo;
  -- error logging skipping tuples: both \n and \r\n present
  COPY foo FROM '/home/jdavis/wd/git/postgresql/src/test/regress/data/foo_malformed_terminator.data' (ERROR_LOGGING, ERROR_LOGGING_SKIP_BAD_ROWS);
  SELECT count(*) from foo;
   count 
  ---
!  5
  (1 row)
  
  DELETE FROM foo;

==


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-04 Thread Selena Deckelmann
Hi!

On Fri, Sep 25, 2009 at 7:01 AM, Emmanuel Cecchet m...@asterdata.com wrote:

 Here is the new version of the patch that applies to CVS HEAD as of this
 morning.

Cool features!

This is my first pass at the error logging portion of this patch. I'm
going to take a break and try to go through the partitioning logic as
well later this afternoon.

caveat: I'm not familiar with most of the code paths that are being
touched by this patch.

Overall:
* I noticed '\see' included in the comments in your code. These should
be removed.
* The regression is failing, as Jeff indicated, and I didn't figure
out why yet either. Hopefully will have a look closer this afternoon.

Comments:
* copy.c: Better formatting, maybe rewording needed for comment
starting on line 1990.
** Maybe say: Check that errorData->sqlerrcode only logged tuples
that are malformed. This ensures that we let other errors pass
through.
* copy.c: line: 2668 - need to fix that comment :) (/* Regular code
*/) -- this needs to be more descriptive of what is actually
happening. We fell through the partitioning logic, right, and back to
the default behavior?
* copy.c: line 2881, maybe say instead: Mark that we have not read a
line yet. This is necessary so that we can perform error logging on
complete lines only.

Formatting:
* copy.c: whitespace is maybe a little off at line: 2004-2009
* NEW FILES: src/backend/utils/misc/errorlogging.c and errorlogging.h need headers

Code:
* copy.c: line 1990 - cur_lineno_str[11] and related: why is this
conversion necessary? (sorry if that is a dumb question)
* copy.c: line 2660 - what if we are error logging for copy? Does
this get logged properly?
* errorlogging.c: Noticed the comment at 503 - 'note that we always
enable wal logging'.. Thinking this through - the reasoning is that
you won't be rolling back the error logging no matter what, right?
* src/include/commands/copy.c and copy.h: struct CopyStateData was
moved from copy.c to copy.h; this made sense to me, just noting. That
move made it a little tricky to find the changes that were made. There
were 10 items added.
** super nit pick: 'readlineOk' uses camel-case instead of underscores
like the rest of the new variables
* errorlogging.c: could move currentCommand pg_stat call in Init
routine into MalformedTupleIs, or maybe move the setting of that
parameter into the Init routine for consistency's sake.

Documentation:
* doc/src/sgml/ref/copy.sgml: line 313: 'failif' needs a space
** Also: The error table log examples have relations shown in a
different order than the actual CREATE TABLE declaration in the code.
* src/test/regress/sql/copyselect.sql: Format of options changed;
added parentheses. Is this change documented elsewhere? (sorry if I
just missed this in the rest of the thread/patch)

-- 
http://chesnok.com/daily - me
http://endpoint.com - work

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-10-04 Thread Emmanuel Cecchet

The problem comes from the foo_malformed_terminator.data file.
It is supposed to have a malformed terminator that was not caught by
the patch. The second line should look like:

2   two^M

If it does not, you can edit it with emacs, go to the end of the second
line, and press Ctrl+q followed by Ctrl+m; that will do the trick.
I am attaching a copy of the file, hoping that the attachment will 
arrive with the malformed terminator in your inbox!


manu


Jeff Davis wrote:

On Fri, 2009-09-25 at 10:01 -0400, Emmanuel Cecchet wrote:
  

Robert,

Here is the new version of the patch that applies to CVS HEAD as of this 
morning.



I just started looking at this now. It seems to fail make check, diffs
attached. I haven't looked into the cause of the failure yet.

Regards,
Jeff Davis
  



--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

1   one
2   two
3   three
4   four
5   five

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-25 Thread Emmanuel Cecchet

Robert,

Here is the new version of the patch that applies to CVS HEAD as of this 
morning.


Emmanuel


On Fri, Sep 18, 2009 at 12:14 AM, Emmanuel Cecchet m...@asterdata.com wrote:
  

Here is a new version of error logging and autopartitioning in COPY based on
the latest COPY patch that provides the new syntax for copy options (this
patch also includes the COPY option patch).

New features compared to previous version:
- removed the GUC variables for error logging and use copy options instead
(renamed options as suggested by Josh)
- added documentation and examples (see copy.sgml)
- partitioning also activated by copy option rather than GUC variable
(renamed as well)
- added a new ERROR_LOGGING_MAX_ERRORS option that allows aborting the COPY
operation once a certain number of bad rows have been encountered (see the
example after this list).
- updated unit tests
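
For concreteness, a load using these options might look something like this
(the first two option names appear in the regression tests; the value
syntax for ERROR_LOGGING_MAX_ERRORS is a sketch):

COPY foo FROM '/path/to/foo.data'
    (ERROR_LOGGING, ERROR_LOGGING_SKIP_BAD_ROWS, ERROR_LOGGING_MAX_ERRORS 10);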

I also tried to update the wiki pages but it's late and the doc is probably
better for now.

This addresses most of the comments so far except for the format of the
error table (lack of natural key) and a possible pg_error_table as a default
name.

Emmanuel

Emmanuel Cecchet wrote:


Hi all,

Finally the error logging and autopartitioning patches for COPY that I
presented at PGCon are here!
Error logging is described here:
http://wiki.postgresql.org/wiki/Error_logging_in_COPY
Autopartitioning is described here:
http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY

The attached patch contains both contribs as well as unit tests. I will
submit shortly the new patch at commitfest.postgresql.org.

Thanks in advance for your feedback,
Emmanuel
  


This no longer applies.

...Robert
  



--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com



aster-copy-newsyntax-patch-8.5v5context.txt.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-24 Thread Robert Haas
On Fri, Sep 18, 2009 at 12:14 AM, Emmanuel Cecchet m...@asterdata.com wrote:
 Here is a new version of error logging and autopartitioning in COPY based on
 the latest COPY patch that provides the new syntax for copy options (this
 patch also includes the COPY option patch).

 New features compared to previous version:
 - removed the GUC variables for error logging and use copy options instead
 (renamed options as suggested by Josh)
 - added documentation and examples (see copy.sgml)
 - partitioning also activated by copy option rather than GUC variable
 (renamed as well)
 - added a new ERROR_LOGGING_MAX_ERRORS option that allows aborting the COPY
 operation once a certain number of bad rows have been encountered.
 - updated unit tests

 I also tried to update the wiki pages but it's late and the doc is probably
 better for now.

 This addresses most of the comments so far except for the format of the
 error table (lack of natural key) and a possible pg_error_table as a default
 name.

 Emmanuel

 Emmanuel Cecchet wrote:

 Hi all,

 Finally the error logging and autopartitioning patches for COPY that I
 presented at PGCon are here!
 Error logging is described here:
 http://wiki.postgresql.org/wiki/Error_logging_in_COPY
 Autopartitioning is described here:
 http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY

 The attached patch contains both contribs as well as unit tests. I will
 submit shortly the new patch at commitfest.postgresql.org.

 Thanks in advance for your feedback,
 Emmanuel

This no longer applies.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-24 Thread Emmanuel Cecchet
Yes, I have to update the patch following what Tom already integrated of 
the COPY patch.

I will get a new version posted as soon as I can.

Emmanuel

Robert Haas wrote:

On Fri, Sep 18, 2009 at 12:14 AM, Emmanuel Cecchet m...@asterdata.com wrote:
  

Here is a new version of error logging and autopartitioning in COPY based on
the latest COPY patch that provides the new syntax for copy options (this
patch also includes the COPY option patch).

New features compared to previous version:
- removed the GUC variables for error logging and use copy options instead
(renamed options as suggested by Josh)
- added documentation and examples (see copy.sgml)
- partitioning also activated by copy option rather than GUC variable
(renamed as well)
- added a new ERROR_LOGGING_MAX_ERRORS option that allows aborting the COPY
operation once a certain number of bad rows have been encountered.
- updated unit tests

I also tried to update the wiki pages but it's late and the doc is probably
better for now.

This addresses most of the comments so far except for the format of the
error table (lack of natural key) and a possible pg_error_table as a default
name.

Emmanuel

Emmanuel Cecchet wrote:


Hi all,

Finally the error logging and autopartitioning patches for COPY that I
presented at PGCon are here!
Error logging is described here:
http://wiki.postgresql.org/wiki/Error_logging_in_COPY
Autopartitioning is described here:
http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY

The attached patch contains both contribs as well as unit tests. I will
submit shortly the new patch at commitfest.postgresql.org.

Thanks in advance for your feedback,
Emmanuel
  


This no longer applies.

...Robert
  



--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-19 Thread Bruce Momjian
Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
  It's not as if we don't have the ability to measure performance impact.
   It's reasonable to make a requirement that new options to COPY
  shouldn't slow it down noticeably if those options aren't used.  And we
  can test that, and even make such testing part of the patch review.
 
 Really?  Where is your agreed-on, demonstrated-to-be-reproducible
 benchmark for COPY speed?
 
 My experience is that reliably measuring performance costs in the
 percent-or-so range is *hard*.  It's only after you've added a few of
 them and they start to mount up that it becomes obvious that all those
 insignificant additions really did cost something.
 
 But in any case, I think that having a clear distinction between
 straight data import and data transformation features is a good
 thing.  COPY is already pretty much of an unmanageable monstrosity,
 and continuing to accrete features into it without any sort of structure
 is something we are going to regret.

I have read up on this thread and the new copy syntax thread. I think
there is clearly documented demand for such extensions to COPY.

We are definitely opening the floodgates by allowing COPY to process
invalid data.  I think everyone admits COPY is already quite
complicated, both in its API and C code.

If we are going to add to COPY, I think we need to do it in a way that
has a clean user API, doesn't make the C code any more complicated, and
doesn't introduce a performance impact for people not using these new
features.  If we don't do that, we are going to end up like 'bcp' that
is perpetually buggy, as someone explained.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-14 Thread Emmanuel Cecchet

Greg Smith wrote:

On Fri, 11 Sep 2009, Emmanuel Cecchet wrote:

I guess the problem with extra or missing columns is to make sure
that you know exactly which data belongs to which column, so that you
don't put data in the wrong columns, which is likely to happen if this
is fully automated.


Allowing the extra column case is easy:  everywhere in copy.c you find
the error message "extra data after last expected column", just ignore
the overflow fields rather than rejecting the line just based on
that.  And the default information I mentioned you might want to
substitute for missing columns is already being collected by the code
block with the comment "Get default info if needed".
If I understand correctly, you expect the garbage to be after the last
column. But what if the extra or missing column is somewhere up front or
in the middle? Sometimes a type conflict will help you detect the
problem; sometimes you will just insert garbage. This might call for
another mechanism that logs the lines that are automatically 'adjusted',
so that any mistake made during this automated process can be rolled
back.


Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-14 Thread Andrew Dunstan



Emmanuel Cecchet wrote:

Greg Smith wrote:

On Fri, 11 Sep 2009, Emmanuel Cecchet wrote:

I guess the problem with extra or missing columns is to make sure
that you know exactly which data belongs to which column, so that you
don't put data in the wrong columns, which is likely to happen if
this is fully automated.


Allowing the extra column case is easy:  everywhere in copy.c you find
the error message "extra data after last expected column", just
ignore the overflow fields rather than rejecting the line just based
on that.  And the default information I mentioned you might want to
substitute for missing columns is already being collected by the code
block with the comment "Get default info if needed".
If I understand correctly, you expect the garbage to be after the last
column. But what if the extra or missing column is somewhere up front
or in the middle? Sometimes a type conflict will help you detect the
problem; sometimes you will just insert garbage. This might call for
another mechanism that logs the lines that are automatically
'adjusted', so that any mistake made during this automated process can
be rolled back.






Garbage off to the right is exactly the case that we have. Judging from
what I'm hearing, a number of other people have it too.


Nobody suggests that a facility to ignore extra columns will handle 
every case. It will handle what increasingly appears to be a common case.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-13 Thread Josh Berkus
Tom,

 [ shrug... ]  Everybody in the world is going to want their own little
 problem to be handled in the fast path.  And soon it won't be so fast
 anymore.  I think it is perfectly reasonable to insist that the fast
 path is only for clean data import.

Why?

No, really.

It's not as if we don't have the ability to measure performance impact.
 It's reasonable to make a requirement that new options to COPY
shouldn't slow it down noticeably if those options aren't used.  And we
can test that, and even make such testing part of the patch review.

But ... fault-tolerant COPY is one of our biggest user
requests/complaints.  At user group meetings and the like, I get asked
about it probably every third gathering of users I'm at.  While it's not
as critical as log-based replication, it's also not nearly as hard to
integrate and review.

I fully support the idea that we need to have the extended syntax for
these new COPY options.  But we should make COPY take an alternate path
for fault-tolerant COPY only if it's shown that adding these options
slows down database restore.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-13 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 It's not as if we don't have the ability to measure performance impact.
  It's reasonable to make a requirement that new options to COPY
 shouldn't slow it down noticeably if those options aren't used.  And we
 can test that, and even make such testing part of the patch review.

Really?  Where is your agreed-on, demonstrated-to-be-reproducible
benchmark for COPY speed?

My experience is that reliably measuring performance costs in the
percent-or-so range is *hard*.  It's only after you've added a few of
them and they start to mount up that it becomes obvious that all those
insignificant additions really did cost something.

But in any case, I think that having a clear distinction between
straight data import and data transformation features is a good
thing.  COPY is already pretty much of an unmanageable monstrosity,
and continuing to accrete features into it without any sort of structure
is something we are going to regret.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-13 Thread Andrew Dunstan



Tom Lane wrote:

Josh Berkus j...@agliodbs.com writes:
  

It's not as if we don't have the ability to measure performance impact.
 It's reasonable to make a requirement that new options to COPY
shouldn't slow it down noticeably if those options aren't used.  And we
can test that, and even make such testing part of the patch review.



Really?  Where is your agreed-on, demonstrated-to-be-reproducible
benchmark for COPY speed?

My experience is that reliably measuring performance costs in the
percent-or-so range is *hard*.  It's only after you've added a few of
them and they start to mount up that it becomes obvious that all those
insignificant additions really did cost something.
  


Well, I strongly suspect that the cost of the patch I'm working with is 
not in the percent-or-so range, and much more likely to be in the 
tiny-fraction-of-a-percent range. The total overhead in the non-ragged 
case is one extra test per field, plus one per null field, plus two 
tests per line.


But since you raise the question I'll conduct some tests and then you 
can criticize those. Ruling out tests a priori seems a bit extreme.


The current patch is attached for information (and in case anyone else 
wants to try some testing).


cheers

andrew
Index: src/backend/commands/copy.c
===
RCS file: /cvsroot/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.316
diff -c -r1.316 copy.c
*** src/backend/commands/copy.c	29 Jul 2009 20:56:18 -0000	1.316
--- src/backend/commands/copy.c	13 Sep 2009 02:57:16 -0000
***
*** 116,121 
--- 116,122 
  	char	   *escape;			/* CSV escape char (must be 1 byte) */
  	bool	   *force_quote_flags;		/* per-column CSV FQ flags */
  	bool	   *force_notnull_flags;	/* per-column CSV FNN flags */
+ 	bool		ragged;			/* allow ragged CSV input? */
  
  	/* these are just for error messages, see copy_in_error_callback */
  	const char *cur_relname;	/* table name for error messages */
***
*** 822,827 
--- 823,836 
  		 errmsg("conflicting or redundant options")));
  			force_notnull = (List *) defel->arg;
  		}
+ 		else if (strcmp(defel->defname, "ragged") == 0)
+ 		{
+ 			if (cstate->ragged)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_SYNTAX_ERROR),
+ 						 errmsg("conflicting or redundant options")));
+ 			cstate->ragged = intVal(defel->arg);
+ 		}
  		else
  			elog(ERROR, "option \"%s\" not recognized",
  				 defel->defname);
***
*** 948,953 
--- 957,972 
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  				 errmsg("COPY force not null only available using COPY FROM")));
  
+ 	/* Check ragged */
+ 	if (!cstate->csv_mode && cstate->ragged)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("COPY ragged available only in CSV mode")));
+ 	if (cstate->ragged && !is_from)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("COPY ragged only available using COPY FROM")));
+ 
  	/* Don't allow the delimiter to appear in the null string. */
  	if (strchr(cstate->null_print, cstate->delim[0]) != NULL)
  		ereport(ERROR,
***
*** 2951,2964 
  		int			input_len;
  
  		/* Make sure space remains in fieldvals[] */
! 		if (fieldno >= maxfields)
  			ereport(ERROR,
  					(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
  					 errmsg("extra data after last expected column")));
  
  		/* Remember start of field on both input and output sides */
  		start_ptr = cur_ptr;
! 		fieldvals[fieldno] = output_ptr;
  
  		/*
  		 * Scan data for field,
--- 2970,2984 
  		int			input_len;
  
  		/* Make sure space remains in fieldvals[] */
! 		if (fieldno >= maxfields && !cstate->ragged)
  			ereport(ERROR,
  					(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
  					 errmsg("extra data after last expected column")));
  
  		/* Remember start of field on both input and output sides */
  		start_ptr = cur_ptr;
! 		if (fieldno < maxfields)
! 			fieldvals[fieldno] = output_ptr;
  
  		/*
  		 * Scan data for field,
***
*** 3045,3051 
  		/* Check whether raw input matched null marker */
  		input_len = end_ptr - start_ptr;
  		if (!saw_quote && input_len == cstate->null_print_len &&
! 			strncmp(start_ptr, cstate->null_print, input_len) == 0)
  			fieldvals[fieldno] = NULL;
  
  		fieldno++;
--- 3065,3072 
  		/* Check whether raw input matched null marker */
  		input_len = end_ptr - start_ptr;
  		if (!saw_quote && input_len == cstate->null_print_len &&
! 			strncmp(start_ptr, cstate->null_print, input_len) == 0 &&
! 			fieldno < maxfields)
  			fieldvals[fieldno] = NULL;
  
  		fieldno++;
***
*** 3059,3065 
  	Assert(*output_ptr == '\0');
  	cstate->attribute_buf.len = (output_ptr - cstate->attribute_buf.data);
  
! 	return fieldno;
  }
  
  
--- 3080,3092 
  	Assert(*output_ptr == '\0');
  	cstate->attribute_buf.len = (output_ptr - cstate->attribute_buf.data);
  
! 	/* for ragged input, set field null for underflowed fields */
! 	

Re: [HACKERS] COPY enhancements

2009-09-12 Thread Heikki Linnakangas
Josh Berkus wrote:
 The performance of every path to get data into the database besides COPY
 is too miserable for us to use anything else, and the current
 inflexibility makes it useless for anything but the cleanest input data.
 
 One potential issue we're facing down this road is that current COPY has
 a dual purpose: for database restore, and for importing and exporting
 data.  At some point, we may want to separate those two behaviors,
 because we'll be adding bells and fringes to import/export which slow
 down overall performance or add bugs.

+1. There is an infinite number of bells and whistles we could add to
COPY, and there's also a number of further optimizations that would make
the loading faster. But the code is quite a mess already, because it's
already highly optimized at the expense of readibility. We need to
separate the input parsing from the fast bulk insertion.

Letting external modules replace the input parsing part would allow you
to write a parser for any input format you like. You could even get the
input from a different source altogether, like from another database via
dblink, in a binary format of some sort.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Heikki Linnakangas
Josh Berkus wrote:
 The user-defined table for rejects is obviously exclusive of the system
 one, either of those would be fine from my perspective.
 
 I've been thinking about it, and can't come up with a really strong case
 for wanting a user-defined table if we settle the issue of having a
 strong key for pg_copy_errors.  Do you have one?

A user-defined error callback function would be nice. The function could
insert the erroneous rows into a table, write to a file, print a warning,
perform further checks, or whatever. This assumes that errors are
infrequent enough that the error path is not performance-critical.
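
Purely as a sketch of the idea (no such hook exists; the function, its
signature, and the ERROR_HANDLER option name are all invented here):

CREATE FUNCTION my_copy_error_handler(raw_line text, message text)
RETURNS void AS $$
    -- stash the bad row somewhere it can be inspected later
    INSERT INTO my_copy_errors(raw_line, message) VALUES ($1, $2);
$$ LANGUAGE sql;

COPY foo FROM '/tmp/foo.data' (ERROR_HANDLER 'my_copy_error_handler');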

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Greg Smith

On Fri, 11 Sep 2009, Josh Berkus wrote:


I've been thinking about it, and can't come up with a really strong case
for wanting a user-defined table if we settle the issue of having a
strong key for pg_copy_errors.  Do you have one?


No, but I'd think that if the user table was only allowed to be the exact 
same format as the system one it wouldn't be that hard to implement--once 
the COPY syntax is expanded at least.  I'm reminded of how Oracle EXPLAIN 
PLANs get logged into the PLAN_TABLE by default, but you can specify INTO 
table to put them somewhere else.  You'd basically be doing the same thing
but with a different destination relation.



After some thought, I think that Andrew's feature *is* generally
applicable, if done as IGNORE COLUMN COUNT (or, more likely,
column_count=ignore).  I can think of a lot of data sets where column
count is jagged and you want to do ELT instead of ETL.


Exactly, the ELT approach gives you so many more options for cleaning up 
the data that I think it would be used more if it weren't so hard to 
do in Postgres right now.


As opposed to Tom, Peter and Heikki vetoing things because the feature 
gain doesn't justify the maintenance burden?  That's your real choice.
Adding a framework for manageable syntax extensions means that we can be 
more liberal about what we justify as an extension.


I think you're not talking at the distinction I was trying to make.  The 
work to make the *syntax* for COPY easier to extend is an unfortunate 
requirement for all these new bits; no arguments from me that using GUCs 
for everything is just too painful.


What I was suggesting is that the first set of useful features required 
for what you're calling the ELT load path is both small and well 
understood.  An implementation of the stuff I see a constant need for 
could get banged out so fast that trying to completely generalize it on 
the first pass has a questionable return.


While complicated, COPY is a pretty walled off command of around 3500 
lines of code, and the hackery required here is pretty small.  For 
example, it turns out we do already have the code to get it to ignore 
column overruns here, and it's all of 50 new lines--much of which is 
shared with code that does other error ignoring bits too.  It's easy to 
make a case for a grand future extensibility cleanup here, but it's really
not necessary in order to provide a significant benefit for the cases I
mentioned.  And I would guess the maintenance burden of a more general 
solution has to be higher than a simple implementation of the feature list 
I gave in my last message.


In short:  there's a presumption that adding any error-ignoring code would 
require significant contortions.  I don't think that's really true though, 
and would like to keep open the possibility of accepting some simple but
useful ad-hoc features in this area, even if they don't solve every 
possible problem in this space just yet.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Greg Smith

On Fri, 11 Sep 2009, Emmanuel Cecchet wrote:

I guess the problem with extra or missing columns is to make sure that
you know exactly which data belongs to which column, so that you don't
put data in the wrong columns, which is likely to happen if this is fully
automated.


Allowing the extra column case is easy:  everywhere in copy.c you find the
error message "extra data after last expected column", just ignore the
overflow fields rather than rejecting the line just based on that.  And
the default information I mentioned you might want to substitute for
missing columns is already being collected by the code block with the
comment "Get default info if needed".
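
To make that concrete, the behavior being described would be roughly the
following (the RAGGED option spelling follows Andrew's patch elsewhere in
this thread; the exact invocation here is a guess):

CREATE TABLE t (a int, b text, c text);
-- With ragged input allowed, a four-field line such as
--     1,one,uno,junk
-- would load (1, 'one', 'uno') and ignore the overflow field, instead
-- of failing with "extra data after last expected column".
COPY t FROM '/tmp/ragged.csv' (CSV, RAGGED);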


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Andrew Dunstan



Greg Smith wrote:

After some thought, I think that Andrew's feature *is* generally
applicable, if done as IGNORE COLUMN COUNT (or, more likely,
column_count=ignore). I can think of a lot of data sets where column
count is jagged and you want to do ELT instead of ETL.


Exactly, the ELT approach gives you so many more options for cleaning 
up the data that I think it would be used more if it weren't so hard 
to do in Postgres right now.





+1. That's exactly what my client wants to do. They know perfectly well 
that they get junk data. They want to get it into the database with a 
minimum of fuss where they will have the right tools for checking and 
cleaning it. If they have to spend effort whacking it into shape just to 
get it into the database, then their cleanup effort essentially has to 
be done in two pieces, part inside and part outside the database.





While complicated, COPY is a pretty walled off command of around 3500 
lines of code, and the hackery required here is pretty small. For 
example, it turns out we do already have the code to get it to ignore 
column overruns here, and it's all of 50 new lines--much of which is 
shared with code that does other error ignoring bits too. It's easy to 
make a case for a grand future extensibility cleanup here, but it's
really not necessary in order to provide a significant benefit for the
cases I mentioned. And I would guess the maintenance burden of a more 
general solution has to be higher than a simple implementation of the 
feature list I gave in my last message.


In short: there's a presumption that adding any error-ignoring code 
would require significant contortions. I don't think that's really 
true though, and would like to keep open the possibility of accepting
some simple but useful ad-hoc features in this area, even if they 
don't solve every possible problem in this space just yet.





Right. What I proposed would not have been terribly invasive or 
difficult, certainly less so than what seems to be our direction by an 
order of magnitude at least. I don't for a moment accept the assertion 
that we can get a general solution for the same effort.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 Right. What I proposed would not have been terribly invasive or 
 difficult, certainly less so than what seems to be our direction by an 
 order of magnitude at least. I don't for a moment accept the assertion 
 that we can get a general solution for the same effort.

And at the same time, Greg's list of minimum requirements was far
longer than what you proposed to do.  We can *not* just implement
those things one at a time with no thought towards what the full
solution looks like --- at least not if we want the end result to
look like it was intelligently designed, not merely accreted.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:
  
Right. What I proposed would not have been terribly invasive or 
difficult, certainly less so than what seems to be our direction by an 
order of magnitude at least. I don't for a moment accept the assertion 
that we can get a general solution for the same effort.



And at the same time, Greg's list of minimum requirements was far
longer than what you proposed to do.  We can *not* just implement
those things one at a time with no thought towards what the full
solution looks like --- at least not if we want the end result to
look like it was intelligently designed, not merely accreted.


  


I don't disagree with that.

At the same time, I think it's probably not a good thing that users who 
deal with very large amounts of data would be forced off the COPY fast 
path by a need for something like input support for non-rectangular 
data. It probably won't affect my clients too much in this instance, but 
then their largest loads are usually of the order of only 50,000 records 
or so. I understand Truviso has handled this by a patch that does the 
sort of stuff Greg was talking about.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 At the same time, I think it's probably not a good thing that users who 
 deal with very large amounts of data would be forced off the COPY fast 
 path by a need for something like input support for non-rectangular 
 data.

[ shrug... ]  Everybody in the world is going to want their own little
problem to be handled in the fast path.  And soon it won't be so fast
anymore.  I think it is perfectly reasonable to insist that the fast
path is only for clean data import.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-12 Thread Greg Smith

On Sat, 12 Sep 2009, Tom Lane wrote:

Everybody in the world is going to want their own little problem to be 
handled in the fast path.  And soon it won't be so fast anymore.  I 
think it is perfectly reasonable to insist that the fast path is only 
for clean data import.


The extra overhead is that when you hit the checks that are already in the 
code, where the row would normally be rejected, there's a second check as 
to whether that particular problem is considered OK or not.  There won't 
be any additional overhead for clean imports.  As I was pointing out in 
one of the messages in this thread, all of the expensive things you need 
are already being done.


As for everybody in the world wanting a specific fix for their private 
problems, I assure you that everything I suggested comes up constantly on
every legacy data conversion or import job I see.  This is not stuff that 
fits a one-off need, these are things that make it harder for people to 
adopt PostgreSQL all the time.  I wouldn't care about this one bit if 
these particular issues didn't ruin my day constantly.  Consider the fact 
that we're looking at three samples of people who have either already 
written a patch in this area or considered writing one showing up just 
among people on the hackers list.  That should be a hint as to how common
these requests are.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Emmanuel Cecchet

Hi Robert,


I like this idea, perhaps not surprisingly (for those not following at
home: that was my patch).  Unfortunately, it looks to me like there is
no way to do this without overhauling the syntax.  If the existing
syntax required a comma between options (i.e. "copy blah to stdout
binary, oids" rather than "copy to stdout binary oids"), this would be
pretty straightforward; but it doesn't even allow one.
  
Well, some options like CSV FORCE ... take a comma-separated list of
columns. This would require all options to become reserved keywords, or
force parentheses around option parameters.

I wonder if we should consider allowing COPY options to be
comma-separated beginning with 8.5, and then require it in 8.6.  Other
options include continuing to support the old syntax for the existing
options, but allowing some new syntax as well and requiring its use
for all new options (this is what we did with EXPLAIN, but there were
only two pre-existing options there), and just changing the syntax
incompatibly and telling users to suck it up.  But I'm not sure I like
either of those choices.
  
We could keep the current syntax for backward compatibility only (can be 
dropped in a future release) and have a new syntax (starting in 8.5). To 
avoid confusion between both, we could just replace WITH with something 
else (or just drop it) to indicate that this is the new syntax.


The new syntax could look like:

COPY /tablename/ [ ( /column/ [, ...] ) ]
   FROM { '/filename/' | STDIN }
   [ [, BINARY ]
 [, OIDS ]
 [, DELIMITER [ AS ] '/delimiter/' ]
 [, NULL [ AS ] '/null string/' ]
 [, CSV [ HEADER ]
[ QUOTE [ AS ] '/quote/' ] 
[ ESCAPE [ AS ] '/escape/' ]

[ FORCE NOT NULL (/column/ [, ...]) ]
 [, ERRORS { SKIP | 
 LOG INTO { tablename | 'filename' }

   [ LABEL label_name ]
   [ KEY key_name ]
   [ MAX ERRORS /count/ ] } ]
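
For instance, an invocation under that grammar might read (table, file,
and label names invented):

COPY foo (a, b, c)
   FROM '/tmp/foo.data'
   CSV HEADER,
   ERRORS LOG INTO error_table
     LABEL 'foo_load'
     MAX ERRORS 100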


Is this what you had in mind?

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com



Re: [HACKERS] COPY enhancements

2009-09-11 Thread Tom Lane
Emmanuel Cecchet m...@asterdata.com writes:
 The new syntax could look like:

 COPY /tablename/ [ ( /column/ [, ...] ) ]
 FROM { '/filename/' | STDIN }
 [ [, BINARY ]
   [, OIDS ]
   [, DELIMITER [ AS ] '/delimiter/' ]
   [, NULL [ AS ] '/null string/' ]
   [, CSV [ HEADER ]
  [ QUOTE [ AS ] '/quote/' ] 
  [ ESCAPE [ AS ] '/escape/' ]
  [ FORCE NOT NULL (/column/ [, ...]) ]
   [, ERRORS { SKIP | 
   LOG INTO { tablename | 'filename' }
 [ LABEL label_name ]
 [ KEY key_name ]
 [ MAX ERRORS /count/ ] } ]

 Is this what you had in mind?

No, because that doesn't do a darn thing to make the option set less
hard-wired into the syntax.  I was thinking of a strict keyword/value
format with non-wired-in keywords ... and only *one* keyword per value.
See EXPLAIN.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Emmanuel Cecchet

Tom,

I looked at EXPLAIN 
(http://www.postgresql.org/docs/current/interactive/sql-explain.html) 
and there is not a single line of what you are talking about.

And the current syntax is just EXPLAIN [ ANALYZE ] [ VERBOSE ] /statement/
If I try to decrypt what you said, you are looking at something like

COPY /tablename/ [ ( /column/ [, ...] ) ]
   FROM { '/filename/' | STDIN }
   [option1=value[,...]]

That would give something like:
COPY foo FROM 'input.txt' binary=on, oids=on, errors=skip, max_errors=10

If this is not what you are thinking, please provide an example.

Emmanuel

Emmanuel Cecchet m...@asterdata.com writes:
  

The new syntax could look like:



  

COPY /tablename/ [ ( /column/ [, ...] ) ]
FROM { '/filename/' | STDIN }
[ [, BINARY ]
  [, OIDS ]
  [, DELIMITER [ AS ] '/delimiter/' ]
  [, NULL [ AS ] '/null string/' ]
  [, CSV [ HEADER ]
 [ QUOTE [ AS ] '/quote/' ] 
 [ ESCAPE [ AS ] '/escape/' ]

 [ FORCE NOT NULL (/column/ [, ...]) ]
  [, ERRORS { SKIP | 
  LOG INTO { tablename | 'filename' }

[ LABEL label_name ]
[ KEY key_name ]
[ MAX ERRORS /count/ ] } ]



  

Is this what you had in mind?



No, because that doesn't do a darn thing to make the option set less
hard-wired into the syntax.  I was thinking of a strict keyword/value
format with non-wired-in keywords ... and only *one* keyword per value.
See EXPLAIN.

regards, tom lane
  



--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Robert Haas
On Fri, Sep 11, 2009 at 10:53 AM, Emmanuel Cecchet m...@asterdata.com wrote:
 Tom,

 I looked at EXPLAIN
 (http://www.postgresql.org/docs/current/interactive/sql-explain.html) and
 there is not a single line of what you are talking about.
 And the current syntax is just EXPLAIN [ ANALYZE ] [ VERBOSE ] /statement
 /

http://developer.postgresql.org/pgdocs/postgres/sql-explain.html

Or look at your CVS/git checkout.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Robert Haas
On Fri, Sep 11, 2009 at 10:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Emmanuel Cecchet m...@asterdata.com writes:
 The new syntax could look like:

 COPY /tablename/ [ ( /column/ [, ...] ) ]
     FROM { '/filename/' | STDIN }
     [ [, BINARY ]
       [, OIDS ]
       [, DELIMITER [ AS ] '/delimiter/' ]
       [, NULL [ AS ] '/null string/' ]
       [, CSV [ HEADER ]
              [ QUOTE [ AS ] '/quote/' ]
              [ ESCAPE [ AS ] '/escape/' ]
              [ FORCE NOT NULL (/column/ [, ...]) ]
       [, ERRORS { SKIP |
                   LOG INTO { tablename | 'filename' }
                     [ LABEL label_name ]
                     [ KEY key_name ]
                     [ MAX ERRORS /count/ ] } ]

 Is this what you had in mind?

 No, because that doesn't do a darn thing to make the option set less
 hard-wired into the syntax. I was thinking of a strict keyword/value
 format with non-wired-in keywords ... and only *one* keyword per value.
 See EXPLAIN.

I was thinking something like:

COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN }
[WITH] [option [, ...]]

Where:

option := ColId [Sconst] | FORCE NOT NULL (column [,...])

I don't see any reasonable way to sandwich the FORCE NOT NULL syntax
into a keyword/value notation.
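
For example, commands under that grammar might look like this (option
and column names invented):

COPY foo (a, b, c) FROM '/tmp/foo.data'
    WITH csv, header, delimiter '|', FORCE NOT NULL (b, c)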

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Or look at your CVS/git checkout.

The important point is to look at the grammar, which doesn't have any
idea what the specific options are in the list.  (Well, okay, it had
to have special cases for ANALYZE and VERBOSE because those are reserved
words :-(.  But future additions will not need to touch the grammar.
In the case of COPY that also means not having to touch psql \copy.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I don't see any reasonable way to sandwich the FORCE NOT NULL syntax
 into a keyword/value notation.

Any number of ways, for example "force_not_null = true" or multiple
occurrences of "force_not_null = column_name".  Andrew was on the verge
of admitting we don't need that option anymore anyway ;-), so I don't
think we should allow it to drive an exception to the simplified syntax.
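
In that style, the two forms might read, for example (sketch only):

COPY foo FROM '/tmp/foo.data' (force_not_null = true)
COPY foo FROM '/tmp/foo.data' (force_not_null = b, force_not_null = c)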

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Pierre Frédéric Caillaud

I was thinking something like:

COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN }
[WITH] [option [, ...]]

Where:

option := ColId [Sconst] | FORCE NOT NULL (column [,...])

I don't see any reasonable way to sandwich the FORCE NOT NULL syntax
into a keyword/value notation.


	Postgres has an hstore data type which seems well suited for passing
key/value option pairs...
	Why another syntax to remember, another parser to code, when almost
everything is already there?


	Think about plpgsql code which generates some SQL COPY command string,
which is then parsed and executed... wouldn't it be a lot simpler to just
manipulate parameters in an hstore?...
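
	contrib/hstore already provides the lookup machinery; for example
(illustrative only -- no COPY variant accepts an hstore today):

	SELECT ('format=>csv, delimiter=>"|", errors=>skip'::hstore)
	       -> 'delimiter';    -- returns |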



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Robert Haas
On Fri, Sep 11, 2009 at 11:26 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I don't see any reasonable way to sandwhich the FORCE NOT NULL syntax
 into a keyword/value notation.

 Any number of ways, for example force_not_null = true or multiple
 occurrences of force_not_null = column_name.  Andrew was on the verge
 of admitting we don't need that option anymore anyway ;-), so I don't
 think we should allow it to drive an exception to the simplified syntax.

While I'm at least as big a fan of generic options as the next person,
syntax is cheap.  I don't see any reason to get worked up about one
exception to a generic options syntax.  If the feature is useless, of
course we can rip it out, but that's a separate discussion.  For what
it's worth, I think your proposed alternative is ugly and an abuse of
the idea of keyword-value pairs.  In the EXPLAIN-world, a later value
for the same option overrides a previous assignment earlier in the
list, and I am in favor of sticking with that approach.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY enhancements

2009-09-11 Thread Robert Haas
2009/9/11 Pierre Frédéric Caillaud li...@peufeu.com:
 I was thinking something like:

 COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN }
 [WITH] [option [, ...]]

 Where:

 option := ColId [Sconst] | FORCE NOT NULL (column [,...])

 I don't see any reasonable way to sandwich the FORCE NOT NULL syntax
 into a keyword/value notation.

        Postgres has an hstore data type which seems well suited for passing
 key/value option pairs...
        Why another syntax to remember, another parser to code, when almost
 everything is already there?

        Think about plpgsql code which generates some SQL COPY command
 string, which is then parsed and executed... wouldn't it be a lot simpler to
 just manipulate parameters in an hstore?...

I doubt it very much.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

