Re: [HACKERS] Copy From Insert UNLESS

2006-02-14 Thread James William Pye
On Mon, Feb 06, 2006 at 05:08:38PM -0500, Alon Goldshuv wrote:
 The proposal is neat, however, I am not too
 excited about handling errors in such high granularity, as far as the user
 is concerned. I am more on the same line with Tom Lane's statement in
 Simon's thread (Practical error logging for very large COPY statements):
 
 The general problem that needs to be solved is trap any error that
 occurs during attempted insertion of a COPY row, and instead of aborting
 the copy, record the data and the error message someplace else.  Seen
 in that light, implementing a special path for uniqueness violations is
 pretty pointless.

I think I would be inclined to actually agree with that, which is why I proposed
a special path for constraint violations in general as opposed to just
uniqueness. However, I can understand if you remain unmoved. ;)


 But, I definitely share your struggle to finding a good way to handle those
 unique/FK constraints...

Aye, :-(


 As far as UNIQUE goes, maybe there is a good way to do a bt scan against the
 index table right before the simple_heap_insert call?

Yes, but I believe some additional locking is required in order to make that
safe. Not that that would kill it, but I think there is a better way; I'm
cooking up a general proposal for refactoring unique constraints, so I'm hoping
something along those lines will aid any patch attempting to solving this
problem[copy error/violation management].
-- 
Regards, James William Pye

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Copy From Insert UNLESS

2006-02-06 Thread James William Pye
On Sun, Feb 05, 2006 at 07:14:49PM -0800, Stephan Szabo wrote:
 On Sun, 5 Feb 2006, James William Pye wrote:
  However, constraints referenced in an UNLESS clause that are deferred, in 
  any
  fashion, should probably be immediated within the context of the command.
  Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS 
  were
  to actually alter the timing of a given constraint.
 
 The problem is that even immediate constraints are supposed to be checked
 at end of statement, not at row time.

I see. Immediated is not the word that I am actually looking for then. :(
Perhaps Postgres should specify our current immediate as a new constraint 
mode.
instant, maybe? Sadly, I think it will be difficult to get away from using 
that or
some other synonym if such an idea were to be implemented.

[Getting the feeling that this has been discussed before. ;]

 Our implementation of UNIQUE is particularly bad for this.

Yes. Changing how UNIQUE constraints are implemented will likely be the first
step in this patch.

  Any facility that can alter the tuple before it being inserted into the heap
  should probably be exercised prior to the application of the tuple against
  UNLESS's behavior.
 
 The problem is that you can un-violate a unique constraint by changing
 some other row that's already in the table. And I think that it might even
 be legal to do so in an after trigger (and in fact, some other row's after
 trigger).
 [join]
 Basically a violation at the time the row is
 created is irrelevant if the violation is gone by the end of statement.

Okay. I can't help but think such a trigger as being questionable at best.
However, per spec, it should be possible. =\

 This isn't necessarily a killer to the idea though, it probably just means
 the semantics are harder to nail down.

Aye. I figured there would be some details that might take a while.


Once the UNIQUE constraint code is relocated, I think implementing more
standards compliant constraint timing might be substantially easier. However, I
don't think this should effect UNLESS. Rather, I think UNLESS should, more or
less, demand that specified constraints be checked at the same time as they are
currently. This is meant to be an optimization at multiple levels; reduce code
redundancy(rewriting constraint checks for use prior to the actual insertion),
computational redundancy(potentially, running the rewritten checks more than
once), and reduce unnecessary I/O(avoiding heap_insert()'ing an evil tuple
into the target table despite the fact that the statement may later inviolate
it). Although, perhaps, it could be configurable with an option;
INSERT INTO t UNLESS [DEFERRED] CONSTRAINT VIOLATION. =)
-- 
Regards, James William Pye

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Copy From Insert UNLESS

2006-02-06 Thread Josh Berkus

James,

Are you sure that a new type of constraint is the way to go for this? 
It doesn't solve our issues in the data warehousing space.  The spec we 
started with for Error-tolerant COPY is:


1) It must be able to handle parsing errors (i.e. bad char set);
2) It must be able to handle constraint violations;
3) It must output all row errors to a log or errors table which makes 
it possible to determine which input row failed and why;
4) It must not slow significantly (like, not more than 15%) the speed of 
bulk loading.


On that basis, Alon started working on a low-level error trapper for 
COPY.   It seems like your idea, which would involve a second constraint 
check, would achieve neigher #1 nor #4.


--Josh Berkus

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Copy From Insert UNLESS

2006-02-06 Thread Stephan Szabo
On Mon, 6 Feb 2006, Josh Berkus wrote:

 Are you sure that a new type of constraint is the way to go for this?
 It doesn't solve our issues in the data warehousing space.  The spec we
 started with for Error-tolerant COPY is:

 1) It must be able to handle parsing errors (i.e. bad char set);
 2) It must be able to handle constraint violations;
 3) It must output all row errors to a log or errors table which makes
 it possible to determine which input row failed and why;
 4) It must not slow significantly (like, not more than 15%) the speed of
 bulk loading.

 On that basis, Alon started working on a low-level error trapper for
 COPY.   It seems like your idea, which would involve a second constraint
 check, would achieve neigher #1 nor #4.

I think in his system it wouldn't check the constraints twice, it'd just
potentially check them at a different time than the normal constraint
timing, so I think it'd cover #4. I'd wonder if there'd be any possibility
of having violations get unnoticed in that case, but I'm not coming up
with an obvious way that could happen.


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Copy From Insert UNLESS

2006-02-06 Thread Stephan Szabo
On Mon, 6 Feb 2006, James William Pye wrote:

 On Sun, Feb 05, 2006 at 07:14:49PM -0800, Stephan Szabo wrote:
  On Sun, 5 Feb 2006, James William Pye wrote:
   However, constraints referenced in an UNLESS clause that are deferred, in 
   any
   fashion, should probably be immediated within the context of the 
   command.
   Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS 
   were
   to actually alter the timing of a given constraint.
 
  The problem is that even immediate constraints are supposed to be checked
  at end of statement, not at row time.

 I see. Immediated is not the word that I am actually looking for then. :(
 Perhaps Postgres should specify our current immediate as a new constraint 
 mode.
 instant, maybe? Sadly, I think it will be difficult to get away from using 
 that or
 some other synonym if such an idea were to be implemented.

 [Getting the feeling that this has been discussed before. ;]

Only parts of it. :)

  Our implementation of UNIQUE is particularly bad for this.

 Yes. Changing how UNIQUE constraints are implemented will likely be the first
 step in this patch.

   Any facility that can alter the tuple before it being inserted into the 
   heap
   should probably be exercised prior to the application of the tuple against
   UNLESS's behavior.
 
  The problem is that you can un-violate a unique constraint by changing
  some other row that's already in the table. And I think that it might even
  be legal to do so in an after trigger (and in fact, some other row's after
  trigger).
  [join]
  Basically a violation at the time the row is
  created is irrelevant if the violation is gone by the end of statement.

 Okay. I can't help but think such a trigger as being questionable at best.
 However, per spec, it should be possible. =\

Yeah, it's pretty odd in the insert case.  It's easy in the update case to
make a case where it matters, definately less so for insert.

 Once the UNIQUE constraint code is relocated, I think implementing more
 standards compliant constraint timing might be substantially easier. However, 
 I
 don't think this should effect UNLESS. Rather, I think UNLESS should, more or
 less, demand that specified constraints be checked at the same time as they 
 are
 currently. This is meant to be an optimization at multiple levels; reduce code
 redundancy(rewriting constraint checks for use prior to the actual insertion),
 computational redundancy(potentially, running the rewritten checks more than
 once), and reduce unnecessary I/O(avoiding heap_insert()'ing an evil tuple
 into the target table despite the fact that the statement may later 
 inviolate
 it). Although, perhaps, it could be configurable with an option;
 INSERT INTO t UNLESS [DEFERRED] CONSTRAINT VIOLATION. =)

I'd say that if we were going to check the constraints at a different
time, we'd want a better name/description than UNLESS CONSTRAINT
VIOLATION since the unadorned INSERT or COPY might run with no constraint
violations.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Copy From Insert UNLESS

2006-02-06 Thread Alon Goldshuv
 Alon Goldshuv on Bizgres has been working on this as well. Maybe you
 could collaborate?  Alon?

I would love to collaborate. The proposal is neat, however, I am not too
excited about handling errors in such high granularity, as far as the user
is concerned. I am more on the same line with Tom Lane's statement in
Simon's thread (Practical error logging for very large COPY statements):

The general problem that needs to be solved is trap any error that
occurs during attempted insertion of a COPY row, and instead of aborting
the copy, record the data and the error message someplace else.  Seen
in that light, implementing a special path for uniqueness violations is
pretty pointless.

But, I definitely share your struggle to finding a good way to handle those
unique/FK constraints...

Out of your current possible known solutions list:

. Temporary table that filters out the evil tuples.
. BEFORE TRIGGER handling the tuple if the constraint of interest is
violated.
. INSERT wrapped in a subtransaction.
. (Other variations)

I really don't like Temporary tables (too much user intervention) or
subtransactions (slw). I also don't like using pg_loader for that
manner, as although it's a nice tool, isolating errors with it for large
data sets is impractical.

I guess the BEFORE TRIGGER is the closest solution to what I would like to
achieve. I think something can be done even without a trigger. We could trap
any of the following:

- bad data (any error before the tuple can be created).
- domain constraints
- check constraints
- NOT NULL constraints

As far as UNIQUE goes, maybe there is a good way to do a bt scan against the
index table right before the simple_heap_insert call? Hopefully without too
much code duplication. I am not too familiar with that code, so I don't have
a very specific idea yet. I don't know how much slower things will become
with this extra scan (I would think it will still be simpler and faster than
a subtransaction), but I figure that there is a price to pay if you want
single row error isolation. Otherwise, if the user wants to run COPY like it
is currently (all data rows or nothing) they could still do it in the same
speed using the current code path, bypassing the extra scan.

Not sure this way very helpful, but these are my thoughts at this moment.

Regards,
Alon.




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Copy From Insert UNLESS

2006-02-06 Thread James William Pye
On Mon, Feb 06, 2006 at 11:03:06AM -0800, Josh Berkus wrote:
 Are you sure that a new type of constraint is the way to go for this?

[Thinking that you are referring to the new constraint mode that I was
confusingly referring to...]

Well, it really wouldn't be new. It's just labeling what we do now as something
other than immediate. Considering that immediate constraints are meant to be
checked at the end of the SQL-statement, and our implementation of immediate is
truly immediate, as Stephan pointed out to me. However, I think our current
timing method is better for normal cases, at least for Postgres, than what the
spec specifies.
[See pages 63-66: The second paragraph in 4.17.2 Checking of constraints]

Ultimately, I don't care about this very much. However, I think an
implementation of my proposal would aid in implementing spec compliant
immediate timing.

[If I misunderstood what you were getting at, sorry. :]

 It doesn't solve our issues in the data warehousing space.  The spec we 
 started with for Error-tolerant COPY is:

 1) It must be able to handle parsing errors (i.e. bad char set);

My proposal did not handle this, and purposefully so. A constraint violation,
while inhibiting insertion into the target table would still yield a kosher
tuple--just not okay for that table, which could then be dropped or redirected
using the THEN INSERT INTO into another precisely structured table for later
analysis. Bad data errors would not even have a tuple to work with in the first
place, which is why I wanted to draw a distinction.

I think having something to handle bad data is useful, but I think it should be
distinct, syntactically and implementation-wise, from constraint violations.

That's not to say that it couldn't fit into the model that UNLESS would try to
create:
 COPY ... UNLESS BAD DATA [ON COLUMN (y)] OR CONSTRAINT VIOLATION [ON (z)] ...

 2) It must be able to handle constraint violations;

Check. :)

 3) It must output all row errors to a log or errors table which makes 
 it possible to determine which input row failed and why;

Check; save data errors for now.

 4) It must not slow significantly (like, not more than 15%) the speed of 
 bulk loading.

Check. (See below)

 It seems like your idea, which would involve a second constraint 
 check, would achieve neigher #1 nor #4.

I'm not proposing that a second constraint check should be made.

The difficulty of my implementation comes from the position that I don't think
the current implementation of UNIQUE constraints is ideal. It is hidden
inside nbtree, which, while convenient, is not likely to be the best place for
it. I believe my original letter covered this by proposing a new pg_am column;
one that would hold a regproc that would be able to 'scan for insert' and return
the state(position, locks, whether an entry exists, anything else necessary for
a quick insert) of that scan to the caller for later use in the actual insert or
update. All other constraints appear to require trivial modifications to get it
to work with UNLESS without any redundancy.
-- 
Regards, James William Pye

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Copy From Insert UNLESS

2006-02-06 Thread Josh Berkus
James,

 The difficulty of my implementation comes from the position that I don't
 think the current implementation of UNIQUE constraints is ideal. It is
 hidden inside nbtree, which, while convenient, is not likely to be the
 best place for it. 

Agreed; one of the things that's been on the TODO list for quite a while is 
real deferrable unique constraints that would allow for (for example) 
reordering of a UNIQUE column inside a transaction.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Copy From Insert UNLESS

2006-02-05 Thread Josh Berkus

James,


I am seeking, as many others are or have, to improve the performance on bulk
loads to live systems where constraint violations may occur and filtering can
be done more efficiently within the backend.

Primarily, I'm concerned with UNIQUE violations. However, I think tackling the
general case is the wiser action.


Alon Goldshuv on Bizgres has been working on this as well. Maybe you 
could collaborate?  Alon?


--Josh

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Copy From Insert UNLESS

2006-02-05 Thread Josh Berkus

Folks,


The subject of this letter is referring to giving INSERT and COPY FROM STDIN
the ability to alter the destination of rows that violate any constraints named
in a user specified set.


BTW,  just in case anyone thinks that James is not addressing a real and 
widespread problem, Joe Conway said in his presentation on Symer's 1.2 
TB databases that the single most painful thing they had to deal with in 
the implementation is filtering out bad rows before COPY (from OSCON2005 
presentation).


--Josh

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Copy From Insert UNLESS

2006-02-05 Thread Stephan Szabo

On Fri, 3 Feb 2006, James William Pye wrote:

 Despite the fact that my experimental patch uses error trapping, that is *not*
 what I have in mind for the implementation. I do not want to trap errors upon
 insert or copy from. Rather, I wish to implement functionality that would 
 allow
 alternate destinations for tuples that violate user specified constraints on
 the table, which, by default, will be to simply drop the tuple.

 My proposed syntax is along the lines of:

INSERT INTO table [ ( column [, ...] ) ]
 *   [UNLESS CONSTRAINT VIOLATION
  [ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
 { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) | query }

  and

COPY tablename [ ( column [, ...] ) ]
 FROM { 'filename' | STDIN }
 *   [UNLESS CONSTRAINT VIOLATION
  [ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
...

 The purpose of the arguably unintuitive THEN INSERT INTO x portion is to 
 provide
 the mechanism in which a user can specify the destination table for tuples 
 that
 violated the associated set of constraints. Using the OR portion allows the 
 user
 to specify additional sets of constraints for different destinations.

 A tuple will be withheld from the target table if ANY of the constraints
 listed in any of the constraint_name sets is violated. Constraint sets should
 not [may not?] reference the same constraint multiple times, even among
 different sets.

 Example:

  \d dest_table
Table public.dest_table
   Column |  Type   | Modifiers
  +-+---
   i  | integer | not null
   j  | integer |
  Indexes:
  dest_table_pkey PRIMARY KEY, btree (i)
  Check constraints:
  dest_table_j_check CHECK (j  0)

  CREATE TEMP TABLE pkey_failures (i int, j int);
  CREATE TEMP TABLE check_failures (i int, j int);

  COPY dest_table FROM STDIN
   UNLESS CONSTRAINT VIOLATION
ON (dest_table_pkey) THEN INSERT INTO pkey_failures
OR (dest_table_j_check) THEN INSERT INTO check_failures;

 For most constraints, this proposed implementation should be fairly easy to
 implement.

Have you considered how this might work with spec-compliant constraint
timing?  I think even in inserting cases, a later trigger before statement
end could in some cases un-violate a constraint, so checking before insert
won't actually be the same behavior as the normal constraint handling
which seems bad for this kind of system.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Copy From Insert UNLESS

2006-02-05 Thread James William Pye
On Sun, Feb 05, 2006 at 02:08:12PM -0800, Stephan Szabo wrote:
 Have you considered how this might work with spec-compliant constraint
 timing?

I haven't gone so far as to look into the spec, yet. [Noise of rustling papers]

However, constraints referenced in an UNLESS clause that are deferred, in any
fashion, should probably be immediated within the context of the command.
Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS were
to actually alter the timing of a given constraint.

 I think even in inserting cases, a later trigger before statement
 end could in some cases un-violate a constraint, so checking before insert
 won't actually be the same behavior as the normal constraint handling
 which seems bad for this kind of system.

Any facility that can alter the tuple before it being inserted into the heap
should probably be exercised prior to the application of the tuple against 
UNLESS's behavior. The implementation of UNLESS will probably completely change
ExecConstraints(), which comes after the firing of BEFORE triggers and before
heap_insert(). Beyond that, I am not sure what other considerations should be
made with respect to triggers. So, UNLESS should/will be applied after BEFORE
triggers, but before non-UNLESS specified constraints. ;)
-- 
Regards, James William Pye

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Copy From Insert UNLESS

2006-02-05 Thread Stephan Szabo

On Sun, 5 Feb 2006, James William Pye wrote:

 On Sun, Feb 05, 2006 at 02:08:12PM -0800, Stephan Szabo wrote:
  Have you considered how this might work with spec-compliant constraint
  timing?

 I haven't gone so far as to look into the spec, yet. [Noise of rustling 
 papers]

 However, constraints referenced in an UNLESS clause that are deferred, in any
 fashion, should probably be immediated within the context of the command.
 Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS were
 to actually alter the timing of a given constraint.

The problem is that even immediate constraints are supposed to be checked
at end of statement, not at row time.  Our implementation of UNIQUE is
particularly bad for this.  Basically a violation at the time the row is
created is irrelevant if the violation is gone by the end of statement.

  I think even in inserting cases, a later trigger before statement
  end could in some cases un-violate a constraint, so checking before insert
  won't actually be the same behavior as the normal constraint handling
  which seems bad for this kind of system.

 Any facility that can alter the tuple before it being inserted into the heap
 should probably be exercised prior to the application of the tuple against
 UNLESS's behavior.

The problem is that you can un-violate a unique constraint by changing
some other row that's already in the table. And I think that it might even
be legal to do so in an after trigger (and in fact, some other row's after
trigger).

This isn't necessarily a killer to the idea though, it probably just means
the semantics are harder to nail down.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly