Re: [HACKERS] POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2014-12-26 Thread Pavel Stehule
2014-12-26 11:41 GMT+01:00 Pavel Stehule:

>
>
> 2014-12-25 22:23 GMT+01:00 Alex Shulgin:
>
>> Trent Shipley writes:
>>
>> > On Friday 2007-12-14 16:22, Tom Lane wrote:
>> >> Neil Conway writes:
>> >> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct
>> >> > COPY to drop (and log) rows that contain malformed data.  That is,
>> >> > rows with too many or too few columns, rows that result in
>> >> > constraint violations, and rows containing columns where the data
>> >> > type's input function raises an error.  The last case is the only
>> >> > thing that would be a bit tricky to implement, I think: you could
>> >> > use PG_TRY() around the InputFunctionCall, but I guess you'd need
>> >> > a subtransaction to ensure that you reset your state correctly
>> >> > after catching an error.
>> >>
>> >> Yeah.  It's the subtransaction per row that's daunting --- not only
>> >> the cycles spent for that, but the ensuing limitation to 4G rows
>> >> imported per COPY.
>> >
>> > You could extend the COPY FROM syntax with a COMMIT EVERY n clause.
>> > This would help with the 4G subtransaction limit.  The cost to the
>> > ETL process is that a simple rollback would not be guaranteed to
>> > send the process back to its initial state.  There are easy ways to
>> > deal with the rollback issue though.
>> >
>> > A {NO} RETRY {USING algorithm} clause might be useful.  If the NO
>> > RETRY option is selected then the COPY FROM can run without
>> > subtransactions and in excess of the 4G per transaction limit.  NO
>> > RETRY should be the default since it preserves the legacy behavior
>> > of COPY FROM.
>> >
>> > You could have an EXCEPTIONS TO {filename|STDERR} clause.  I would
>> > not give the option of sending exceptions to a table since they are
>> > presumably malformed, otherwise they would not be exceptions.
>> > (Users should re-process exception files if they want "if good then
>> > table A else exceptions to table B" ...)
>> >
>> > EXCEPTIONS TO and NO RETRY would be mutually exclusive.
>> >
>> >
>> >> If we could somehow only do a subtransaction per failure, things
>> >> would be much better, but I don't see how.
>>
>> Hello,
>>
>> Attached is a proof of concept patch for this TODO item.  There are no
>> docs yet, I just wanted to know if the approach is sane.
>>
>> The added syntax is like the following:
>>
>>   COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]
>>
>> The way it's done is by abusing Copy Both mode, and from my limited
>> testing that seems to just work.  The error trapping itself is done
>> using PG_TRY/PG_CATCH and can only catch formatting or before-insert
>> trigger errors; no attempt is made to recover from a failed unique
>> constraint, etc.
>>
>> Example in action:
>>
>> postgres=# \d test_copy2
>>    Table "public.test_copy2"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  id     | integer |
>>  val    | integer |
>>
>> postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
>> 1
>> NOTICE:  missing data for column "val"
>> CONTEXT:  COPY test_copy2, line 1: "1"
>> 2
>> NOTICE:  missing data for column "val"
>> CONTEXT:  COPY test_copy2, line 2: "2"
>> 3
>> NOTICE:  missing data for column "val"
>> CONTEXT:  COPY test_copy2, line 3: "3"
>> NOTICE:  total exceptions ignored: 3
>>
>> postgres=# \d test_copy1
>>    Table "public.test_copy1"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  id     | integer | not null
>>
>> postgres=# set client_min_messages to warning;
>> SET
>> postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
>> ...
>> vmstat
>> zoneinfo
>> postgres=#
>>
>> Limited performance testing shows no significant difference between
>> the error-catching and plain code paths.  For example, timing
>>
>>   copy test_copy1 from program 'seq 100' [exceptions to stdout]
>>
>> shows similar numbers with or without the added "exceptions to" clause.
>>
>> Now that I'm sending this I wonder if the original comment about the
>> need for a subtransaction around every loaded line still holds.  Any
>> example of what would not be properly rolled back by just PG_TRY?
>>
>
> this method is unsafe .. exception handlers usually don't free memory -
> there is a risk of memory leaks and resource leaks
>
> you can get the same performance with block subtransactions - when you
> use one subtransaction per 1000 rows, the impact of subtransactions is
> minimal
>
> when a block fails, you can retry it with row-level subtransactions -
> that works well when you expect mostly correct data.
>
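
To show the problem with a bare PG_TRY concretely - a sketch of the
trapped input-function call (simplified; in_functions, typioparams and
values are the copy.c array names, `skipped` is hypothetical):

  PG_TRY();
  {
      /* the datatype input function may palloc memory, open relations,
       * take locks, ... */
      values[i] = InputFunctionCall(&in_functions[i], string,
                                    typioparams[i], typmod);
  }
  PG_CATCH();
  {
      /* the error is trapped, but nothing here releases what the input
       * function allocated or acquired - only a (sub)transaction abort
       * runs the resource owner and memory context cleanup */
      FlushErrorState();
      skipped++;
  }
  PG_END_TRY();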

Two years ago I wrote an extension that did this, but I haven't had time
to finish it and push it upstream.
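
The core idea looks roughly like this - a simplified sketch using the
standard backend subtransaction calls, not the actual ftcopy code
(insert_next_row is a hypothetical helper):

  #include "postgres.h"
  #include "access/xact.h"
  #include "commands/copy.h"
  #include "utils/resowner.h"

  static void insert_next_row(CopyState cstate);  /* hypothetical */

  /* load one batch of rows inside a single subtransaction; on failure
   * the caller rolls back just this batch and retries it row by row,
   * skipping the rows that raise errors */
  static bool
  copy_batch(CopyState cstate, int batch_size)
  {
      MemoryContext oldcontext = CurrentMemoryContext;
      ResourceOwner oldowner = CurrentResourceOwner;
      bool        ok = true;

      BeginInternalSubTransaction(NULL);
      MemoryContextSwitchTo(oldcontext);

      PG_TRY();
      {
          int         i;

          for (i = 0; i < batch_size; i++)
              insert_next_row(cstate);

          /* whole batch succeeded: commit the subtransaction */
          ReleaseCurrentSubTransaction();
      }
      PG_CATCH();
      {
          /* aborting the subtransaction runs the resource owner and
           * memory context cleanup that a bare PG_CATCH skips */
          MemoryContextSwitchTo(oldcontext);
          FlushErrorState();
          RollbackAndReleaseCurrentSubTransaction();
          ok = false;
      }
      PG_END_TRY();

      MemoryContextSwitchTo(oldcontext);
      CurrentResourceOwner = oldowner;

      return ok;
  }

With a batch size of 1000, clean data consumes one subtransaction per
1000 rows instead of one per row, so both the overhead and the pressure
on the 4G subtransaction limit drop by three orders of magnitude.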

Regards

Pavel


>
> Regards
>
> Pavel
>
>
>>
>> Happy hacking!
>> --
>> Alex
>>
>


ftcopy-04.tgz
Description: GNU Zip compressed data
