Re: [HACKERS] POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2014-12-26 Thread Pavel Stehule
2014-12-25 22:23 GMT+01:00 Alex Shulgin a...@commandprompt.com:

 Trent Shipley trent_ship...@qwest.net writes:

  On Friday 2007-12-14 16:22, Tom Lane wrote:
  Neil Conway ne...@samurai.com writes:
   By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
   to drop (and log) rows that contain malformed data. That is, rows with
   too many or too few columns, rows that result in constraint violations,
   and rows containing columns where the data type's input function raises
   an error. The last case is the only thing that would be a bit tricky to
   implement, I think: you could use PG_TRY() around the InputFunctionCall,
   but I guess you'd need a subtransaction to ensure that you reset your
   state correctly after catching an error.
 
  Yeah.  It's the subtransaction per row that's daunting --- not only the
  cycles spent for that, but the ensuing limitation to 4G rows imported
  per COPY.
 
  You could extend the COPY FROM syntax with a COMMIT EVERY n clause.  This
  would help with the 4G subtransaction limit.  The cost to the ETL process
  is that a simple rollback would not be guaranteed to send the process back
  to its initial state.  There are easy ways to deal with the rollback
  issue, though.
 
  A {NO} RETRY {USING algorithm} clause might be useful.  If the NO RETRY
  option is selected then the COPY FROM can run without subtransactions and
  in excess of the 4G per transaction limit.  NO RETRY should be the default
  since it preserves the legacy behavior of COPY FROM.
 
  You could have an EXCEPTIONS TO {filename|STDERR} clause.  I would not
  give the option of sending exceptions to a table, since they are
  presumably malformed; otherwise they would not be exceptions.  (Users
  should re-process exception files if they want "if good then table a,
  else exception to table b ...".)
 
  EXCEPTIONS TO and NO RETRY would be mutually exclusive.
 
 
  If we could somehow only do a subtransaction per failure, things would
  be much better, but I don't see how.

 Hello,

 Attached is a proof-of-concept patch for this TODO item.  There are no
 docs yet; I just wanted to know if the approach is sane.

 The added syntax is like the following:

   COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]

 The way it's done is by abusing Copy Both mode, and from my limited
 testing that seems to just work.  The error trapping itself is done
 using PG_TRY/PG_CATCH and can only catch formatting or before-insert
 trigger errors; no attempt is made to recover from a failed unique
 constraint, etc.

 Example in action:

 postgres=# \d test_copy2
    Table "public.test_copy2"
  Column |  Type   | Modifiers
 --------+---------+-----------
  id     | integer |
  val    | integer |

 postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
 1
 NOTICE:  missing data for column "val"
 CONTEXT:  COPY test_copy2, line 1: "1"
 2
 NOTICE:  missing data for column "val"
 CONTEXT:  COPY test_copy2, line 2: "2"
 3
 NOTICE:  missing data for column "val"
 CONTEXT:  COPY test_copy2, line 3: "3"
 NOTICE:  total exceptions ignored: 3

 postgres=# \d test_copy1
    Table "public.test_copy1"
  Column |  Type   | Modifiers
 --------+---------+-----------
  id     | integer | not null

 postgres=# set client_min_messages to warning;
 SET
 postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
 ...
 vmstat
 zoneinfo
 postgres=#

 Limited performance testing shows no significant difference between the
 error-catching and plain code paths.  For example, timing

   copy test_copy1 from program 'seq 100' [exceptions to stdout]

 shows similar numbers with or without the added EXCEPTIONS TO clause.

 Now that I'm sending this, I wonder if the original comment about the
 need for a subtransaction around every loaded line still holds.  Any
 example of what would not be properly rolled back by just PG_TRY?


This method is unsafe: exception handlers don't usually free memory, so
there is a risk of memory and resource leaks.

You can get the same performance with block subtransactions: if you use
one subtransaction per 1000 rows, the overhead of the subtransactions is
minimal.

When a block fails, you can fall back to a row-level subtransaction; that
works well when you expect mostly correct data.
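
A minimal sketch of this batching idea, using the backend's subtransaction
API (illustrative only, not code from the attached patch; load_next_row,
retry_rows_individually and BATCH_SIZE are hypothetical stand-ins for the
real per-row COPY work):

#include "postgres.h"

#include "access/xact.h"
#include "utils/memutils.h"
#include "utils/resowner.h"

#define BATCH_SIZE 1000                     /* rows per subtransaction */

static void load_next_row(void);            /* hypothetical per-row work */
static void retry_rows_individually(void);  /* hypothetical fallback */

static void
copy_one_batch(void)
{
    MemoryContext oldcxt = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;
    int         i;

    BeginInternalSubTransaction(NULL);

    PG_TRY();
    {
        for (i = 0; i < BATCH_SIZE; i++)
            load_next_row();
        /* the whole batch went through: commit the subtransaction */
        ReleaseCurrentSubTransaction();
        MemoryContextSwitchTo(oldcxt);
        CurrentResourceOwner = oldowner;
    }
    PG_CATCH();
    {
        /* leave the error context, then clean up the error state */
        MemoryContextSwitchTo(oldcxt);
        FlushErrorState();

        /* discard the failed batch and redo it one row at a time */
        RollbackAndReleaseCurrentSubTransaction();
        MemoryContextSwitchTo(oldcxt);
        CurrentResourceOwner = oldowner;

        retry_rows_individually();
    }
    PG_END_TRY();
}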

Regards

Pavel



 Happy hacking!
 --
 Alex







Re: [HACKERS] POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2014-12-26 Thread Pavel Stehule
2014-12-26 11:41 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:



 2014-12-25 22:23 GMT+01:00 Alex Shulgin a...@commandprompt.com:

 [...]


 This method is unsafe: exception handlers don't usually free memory, so
 there is a risk of memory and resource leaks.

 You can get the same performance with block subtransactions: if you use
 one subtransaction per 1000 rows, the overhead of the subtransactions is
 minimal.

 When a block fails, you can fall back to a row-level subtransaction; that
 works well when you expect mostly correct data.


Two years ago I wrote an extension that did this, but I have not had time
to finish it and push it upstream.

Regards

Pavel








ftcopy-04.tgz
Description: GNU Zip compressed data



[HACKERS] POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2014-12-25 Thread Alex Shulgin
Trent Shipley trent_ship...@qwest.net writes:

 On Friday 2007-12-14 16:22, Tom Lane wrote:
 Neil Conway ne...@samurai.com writes:
  By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
  to drop (and log) rows that contain malformed data. That is, rows with
  too many or too few columns, rows that result in constraint violations,
  and rows containing columns where the data type's input function raises
  an error. The last case is the only thing that would be a bit tricky to
  implement, I think: you could use PG_TRY() around the InputFunctionCall,
  but I guess you'd need a subtransaction to ensure that you reset your
  state correctly after catching an error.

 Yeah.  It's the subtransaction per row that's daunting --- not only the
 cycles spent for that, but the ensuing limitation to 4G rows imported
 per COPY.

 You could extend the COPY FROM syntax with a COMMIT EVERY n clause.  This
 would help with the 4G subtransaction limit.  The cost to the ETL process
 is that a simple rollback would not be guaranteed to send the process back
 to its initial state.  There are easy ways to deal with the rollback
 issue, though.

 A {NO} RETRY {USING algorithm} clause might be useful.  If the NO RETRY
 option is selected then the COPY FROM can run without subtransactions and
 in excess of the 4G per transaction limit.  NO RETRY should be the default
 since it preserves the legacy behavior of COPY FROM.

 You could have an EXCEPTIONS TO {filename|STDERR} clause.  I would not
 give the option of sending exceptions to a table, since they are
 presumably malformed; otherwise they would not be exceptions.  (Users
 should re-process exception files if they want "if good then table a,
 else exception to table b ...".)

 EXCEPTIONS TO and NO RETRY would be mutually exclusive.


 If we could somehow only do a subtransaction per failure, things would
 be much better, but I don't see how.

Hello,

Attached is a proof-of-concept patch for this TODO item.  There are no
docs yet; I just wanted to know if the approach is sane.

The added syntax is like the following:

  COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]

The way it's done is by abusing Copy Both mode, and from my limited
testing that seems to just work.  The error trapping itself is done
using PG_TRY/PG_CATCH and can only catch formatting or before-insert
trigger errors; no attempt is made to recover from a failed unique
constraint, etc.
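
To make the mechanics concrete, here is a minimal sketch of trapping a data
type input error with PG_TRY/PG_CATCH around InputFunctionCall, as suggested
upthread (illustrative only, not code from the attached patch; variable
names such as in_functions, typioparams and cstate follow copy.c
conventions, and report_exception() is a hypothetical stand-in for writing
the rejected line to the EXCEPTIONS TO target):

bool        ok = true;

PG_TRY();
{
    values[fieldno] = InputFunctionCall(&in_functions[fieldno],
                                        string,
                                        typioparams[fieldno],
                                        attr[fieldno]->atttypmod);
}
PG_CATCH();
{
    ErrorData  *edata;

    /* must switch out of the error context before copying the error */
    MemoryContextSwitchTo(oldcontext);
    edata = CopyErrorData();
    FlushErrorState();

    ereport(NOTICE, (errmsg("%s", edata->message)));
    report_exception(&cstate->line_buf);    /* hypothetical */
    FreeErrorData(edata);
    ok = false;                 /* tell the caller to skip this row */
}
PG_END_TRY();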

Example in action:

postgres=# \d test_copy2
   Table "public.test_copy2"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 val    | integer |

postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
1
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 1: "1"
2
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 2: "2"
3
NOTICE:  missing data for column "val"
CONTEXT:  COPY test_copy2, line 3: "3"
NOTICE:  total exceptions ignored: 3

postgres=# \d test_copy1
   Table "public.test_copy1"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null

postgres=# set client_min_messages to warning;
SET
postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
...
vmstat
zoneinfo
postgres=# 

Limited performance testing shows no significant difference between the
error-catching and plain code paths.  For example, timing

  copy test_copy1 from program 'seq 100' [exceptions to stdout]

shows similar numbers with or without the added EXCEPTIONS TO clause.

Now that I'm sending this, I wonder if the original comment about the
need for a subtransaction around every loaded line still holds.  Any
example of what would not be properly rolled back by just PG_TRY?

Happy hacking!
--
Alex

From 50f7ab0a503a0d61776add8a138abf2622fc6c35 Mon Sep 17 00:00:00 2001
From: Alex Shulgin a...@commandprompt.com
Date: Fri, 19 Dec 2014 18:21:31 +0300
Subject: [PATCH] POC: COPY FROM ... EXCEPTIONS TO

---
 contrib/file_fdw/file_fdw.c |   4 +-
 src/backend/commands/copy.c | 251 +---
 src/backend/parser/gram.y   |  26 +++-
 src/bin/psql/common.c   |  14 +-
 src/bin/psql/copy.c | 119 ++-
 src/bin/psql/settings.h |   1 +
 src/bin/psql/startup.c  |   1 +
 src/bin/psql/tab-complete.c |  12 +-
 src/include/commands/copy.h |   3 +-
 src/include/nodes/parsenodes.h  |   1 +
 src/include/parser/kwlist.h |   1 +
 src/interfaces/ecpg/preproc/ecpg.addons |   2 +-
 12 files changed, 396 insertions(+), 39 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
new file mode 100644
index 5a4d5aa..0df02f7
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
*** fileBeginForeignScan(ForeignScanState *n
*** 624,629 
--- 624,630 
  	cstate =