Re: [HACKERS] POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
2014-12-25 22:23 GMT+01:00 Alex Shulgin <a...@commandprompt.com>:

> Trent Shipley <trent_ship...@qwest.net> writes:
>
>> On Friday 2007-12-14 16:22, Tom Lane wrote:
>>
>>> Neil Conway <ne...@samurai.com> writes:
>>>
>>>> By modifying COPY: COPY IGNORE ERRORS or some such would instruct
>>>> COPY to drop (and log) rows that contain malformed data. That is,
>>>> rows with too many or too few columns, rows that result in
>>>> constraint violations, and rows containing columns where the data
>>>> type's input function raises an error. The last case is the only
>>>> thing that would be a bit tricky to implement, I think: you could
>>>> use PG_TRY() around the InputFunctionCall, but I guess you'd need
>>>> a subtransaction to ensure that you reset your state correctly
>>>> after catching an error.
>>>
>>> Yeah. It's the subtransaction per row that's daunting --- not only
>>> the cycles spent for that, but the ensuing limitation to 4G rows
>>> imported per COPY.
>>
>> You could extend the COPY FROM syntax with a COMMIT EVERY n clause.
>> This would help with the 4G subtransaction limit. The cost to the
>> ETL process is that a simple rollback would not be guaranteed to
>> send the process back to its initial state. There are easy ways to
>> deal with the rollback issue, though.
>>
>> A {NO} RETRY {USING algorithm} clause might be useful. If the NO
>> RETRY option is selected, then the COPY FROM can run without
>> subtransactions and in excess of the 4G per transaction limit. NO
>> RETRY should be the default, since it preserves the legacy behavior
>> of COPY FROM.
>>
>> You could have an EXCEPTIONS TO {filename|STDERR} clause. I would
>> not give the option of sending exceptions to a table, since they
>> are presumably malformed; otherwise they would not be exceptions.
>> (Users should re-process exception files if they want "if good then
>> table A, else exceptions to table B".) EXCEPTIONS TO and NO RETRY
>> would be mutually exclusive.
>>
>> If we could somehow only do a subtransaction per failure, things
>> would be much better, but I don't see how.
>
> Hello,
>
> Attached is a proof of concept patch for this TODO item.
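Tom Lane's "4G rows" remark above refers to the 32-bit (sub)transaction ID space. A back-of-the-envelope model (the one-ID-per-retried-unit accounting and the row counts are my assumptions, not from the thread) shows why batching subtransactions, as with the proposed COMMIT EVERY n clause, sidesteps that ceiling:

```python
# Rough model of subtransaction ID consumption during a large COPY.
# Assumption (not from the thread): each retried unit costs one ID out
# of a 2**32 space, which is where the "4G rows" ceiling comes from.

XID_SPACE = 2 ** 32

def subxacts_per_row(total_rows):
    # One subtransaction per row: N rows cost N IDs.
    return total_rows

def subxacts_batched(total_rows, batch_size, failed_batches=0):
    # One subtransaction per batch, plus a per-row retry of each
    # failed batch.
    full_batches = -(-total_rows // batch_size)  # ceiling division
    return full_batches + failed_batches * batch_size

rows = 10_000_000_000  # 10 billion rows: impossible with per-row subxacts
assert subxacts_per_row(rows) > XID_SPACE
assert subxacts_batched(rows, batch_size=1000) < XID_SPACE
```

With batches of 1000 the ID budget drops by three orders of magnitude, so even very large loads stay far below the 2**32 limit unless nearly every batch fails.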
> There are no docs yet; I just wanted to know if the approach is
> sane. The added syntax is like the following:
>
>     COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]
>
> The way it's done is by abusing Copy Both mode, and from my limited
> testing, that seems to just work. The error trapping itself is done
> using PG_TRY/PG_CATCH and can only catch formatting or before-insert
> trigger errors; no attempt is made to recover from a failed unique
> constraint, etc.
>
> Example in action:
>
>     postgres=# \d test_copy2
>        Table "public.test_copy2"
>      Column |  Type   | Modifiers
>     --------+---------+-----------
>      id     | integer |
>      val    | integer |
>
>     postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
>     1
>     NOTICE:  missing data for column "val"
>     CONTEXT:  COPY test_copy2, line 1: "1"
>     2
>     NOTICE:  missing data for column "val"
>     CONTEXT:  COPY test_copy2, line 2: "2"
>     3
>     NOTICE:  missing data for column "val"
>     CONTEXT:  COPY test_copy2, line 3: "3"
>     NOTICE:  total exceptions ignored: 3
>
>     postgres=# \d test_copy1
>        Table "public.test_copy1"
>      Column |  Type   | Modifiers
>     --------+---------+-----------
>      id     | integer | not null
>
>     postgres=# set client_min_messages to warning;
>     SET
>     postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
>     ...
>     vmstat
>     zoneinfo
>     postgres=#
>
> Limited performance testing shows no significant difference between
> the error-catching and plain code paths. For example, timing
>
>     copy test_copy1 from program 'seq 100' [exceptions to stdout]
>
> shows similar numbers with or without the added EXCEPTIONS TO clause.
>
> Now that I'm sending this, I wonder if the original comment about the
> need for a subtransaction around every loaded line still holds. Any
> example of what would not be properly rolled back by just PG_TRY?

This method is unsafe:
Exception handlers usually don't free memory, so there is a risk of
memory leaks and resource leaks.

You can get the same performance with block subtransactions: when you
use one subtransaction per 1000 rows, the impact of subtransactions is
minimal. When a block fails, you can fall back to a row-level
subtransaction; that works well when you expect almost correct data.

Regards,
Pavel

> Happy hacking!
>
> --
> Alex

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
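Pavel's suggestion (wrap batches of 1000 rows in one subtransaction each, and only on a batch failure redo those rows with row-level subtransactions) can be sketched in pure Python. The commit/rollback machinery is simulated here with list bookkeeping and a counter, and the `is_bad` predicate is my stand-in for whatever error a real load would hit:

```python
# Simulation of batched subtransactions with row-level fallback.
# Assumption: "is_bad" stands in for any per-row failure (bad input,
# constraint violation); commits and rollbacks are simulated.

def load(rows, is_bad, batch_size=1000):
    loaded, rejected, subxacts = [], [], 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        subxacts += 1  # one subtransaction wraps the whole batch
        if not any(is_bad(r) for r in batch):
            loaded.extend(batch)  # the batch commits as a unit
            continue
        # The batch "rolled back": retry each row in its own subtransaction.
        for r in batch:
            subxacts += 1
            (rejected if is_bad(r) else loaded).append(r)
    return loaded, rejected, subxacts

# One bad row out of 10,000: only its batch pays the per-row cost.
loaded, rejected, used = load(list(range(10_000)), lambda r: r == 4242)
assert rejected == [4242] and len(loaded) == 9_999
assert used == 10 + 1_000  # 10 batch subxacts + 1000 row-level retries
```

For mostly clean data this keeps the subtransaction count near rows/1000; in the worst case (every batch failing) it degrades to roughly one subtransaction per row plus one per batch, which is Pavel's point about expecting "almost correct data".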
Re: [HACKERS] POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel.steh...@gmail.com>:

> [...]
Two years ago I wrote an extension that did this, but I have not had
time to finish it and push it upstream.

Regards,
Pavel

ftcopy-04.tgz
Description: GNU Zip compressed data
[HACKERS] POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
[...]

Happy hacking!

--
Alex

From 50f7ab0a503a0d61776add8a138abf2622fc6c35 Mon Sep 17 00:00:00 2001
From: Alex Shulgin <a...@commandprompt.com>
Date: Fri, 19 Dec 2014 18:21:31 +0300
Subject: [PATCH] POC: COPY FROM ...
 EXCEPTIONS TO
---
 contrib/file_fdw/file_fdw.c             |   4 +-
 src/backend/commands/copy.c             | 251 +---
 src/backend/parser/gram.y               |  26 +++-
 src/bin/psql/common.c                   |  14 +-
 src/bin/psql/copy.c                     | 119 ++-
 src/bin/psql/settings.h                 |   1 +
 src/bin/psql/startup.c                  |   1 +
 src/bin/psql/tab-complete.c             |  12 +-
 src/include/commands/copy.h             |   3 +-
 src/include/nodes/parsenodes.h          |   1 +
 src/include/parser/kwlist.h             |   1 +
 src/interfaces/ecpg/preproc/ecpg.addons |   2 +-
 12 files changed, 396 insertions(+), 39 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
new file mode 100644
index 5a4d5aa..0df02f7
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
*** fileBeginForeignScan(ForeignScanState *n
*** 624,629 ****
--- 624,630 ----
  	cstate =
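The EXCEPTIONS TO behavior demonstrated in this thread (good rows are loaded, malformed rows are diverted with a notice) can be approximated client-side. This is an illustrative sketch, not the patch's server-side implementation; the two-column layout, the delimiter, and the function name mirror the test_copy2 example and are my assumptions:

```python
# Client-side approximation of COPY ... EXCEPTIONS TO: convert each
# line, keep the rows that parse, and divert failures to an exceptions
# list (a real EXCEPTIONS TO target would be a file or stdout).

def copy_with_exceptions(lines, ncols=2, delimiter='\t'):
    accepted, exceptions = [], []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip('\n').split(delimiter)
        if len(fields) != ncols:
            # analogous to NOTICE: missing data for column
            exceptions.append((lineno, line, 'missing data for column'))
            continue
        try:
            # stands in for the data type's input function
            accepted.append(tuple(int(f) for f in fields))
        except ValueError as exc:
            exceptions.append((lineno, line, str(exc)))
    return accepted, exceptions

# 'seq 3' yields one column per line, so, as in the example session,
# every row becomes an exception: "total exceptions ignored: 3".
good, bad = copy_with_exceptions(['1', '2', '3'])
assert good == [] and len(bad) == 3
```

Note that, like the patch, this only traps per-row conversion errors; anything transactional (a unique-constraint violation, for instance) is exactly the case where the thread's subtransaction discussion applies.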