Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker
Bruce Momjian wrote: > Andrew Dunstan wrote: > > > > > > Bruce Momjian wrote: > > > > > The big problem is that \. is also a valid > > >CSV data value (though not a valid non-CSV data value). So, the > > >solution we came up with was to require \. to appear alone on a line in > > >CSV mode for it to be treated as end-of-copy. > > > > > > > According to the docs, that's the way to specify EOD in both text and > > CSV mode: > > > > End of data can be represented by a single line containing just > > backslash-period. > > Right, but in non-CSV mode, we allow \. at the end of any line because > it is unique so I kept that behavior. That is not documented however. > > > Your analysis regarding line_buf.len seems correct. > > > > We probably should have a regression test with \. in a CSV field. > > Agreed. My test for CSV was simple, just try loading: > > x\. > x\.b > \.c > > all should load literally, but they fail. OK, original patch applied to HEAD and smaller version to 8.1.X, and regression test added, now attached. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 Index: src/test/regress/expected/copy2.out === RCS file: /cvsroot/pgsql/src/test/regress/expected/copy2.out,v retrieving revision 1.22 diff -c -c -r1.22 copy2.out *** src/test/regress/expected/copy2.out 26 Jun 2005 03:04:18 - 1.22 --- src/test/regress/expected/copy2.out 27 Dec 2005 18:19:36 - *** *** 194,199 --- 194,202 --test that we read consecutive LFs properly CREATE TEMP TABLE testnl (a int, b text, c int); COPY testnl FROM stdin CSV; + -- test end of copy marker + CREATE TEMP TABLE testeoc (a text); + COPY testeoc FROM stdin CSV; DROP TABLE x, y; DROP FUNCTION fn_x_before(); DROP FUNCTION fn_x_after(); Index: src/test/regress/sql/copy2.sql === RCS file: /cvsroot/pgsql/src/test/regress/sql/copy2.sql,v retrieving revision 1.13 diff -c -c -r1.13 copy2.sql *** src/test/regress/sql/copy2.sql 26 Jun 2005 03:04:37 - 1.13 --- src/test/regress/sql/copy2.sql 27 Dec 2005 18:19:36 - *** *** 139,144 --- 139,153 inside",2 \. + -- test end of copy marker + CREATE TEMP TABLE testeoc (a text); + + COPY testeoc FROM stdin CSV; + a\. + \.b + c\.d + \. + DROP TABLE x, y; DROP FUNCTION fn_x_before(); ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker
Andrew Dunstan wrote: > > > Bruce Momjian wrote: > > > The big problem is that \. is also a valid > >CSV data value (though not a valid non-CSV data value). So, the > >solution we came up with was to require \. to appear alone on a line in > >CSV mode for it to be treated as end-of-copy. > > > > According to the docs, that's the way to specify EOD in both text and > CSV mode: > > End of data can be represented by a single line containing just > backslash-period. Right, but in non-CSV mode, we allow \. at the end of any line because it is unique so I kept that behavior. That is not documented however. > Your analysis regarding line_buf.len seems correct. > > We probably should have a regression test with \. in a CSV field. Agreed. My test for CSV was simple, just try loading: x\. x\.b \.c all should load literally, but they fail. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy
Bruce Momjian wrote: The big problem is that \. is also a valid CSV data value (though not a valid non-CSV data value). So, the solution we came up with was to require \. to appear alone on a line in CSV mode for it to be treated as end-of-copy. According to the docs, that's the way to specify EOD in both text and CSV mode: End of data can be represented by a single line containing just backslash-period. Your analysis regarding line_buf.len seems correct. We probably should have a regression test with \. in a CSV field. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker corrupt
Sorry for the delay in responding. I have done research on your bug report, and the problem seems even worse than you reported. First, a little background. In non-CSV text mode, the backslash is the escape character, so any character appearing after a backslash (all 255 of them) is treated specially, e.g. \{delimiter}, \r, \n, and for our case here '\.'. (A literal backslash is \\). In CSV mode, the quote is special, but we don't have 255 special characters after a quote. Only two double-quotes, "", are special, a literal double-quote. This behavior gives us problems for specifying the end-of-copy marker, which is \, in both modes. The big problem is that \. is also a valid CSV data value (though not a valid non-CSV data value). So, the solution we came up with was to require \. to appear alone on a line in CSV mode for it to be treated as end-of-copy. Your idea of using quotes worked, but it wasn't the right solution. We need to enforce the alone-on-a-line restriction. Our code had: if (c == '\\' && cstate->line_buf.len == 0) The problem with that is the because of the input and _output_ buffering, cstate->line_buf.len could be zero even if we are not on the first character of a line. In fact, for a typical line, it is zero for all characters on the line. The proper solution is to introduce a boolean, first_char_in_line, that we set as we enter the loop and clear once we process a character. Looking closer at the code, I see the reason for email comments like "the copy code is nearing unmaintainability. The CSV/non-CSV code was already complex, but the buffering additions in 8.1 pushed it over the edge. I have restructured the line-reading code in copy.c by: o merging the CSV/non-CSV functions into a single function o used macros to centralize and clarify the buffering code o updated comments o renamed client_encoding_only to encoding_embeds_ascii o added a high-bit test to the encoding_embeds_ascii test for performance o in CSV mode, allow a backslash followed by a non-period to continue being processed as a data value There should be no performance impact from this patch because it is functionally equivalent. If you apply the patch you will see copy.c is much clearer in this area now and might suggest additional optimizations. I have also attached a 8.1-only patch to fix the CSV \. handling bug with no code restructuring. --- Ben Gould wrote: > > The following bug has been logged online: > > Bug reference: 2114 > Logged by: Ben Gould > Email address: [EMAIL PROTECTED] > PostgreSQL version: 8.1.0 > Operating system: Mac OS X 10.4.3 > Description:(patch) COPY FROM ... end of copy marker corrupt > Details: > > With a table like: > > CREATE TABLE test_table ( > foo text, > bar text, > baz text > ); > > Using this format for COPY FROM: > > COPY test_table FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL' > QUOTE AS '\"' ESCAPE AS '\"' > > Where the file was generated via: > > COPY test_table TO STDOUT WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL' > QUOTE AS '\"' ESCAPE AS '\"' FORCE QUOTE foo, bar, baz; > > I needed this patch: > > <<< > --- postgresql-8.1.0.original/src/backend/commands/copy.c 2005-12-13 > 13:18:16.0 +0100 > +++ postgresql-8.1.0/src/backend/commands/copy.c2005-12-13 > 13:28:28.0 +0100 > @@ -2531,7 +2531,7 @@ > /* > * In CSV mode, we only recognize \. at start of line > */ > - if (c == '\\' && cstate->line_buf.len == 0) > + if (c == '\\' && !in_quote && cstate->line_buf.len == 0) > { > charc2; > >>> > > Because of this error message: > > pg_endcopy warning: ERROR: end-of-copy marker corrupt > > (We have quoted strings containing things like ..\..\.. in the CSV file > which broke the copy from.) > > I was using DBD::Pg as the client library. > > ---(end of broadcast)--- > TIP 3: Have you checked our extensive FAQ? > >http://www.postgresql.org/docs/faq > -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 Index: src/backend/commands/copy.c === RCS file: /cvsroot/pgsql/src/backend/commands/copy.c,v retrieving revision 1.255 diff -c -c -r1.255 copy.c *** src/backend/commands/copy.c 22 Nov 2005 18:17:08 - 1.255 --- src/backend/commands/copy.c 27 Dec 2005 02:10:18 - *** *** 76,94 /* * This struct contains all the state vari