Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker

2005-12-27 Thread Bruce Momjian
Bruce Momjian wrote:
> Andrew Dunstan wrote:
> > 
> > 
> > Bruce Momjian wrote:
> > 
> > > The big problem is that \. is also a valid
> > >CSV data value (though not a valid non-CSV data value).  So, the
> > >solution we came up with was to require \. to appear alone on a line in
> > >CSV mode for it to be treated as end-of-copy.  
> > >
> > 
> > According to the docs, that's the way to specify EOD in both text and 
> > CSV mode:
> > 
> >   End of data can be represented by a single line containing just 
> > backslash-period.
> 
> Right, but in non-CSV mode, we allow \. at the end of any line because
> it is unique so I kept that behavior.  That is not documented however.
> 
> > Your analysis regarding line_buf.len seems correct.
> > 
> > We probably should have a regression test with \. in a CSV field.
> 
> Agreed.  My test for CSV was simple, just try loading:
> 
>   x\.
>   x\.b
>   \.c
> 
> all should load literally, but they fail.

OK, original patch applied to HEAD and smaller version to 8.1.X, and
regression test added, now attached.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
Index: src/test/regress/expected/copy2.out
===
RCS file: /cvsroot/pgsql/src/test/regress/expected/copy2.out,v
retrieving revision 1.22
diff -c -c -r1.22 copy2.out
*** src/test/regress/expected/copy2.out 26 Jun 2005 03:04:18 -  1.22
--- src/test/regress/expected/copy2.out 27 Dec 2005 18:19:36 -
***
*** 194,199 
--- 194,202 
  --test that we read consecutive LFs properly
  CREATE TEMP TABLE testnl (a int, b text, c int);
  COPY testnl FROM stdin CSV;
+ -- test end of copy marker
+ CREATE TEMP TABLE testeoc (a text);
+ COPY testeoc FROM stdin CSV;
  DROP TABLE x, y;
  DROP FUNCTION fn_x_before();
  DROP FUNCTION fn_x_after();
Index: src/test/regress/sql/copy2.sql
===
RCS file: /cvsroot/pgsql/src/test/regress/sql/copy2.sql,v
retrieving revision 1.13
diff -c -c -r1.13 copy2.sql
*** src/test/regress/sql/copy2.sql  26 Jun 2005 03:04:37 -  1.13
--- src/test/regress/sql/copy2.sql  27 Dec 2005 18:19:36 -
***
*** 139,144 
--- 139,153 
  inside",2
  \.
  
+ -- test end of copy marker
+ CREATE TEMP TABLE testeoc (a text);
+ 
+ COPY testeoc FROM stdin CSV;
+ a\.
+ \.b
+ c\.d
+ \.
+ 
  
  DROP TABLE x, y;
  DROP FUNCTION fn_x_before();

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker

2005-12-27 Thread Bruce Momjian
Andrew Dunstan wrote:
> 
> 
> Bruce Momjian wrote:
> 
> > The big problem is that \. is also a valid
> >CSV data value (though not a valid non-CSV data value).  So, the
> >solution we came up with was to require \. to appear alone on a line in
> >CSV mode for it to be treated as end-of-copy.  
> >
> 
> According to the docs, that's the way to specify EOD in both text and 
> CSV mode:
> 
>   End of data can be represented by a single line containing just 
> backslash-period.

Right, but in non-CSV mode, we allow \. at the end of any line because
it is unique so I kept that behavior.  That is not documented however.

> Your analysis regarding line_buf.len seems correct.
> 
> We probably should have a regression test with \. in a CSV field.

Agreed.  My test for CSV was simple, just try loading:

x\.
x\.b
\.c

all should load literally, but they fail.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy

2005-12-27 Thread Andrew Dunstan



Bruce Momjian wrote:


The big problem is that \. is also a valid
CSV data value (though not a valid non-CSV data value).  So, the
solution we came up with was to require \. to appear alone on a line in
CSV mode for it to be treated as end-of-copy.  



According to the docs, that's the way to specify EOD in both text and 
CSV mode:


 End of data can be represented by a single line containing just 
backslash-period.


Your analysis regarding line_buf.len seems correct.

We probably should have a regression test with \. in a CSV field.

cheers

andrew


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker corrupt

2005-12-26 Thread Bruce Momjian

Sorry for the delay in responding.  I have done research on your bug
report, and the problem seems even worse than you reported.  First, a
little background.  In non-CSV text mode, the backslash is the escape
character, so any character appearing after a backslash (all 255 of
them) is treated specially, e.g. \{delimiter}, \r, \n, and for our case
here '\.'.  (A literal backslash is \\).  In CSV mode, the quote is
special, but we don't have 255 special characters after a quote.  Only
two double-quotes, "", are special, a literal double-quote.

This behavior gives us problems for specifying the end-of-copy marker,
which is \, in both modes.  The big problem is that \. is also a valid
CSV data value (though not a valid non-CSV data value).  So, the
solution we came up with was to require \. to appear alone on a line in
CSV mode for it to be treated as end-of-copy.  Your idea of using quotes
worked, but it wasn't the right solution.  We need to enforce the
alone-on-a-line restriction.  Our code had:

if (c == '\\' && cstate->line_buf.len == 0)

The problem with that is the because of the input and _output_
buffering, cstate->line_buf.len could be zero even if we are not on the
first character of a line.  In fact, for a typical line, it is zero for
all characters on the line.  The proper solution is to introduce a
boolean, first_char_in_line, that we set as we enter the loop and clear
once we process a character.

Looking closer at the code, I see the reason for email comments like
"the copy code is nearing unmaintainability.  The CSV/non-CSV code was
already complex, but the buffering additions in 8.1 pushed it over the
edge.

I have restructured the line-reading code in copy.c by:

o  merging the CSV/non-CSV functions into a single function
o  used macros to centralize and clarify the buffering code
o  updated comments
o  renamed client_encoding_only to encoding_embeds_ascii
o  added a high-bit test to the encoding_embeds_ascii test for
   performance
o  in CSV mode, allow a backslash followed by a non-period to 
   continue being processed as a data value

There should be no performance impact from this patch because it is
functionally equivalent.  If you apply the patch you will see copy.c is
much clearer in this area now and might suggest additional
optimizations.

I have also attached a 8.1-only patch to fix the CSV \. handling bug
with no code restructuring.  

---

Ben Gould wrote:
> 
> The following bug has been logged online:
> 
> Bug reference:  2114
> Logged by:  Ben Gould
> Email address:  [EMAIL PROTECTED]
> PostgreSQL version: 8.1.0
> Operating system:   Mac OS X 10.4.3
> Description:(patch) COPY FROM ... end of copy marker corrupt
> Details: 
> 
> With a table like:
> 
> CREATE TABLE test_table (
> foo text,
> bar text,
> baz text
> );
> 
> Using this format for COPY FROM:
> 
> COPY test_table FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
> QUOTE AS '\"' ESCAPE AS '\"'
> 
> Where the file was generated via:
> 
> COPY test_table TO STDOUT WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
> QUOTE AS '\"' ESCAPE AS '\"' FORCE QUOTE foo, bar, baz;
> 
> I needed this patch:
> 
> <<<
> --- postgresql-8.1.0.original/src/backend/commands/copy.c   2005-12-13
> 13:18:16.0 +0100
> +++ postgresql-8.1.0/src/backend/commands/copy.c2005-12-13
> 13:28:28.0 +0100
> @@ -2531,7 +2531,7 @@
> /*
>  * In CSV mode, we only recognize \. at start of line
>  */
> -   if (c == '\\' && cstate->line_buf.len == 0)
> +   if (c == '\\' && !in_quote && cstate->line_buf.len == 0)
> {
> charc2;
> >>>
> 
> Because of this error message:
> 
> pg_endcopy warning: ERROR:  end-of-copy marker corrupt
> 
> (We have quoted strings containing things like ..\..\.. in the CSV file
> which broke the copy from.)
> 
> I was using DBD::Pg as the client library.
> 
> ---(end of broadcast)---
> TIP 3: Have you checked our extensive FAQ?
> 
>http://www.postgresql.org/docs/faq
> 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
Index: src/backend/commands/copy.c
===
RCS file: /cvsroot/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.255
diff -c -c -r1.255 copy.c
*** src/backend/commands/copy.c 22 Nov 2005 18:17:08 -  1.255
--- src/backend/commands/copy.c 27 Dec 2005 02:10:18 -
***
*** 76,94 
  
  /*
   * This struct contains all the state vari