Re: [HACKERS] multiline CSV fields

2004-12-02 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 + if (!embedded_line_warning && (c == '\n' || c == '\r'))
 + {
 +     embedded_line_warning = true;
 +     elog(WARNING,
 +          "CSV fields with embedded linefeed or carriage return "
 +          "characters might not be able to be reimported");
 + }

What about forcibly translating them to the two-character sequences \n
or \r?  Or is that not considered a CSV-compatible representation?

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] multiline CSV fields

2004-12-02 Thread Andrew Dunstan

Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
 

+ 		if (!embedded_line_warning && (c == '\n' || c == '\r'))
+ 		{
+ 			embedded_line_warning = true;
+ 			elog(WARNING,
+ 				 "CSV fields with embedded linefeed or carriage return "
+ 				 "characters might not be able to be reimported");
+ 		}
   

What about forcibly translating them to the two-character sequences \n
or \r?  Or is that not considered a CSV-compatible representation?
 

Not compatible AFAIK. Certainly not portably. And the warning would 
still be true, because we don't do this unescaping on the way back in. I 
think the way the comment in the patch suggests and previous emails have 
discussed is the right way to go with this - I will attend to it after 
we branch ;-)

cheers
andrew
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Kris Jurka


On Tue, 30 Nov 2004, Greg Stark wrote:

 
 Andrew Dunstan [EMAIL PROTECTED] writes:
 
  The advantage of having it in COPY is that it can be done serverside
  direct from the file system. For massive bulk loads that might be a
  plus, although I don't know what the protocol+socket overhead is.
 
 Actually even if you use client-side COPY it's *still* more efficient than any
 more general client-side alternative.

The idea would be to still use COPY just from a program that did 
additional processing, not as direct SQL.

 As Tom pointed out to me a while back, neither the protocol nor libpq allow
 for having multiple queries in flight simultaneously. That makes it impossible
 to stream large quantities of data to the server efficiently. Each record
 requires a round-trip and context switch overhead.

Multiplexing queries is different from having multiple queries in flight.  
You can have multiple queries on the wire now, they are just processed 
sequentially.

 In an ideal world the client should be able to queue up enough records to fill
 the socket buffers and allow the kernel to switch to a more batch oriented
 context switch mode where the server can process large numbers of records
 before switching back to the client. Ideally this would apply to any kind of
 query execution.

This is possible now with the V3 protocol (and used by the JDBC driver).  
For an executeBatch() on a PreparedStatement, the driver sends a parse
message, then any number of bind/execute pairs, but with a buffered stream
nothing happens until a Sync message is sent and the stream is flushed.  
Then it collects the results of all of the executes.

Kris Jurka

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Andrew Dunstan

Greg Stark wrote:
Personally I find the current CSV support inadequate. It seems pointless to
support CSV if it can't load data exported from Excel, which seems like the
main use case. 
 

OK, I'm starting to get mildly annoyed now. We have identified one 
failure case connected with multiline fields. Even in the multiline 
case, I expect that the embedded newlines will usually match those of 
the CSV file, in which case there will be no failure. It's a very big 
step from there to the far more general "can't load data exported from 
Excel". Or did you have some other limitation in mind?

FWIW, I don't make a habit of using multiline fields in my spreadsheets 
- and some users I have spoken to aren't even aware that you can have 
them at all.

cheers
andrew
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Ben . Young
 
 
 Greg Stark wrote:
 
 Personally I find the current CSV support inadequate. It seems pointless to
 support CSV if it can't load data exported from Excel, which seems like the
 main use case. 
  
 
 
 OK, I'm starting to get mildly annoyed now. We have identified one 
 failure case connected with multiline fields. Even in the multiline 
 case, I expect that the embedded newlines will usually match those of 
 the CSV file, in which case there will be no failure. It's a very big 
 step from there to the far more general "can't load data exported from 
 Excel". Or did you have some other limitation in mind?
 
 FWIW, I don't make a habit of using multiline fields in my spreadsheets 
 - and some users I have spoken to aren't even aware that you can have 
 them at all.
 

I am normally more of a lurker on these lists, but I thought you had 
better know that when we developed CSV import/export for an application 
at my last company we discovered that Excel can't always even read the 
CSV that _it_ has output! (With embedded newlines a particular problem.)

It is far more reliable if you output your data as an HTML table, in 
which case it practically always gets it right. Perhaps Postgres could 
support this as an import/export mechanism, as I have found it to be far 
less error prone than CSV?

Cheers,
Ben Young


 cheers
 
 andrew
 
 ---(end of broadcast)---
 TIP 5: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faqs/FAQ.html


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Andrew Dunstan

[EMAIL PROTECTED] wrote:
I am normally more of a lurker on these lists, but I thought you had 
better know that when we developed CSV import/export for an application 
at my last company we discovered that Excel can't always even read the 
CSV that _it_ has output! (With embedded newlines a particular problem.)
 


Quite so.  This is not about being perfect, and there will be failures. 
But, absent this problem the feature should work reasonably well.

Note that Excel is not the only kid on the block - if it were, there 
would be a good case for avoiding CSV anyway and instead reading/writing 
the Excel files directly. We included support for CSVs because, 
notwithstanding how braindead the format is, it is still widely used in 
data exchange.

It is far more reliable if you output your data as an HTML table, in 
which case it practically always gets it right. Perhaps Postgres could 
support this as an import/export mechanism, as I have found it to be far 
less error prone than CSV?
 

That's probably possible, but rather ugly. A well-defined XML format 
would be far better (don't we have something like that in contrib?). 
Something for a bulk-export facility, I think.

cheers
andrew

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Greg Stark

Andrew Dunstan [EMAIL PROTECTED] writes:

 FWIW, I don't make a habit of using multiline fields in my spreadsheets - and
 some users I have spoken to aren't even aware that you can have them at all.

Unfortunately I don't get a choice. I offer a field on the web site where
users can upload an excel sheet. Some of the fields of my database are
expected to have multi-line text in them. So I expect virtually all the
uploads to have multi-line fields in them.

So as far as I'm concerned, this import system is simply unusable. I have to
write a program to do the import. Since I was always planning to write such a
program I'm not too disappointed though.

-- 
greg


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Bruce Momjian
Andrew Dunstan wrote:
 
 
 Greg Stark wrote:
 
 Personally I find the current CSV support inadequate. It seems pointless to
 support CSV if it can't load data exported from Excel, which seems like the
 main use case. 
   
 
 
 OK, I'm starting to get mildly annoyed now. We have identified one 
 failure case connected with multiline fields. Even in the multiline 
 case, I expect that the embedded newlines will usually match those of 
 the CSV file, in which case there will be no failure. It's a very big 
 step from there to the far more general "can't load data exported from 
 Excel". Or did you have some other limitation in mind?
 
 FWIW, I don't make a habit of using multiline fields in my spreadsheets 
 - and some users I have spoken to aren't even aware that you can have 
 them at all.

I am wondering if one good solution would be to pre-process the input
stream in copy.c to convert newline to \n and carriage return to \r and
double data backslashes and tell copy.c to interpret those like it does
for normal text COPY files.  That way, the changes to copy.c might be
minimal; basically, place a filter in front of the CSV file that cleans
up the input so it can be more easily processed.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Andrew Dunstan

Bruce Momjian wrote:
I am wondering if one good solution would be to pre-process the input
stream in copy.c to convert newline to \n and carriage return to \r and
double data backslashes and tell copy.c to interpret those like it does
for normal text COPY files.  That way, the changes to copy.c might be
minimal; basically, place a filter in front of the CSV file that cleans
up the input so it can be more easily processed.
 

This would have to parse the input stream, because you would need to 
know which CRs and LFs were part of the data stream and so should be 
escaped, and which really ended data lines and so should be left alone. 
However, while the idea is basically sound, parsing the stream twice 
seems crazy. My argument has been that at this stage in the dev cycle we 
should document the limitation, maybe issue a warning as you want, and 
make the more invasive code changes to fix it properly in 8.1. If you 
don't want to wait, then following your train of thought a bit, ISTM 
that the correct solution is a routine for CSV mode that combines the 
functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a 
genuine and fast fix for Greg's and Darcy's problem.

cheers
andrew
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] multiline CSV fields

2004-11-30 Thread Bruce Momjian
Andrew Dunstan wrote:
 
 
 Bruce Momjian wrote:
 
 I am wondering if one good solution would be to pre-process the input
 stream in copy.c to convert newline to \n and carriage return to \r and
 double data backslashes and tell copy.c to interpret those like it does
 for normal text COPY files.  That way, the changes to copy.c might be
 minimal; basically, place a filter in front of the CSV file that cleans
 up the input so it can be more easily processed.
   
 
 
 This would have to parse the input stream, because you would need to 
 know which CRs and LFs were part of the data stream and so should be 
 escaped, and which really ended data lines and so should be left alone. 
 However, while the idea is basically sound, parsing the stream twice 
 seems crazy. My argument has been that at this stage in the dev cycle we 
 should document the limitation, maybe issue a warning as you want, and 
 make the more invasive code changes to fix it properly in 8.1. If you 

OK, right.

 don't want to wait, then following your train of thought a bit, ISTM 
 that the correct solution is a routine for CSV mode that combines the 
 functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a 
 genuine and fast fix for Greg's and Darcy's problem.

We are fine for 8.0, except for the warning, and you think we can fix it
perfectly in 8.1, good.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Which we do not have, because pg_dump doesn't use CSV.  I do not think
 this is a must-fix, especially not if the proposed fix introduces
 inconsistencies elsewhere.

 Sure, pg_dump doesn't use it but COPY should be able to load anything it
 output.

I'd buy into that proposition if CSV showed any evidence of being a
sanely defined format, but it shows every indication of being neither
well-defined, nor self-consistent, nor even particularly portable.
I suggest adjusting your expectations.  All I expect from that code is
being able to load the majority of data from the more popular Microsloth
applications.  Trying to achieve 100% consistency for corner cases is
just going to interfere with the real use-case for the feature, which is
coping with output from applications that aren't very consistent in the
first place.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  Which we do not have, because pg_dump doesn't use CSV.  I do not think
  this is a must-fix, especially not if the proposed fix introduces
  inconsistencies elsewhere.
 
  Sure, pg_dump doesn't use it but COPY should be able to load anything it
  output.
 
 I'd buy into that proposition if CSV showed any evidence of being a
 sanely defined format, but it shows every indication of being neither
 well-defined, nor self-consistent, nor even particularly portable.
 I suggest adjusting your expectations.  All I expect from that code is
 being able to load the majority of data from the more popular Microsloth
 applications.  Trying to achieve 100% consistency for corner cases is
 just going to interfere with the real use-case for the feature, which is
 coping with output from applications that aren't very consistent in the
 first place.

OK, then should we disallow dumping out data in CSV format that we can't
load?  Seems like the least we should do for 8.0.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Andrew Dunstan

Bruce Momjian wrote:
Tom Lane wrote:
 

Bruce Momjian [EMAIL PROTECTED] writes:
   

Tom Lane wrote:
 

Which we do not have, because pg_dump doesn't use CSV.  I do not think
this is a must-fix, especially not if the proposed fix introduces
inconsistencies elsewhere.
   

Sure, pg_dump doesn't use it but COPY should be able to load anything it
output.
 

I'd buy into that proposition if CSV showed any evidence of being a
sanely defined format, but it shows every indication of being neither
well-defined, nor self-consistent, nor even particularly portable.
I suggest adjusting your expectations.  All I expect from that code is
being able to load the majority of data from the more popular Microsloth
applications.  Trying to achieve 100% consistency for corner cases is
just going to interfere with the real use-case for the feature, which is
coping with output from applications that aren't very consistent in the
first place.
   

OK, then should we disallow dumping out data in CSV format that we can't
load?  Seems like the least we should do for 8.0.
 

As Tom rightly points out, having data make the round trip was not the 
goal of the exercise. Excel, for example, has no trouble reading such 
data (or at least my installation of it).

Personally I consider CSVs with line end chars embedded in fields to be 
broken anyway, but this was something that was specifically mentioned 
when we were discussing requirements, which is why I coded for it.

cheers
andrew

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Bruce Momjian
Andrew Dunstan wrote:
 OK, then should we disallow dumping out data in CSV format that we can't
 load?  Seems like the least we should do for 8.0.
 
   
 
 
 As Tom rightly points out, having data make the round trip was not the 
 goal of the exercise. Excel, for example, has no trouble reading such 
 data (or at least my installation of it).
 
 Personally I consider CSVs with line end chars embedded in fields to be 
 broken anyway, but this was something that was specifically mentioned 
 when we were discussing requirements, which is why I coded for it.

OK, I am pretty uncomfortable with this but you know this usage better
than I do.  Should we issue a warning message stating it will not be
able to be reloaded?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Bruce Momjian
Bruce Momjian wrote:
 Andrew Dunstan wrote:
  OK, then should we disallow dumping out data in CSV format that we can't
  load?  Seems like the least we should do for 8.0.
  

  
  
  As Tom rightly points out, having data make the round trip was not the 
  goal of the exercise. Excel, for example, has no trouble reading such 
  data (or at least my installation of it).
  
  Personally I consider CSVs with line end chars embedded in fields to be 
  broken anyway, but this was something that was specifically mentioned 
  when we were discussing requirements, which is why I coded for it.
 
 OK, I am pretty uncomfortable with this but you know this usage better
 than I do.  Should we issue a warning message stating it will not be
 able to be reloaded?

Also, can you explain why we can't read across a newline to the next
quote?  Is it a problem with the way our code is structured or is it a
logical problem?  Someone mentioned multibyte encodings but I don't
understand how that applies here.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Andrew Dunstan

Bruce Momjian wrote:
Andrew Dunstan wrote:
 

OK, then should we disallow dumping out data in CSV format that we can't
load?  Seems like the least we should do for 8.0.

 

As Tom rightly points out, having data make the round trip was not the 
goal of the exercise. Excel, for example, has no trouble reading such 
data (or at least my installation of it).

Personally I consider CSVs with line end chars embedded in fields to be 
broken anyway, but this was something that was specifically mentioned 
when we were discussing requirements, which is why I coded for it.
   

OK, I am pretty uncomfortable with this but you know this usage better
than I do.  Should we issue a warning message stating it will not be
able to be reloaded?
 

If it bothers you that much. I'd make a flag, cleared at the start of 
each COPY, and then where we test for CR or LF in CopyAttributeOutCSV, 
if the flag is not set then set it and issue the warning.

Longer term I'd like to be able to have a command parameter that 
specifies certain fields as multiline and for those relax the line end 
matching restriction (and for others forbid multiline altogether). That 
would be a TODO for 8.1 though, along with optional special handling for 
first line column headings.

cheers
andrew
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Andrew Dunstan

Bruce Momjian wrote:
Also, can you explain why we can't read across a newline to the next
quote?  Is it a problem with the way our code is structured or is it a
logical problem?  Someone mentioned multibyte encodings but I don't
understand how that applies here.
 

In a CSV file, each line is a record. Reading across a newline for the 
next quote (assuming the next field is quoted) would mean stealing 
fields from the next record.

I did see one complaint about missing or extra fields at the end of a 
record - I think it is reasonable for us to expect the data to be 
rectangular, and not ragged.

(I hope this answers your question - I am not 100% certain I understood 
it.)

cheers
andrew
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Also, can you explain why we can't read across a newline to the next
 quote?  Is it a problem with the way our code is structured or is it a
 logical problem?

It's a structural issue in the sense that we separate the act of
dividing the input into rows from the act of dividing it into columns.
I do not think that separation is wrong however.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Kris Jurka


On Mon, 29 Nov 2004, Andrew Dunstan wrote:

 Longer term I'd like to be able to have a command parameter that 
 specifies certain fields as multiline and for those relax the line end 
 matching restriction (and for others forbid multiline altogether). That 
 would be a TODO for 8.1 though, along with optional special handling for 
 first line column headings.
 

Endlessly extending the COPY command doesn't seem like a winning 
proposition to me and I think if we aren't comfortable telling every user 
to write a script to pre/post-process the data we should instead provide a 
bulk loader/unloader that transforms things to our limited COPY 
functionality.  There are all kinds of feature requests I've seen 
along these lines that would make COPY a million option mess if we try to 
support all of it directly.

- skipping header rows
- skipping certain data file columns
- specifying date formats
- ignoring duplicates
- outputting an arbitrary SELECT statement

Kris Jurka


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Bruce Momjian
Kris Jurka wrote:
 
 
 On Mon, 29 Nov 2004, Andrew Dunstan wrote:
 
  Longer term I'd like to be able to have a command parameter that 
  specifies certain fields as multiline and for those relax the line end 
  matching restriction (and for others forbid multiline altogether). That 
  would be a TODO for 8.1 though, along with optional special handling for 
  first line column headings.
  
 
 Endlessly extending the COPY command doesn't seem like a winning 
 proposition to me and I think if we aren't comfortable telling every user 
 to write a script to pre/post-process the data we should instead provide a 
 bulk loader/unloader that transforms things to our limited COPY 
 functionality.  There are all kinds of feature requests I've seen 
 along these lines that would make COPY a million option mess if we try to 
 support all of it directly.
 
 - skipping header rows
 - skipping certain data file columns
 - specifying date formats
 - ignoring duplicates
 - outputting an arbitrary SELECT statement

Agreed. There are lots of wishes for COPY and it will become bloated if
we do them all.

I am concerned someone will say, "Oh, I know the CSV format and I might
load the data into another database someday so I will always use CSV,"
not knowing it isn't a 100% consistent format.  I think we need to
issue a warning if a \r or \n is output by COPY CSV just so people
understand the limitation.  We can then reevaluate where we need to go
for 8.1.

Open item updated:

* warn on COPY TO ... CSV with \r,\n in data

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Tom Lane
Kris Jurka [EMAIL PROTECTED] writes:
 Endlessly extending the COPY command doesn't seem like a winning 
 proposition to me and I think if we aren't comfortable telling every user 
 to write a script to pre/post-process the data we should instead provide a 
 bulk loader/unloader that transforms things to our limited COPY 
 functionality.  There are all kinds of feature requests I've seen 
 along these lines that would make COPY a million option mess if we try to 
 support all of it directly.

I agree completely --- personally I'd not have put CSV into the backend
either.

IIRC we already have a TODO item for a separate bulk loader, but no
one's stepped up to the plate yet :-(

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Bruce Momjian
Tom Lane wrote:
 Kris Jurka [EMAIL PROTECTED] writes:
  Endlessly extending the COPY command doesn't seem like a winning 
  proposition to me and I think if we aren't comfortable telling every user 
  to write a script to pre/post-process the data we should instead provide a 
  bulk loader/unloader that transforms things to our limited COPY 
  functionality.  There are all kinds of feature requests I've seen 
  along these lines that would make COPY a million option mess if we try to 
  support all of it directly.
 
 I agree completely --- personally I'd not have put CSV into the backend
 either.

What pushed us was the large number of request for it.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Andrew Dunstan

Tom Lane wrote:
Kris Jurka [EMAIL PROTECTED] writes:
 

Endlessly extending the COPY command doesn't seem like a winning 
proposition to me and I think if we aren't comfortable telling every user 
to write a script to pre/post-process the data we should instead provide a 
bulk loader/unloader that transforms things to our limited COPY 
functionality.  There are all kinds of feature requests I've seen 
along these lines that would make COPY a million option mess if we try to 
support all of it directly.
   

I agree completely --- personally I'd not have put CSV into the backend
either.

IIRC we already have a TODO item for a separate bulk loader, but no
one's stepped up to the plate yet :-(

IIRC, the way it happened was that a proposal was made to do CSV 
import/export in a fairly radical way; I countered with a much more 
modest approach, which was generally accepted and which Bruce and I then 
implemented, not without some angst (as well as a little sturm und drang).

The advantage of having it in COPY is that it can be done serverside 
direct from the file system. For massive bulk loads that might be a 
plus, although I don't know what the protocol+socket overhead is. Maybe 
it would just be lost in the noise. Certainly I can see some sense in 
having COPY deal with straightforward cases and a bulk-load-unload 
program in bin to handle the hairier cases. Multiline fields would come 
into that category. The bulk-load-unload facility could possibly handle 
things other than CSV format too (XML anyone?). The nice thing about an 
external program is that it would not have to handle data embedded in an 
SQL stream, so the dangers from shifts in newline style, missing quotes, 
and the like would be far lower.

We do need to keep things in perspective a bit. The small wrinkle that 
has spawned this whole thread will not affect most users of the facility 
- and many many users will thanks us for having provided it.

cheers
andrew
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] multiline CSV fields

2004-11-29 Thread Greg Stark

Andrew Dunstan [EMAIL PROTECTED] writes:

 The advantage of having it in COPY is that it can be done serverside direct
 from the file system. For massive bulk loads that might be a plus, although I
 don't know what the protocol+socket overhead is. 

Actually even if you use client-side COPY it's *still* more efficient than any
more general client-side alternative.

As Tom pointed out to me a while back, neither the protocol nor libpq allow
for having multiple queries in flight simultaneously. That makes it impossible
to stream large quantities of data to the server efficiently. Each record
requires a round-trip and context switch overhead.

In an ideal world the client should be able to queue up enough records to fill
the socket buffers and allow the kernel to switch to a more batch oriented
context switch mode where the server can process large numbers of records
before switching back to the client. Ideally this would apply to any kind of
query execution.

But as a kind of short cut towards this for bulk loading I'm curious whether
it would be possible to adopt a sort of batch execution mode where a statement
is prepared, then parameters to the statement are streamed to the server in a
kind of COPY mode. It would have to be some format that allowed for embedded
newlines though; there's just no point in an interface that can't handle
arbitrary data.

Personally I find the current CSV support inadequate. It seems pointless to
support CSV if it can't load data exported from Excel, which seems like the
main use case. But I always thought bulk loading should be from some external
application anyways. The problem is that there isn't any interface suitable
for an external application to use.

-- 
greg


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] multiline CSV fields

2004-11-28 Thread Bruce Momjian

OK, what solutions do we have for this?  Not being able to load dumped
data is a serious bug.  I have added this to the open items list:

* fix COPY CSV with \r,\n in data

My feeling is that if we are in a quoted string we just process whatever
characters we find, even passing through an EOL.  I realize it might not
mark missing quote errors well but that seems minor compared to not
loading valid data.

---

Andrew Dunstan wrote:
 
 This example should fail  on data line 2 or 3 on any platform, 
 regardless of the platform's line-end convention, although I haven't 
 tested on Windows.
 
 cheers
 
 andrew
 
 [EMAIL PROTECTED] inst]$ bin/psql -e -f csverr.sql ; od -c 
 /tmp/csverrtest.csv
 create table csverrtest (a int, b text, c int);
 CREATE TABLE
 insert into csverrtest values(1,'a',1);
 INSERT 122471 1
 insert into csverrtest values(2,'foo\r\nbar',2);
 INSERT 122472 1
 insert into csverrtest values(3,'baz\nblurfl',3);
 INSERT 122473 1
 insert into csverrtest values(4,'d',4);
 INSERT 122474 1
 insert into csverrtest values(5,'e',5);
 INSERT 122475 1
 copy csverrtest to '/tmp/csverrtest.csv' csv;
 COPY
 truncate csverrtest;
 TRUNCATE TABLE
 copy csverrtest from '/tmp/csverrtest.csv' csv;
 psql:cvserr.sql:9: ERROR:  literal carriage return found in data
 HINT:  Use \r to represent carriage return.
 CONTEXT:  COPY csverrtest, line 2: 2,foo
 drop table csverrtest;
 DROP TABLE
 000   1   ,   a   ,   1  \n   2   ,  f   o   o  \r  \n   b   a
 020   r  ,   2  \n   3   ,  b   a   z  \n   b   l   u   r
 040   f   l  ,   3  \n   4   ,   d   ,   4  \n   5   ,   e   ,
 060   5  \n
 062
 [EMAIL PROTECTED] inst]$
 
 Bruce Momjian wrote:
 
 Can I see an example of such a failure line?
 
 ---
 
 Andrew Dunstan wrote:
   
 
 Darcy Buskermolen has drawn my attention to unfortunate behaviour of 
 COPY CSV with fields containing embedded line end chars if the embedded 
 sequence isn't the same as those of the file containing the CSV data. In 
 that case we error out when reading the data in. This means there are 
 cases where we can produce a CSV data file which we can't read in, which 
 is not at all pleasant.
 
 Possible approaches to the problem:
 . make it a documented limitation
 . have a csv read mode for backend/commands/copy.c:CopyReadLine() that 
 relaxes some of the restrictions on inconsistent line endings
 . escape embedded line end chars
 
 The last really isn't an option, because the whole point of CSVs is to 
 play with other programs, and my understanding is that those that 
 understand multiline fields (e.g. Excel) expect them not to be escaped, 
 and do not produce them escaped.
 
 So right now I'm tossing up in my head between the first two options. Or 
 maybe there's another solution I haven't thought of.
 
 Thoughts?
 
 cheers
 
 andrew
 
 ---(end of broadcast)---
 TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
 
 
 
 
   
 
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] multiline CSV fields

2004-11-28 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 OK, what solutions do we have for this?  Not being able to load dumped
 data is a serious bug.

Which we do not have, because pg_dump doesn't use CSV.  I do not think
this is a must-fix, especially not if the proposed fix introduces
inconsistencies elsewhere.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] multiline CSV fields

2004-11-28 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  OK, what solutions do we have for this?  Not being able to load dumped
  data is a serious bug.
 
 Which we do not have, because pg_dump doesn't use CSV.  I do not think
 this is a must-fix, especially not if the proposed fix introduces
 inconsistencies elsewhere.

Sure, pg_dump doesn't use it but COPY should be able to load anything it
output.

Can this be fixed if we ignore the problem with reporting errors?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] multiline CSV fields

2004-11-28 Thread Andrew Dunstan
Bruce Momjian said:
 Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  OK, what solutions do we have for this?  Not being able to load
  dumped data is a serious bug.

 Which we do not have, because pg_dump doesn't use CSV.  I do not think
 this is a must-fix, especially not if the proposed fix introduces
 inconsistencies elsewhere.

 Sure, pg_dump doesn't use it but COPY should be able to load anything
 it output.

 Can this be fixed if we ignore the problem with reporting errors?



When I looked at it I could not see any simple fix that was not worse than
the symptom. If the asymmetry offends you, then we could do as Tom suggested
and rip out the multiline processing completely for now. Personally I would
regard that as a pity, as it would disallow many cases that will work quite
happily as we are, and because this is a feature that was requested when we
did this work. The limitation has been documented - my inclination would be
to revisit this during the 8.1 dev cycle.

cheers

andrew



---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] multiline CSV fields

2004-11-12 Thread Andrew Dunstan
This example should fail on data line 2 or 3 on any platform, 
regardless of the platform's line-end convention, although I haven't 
tested on Windows.

cheers
andrew
[EMAIL PROTECTED] inst]$ bin/psql -e -f csverr.sql ; od -c 
/tmp/csverrtest.csv
create table csverrtest (a int, b text, c int);
CREATE TABLE
insert into csverrtest values(1,'a',1);
INSERT 122471 1
insert into csverrtest values(2,'foo\r\nbar',2);
INSERT 122472 1
insert into csverrtest values(3,'baz\nblurfl',3);
INSERT 122473 1
insert into csverrtest values(4,'d',4);
INSERT 122474 1
insert into csverrtest values(5,'e',5);
INSERT 122475 1
copy csverrtest to '/tmp/csverrtest.csv' csv;
COPY
truncate csverrtest;
TRUNCATE TABLE
copy csverrtest from '/tmp/csverrtest.csv' csv;
psql:cvserr.sql:9: ERROR:  literal carriage return found in data
HINT:  Use \r to represent carriage return.
CONTEXT:  COPY csverrtest, line 2: 2,foo
drop table csverrtest;
DROP TABLE
000   1   ,   a   ,   1  \n   2   ,  f   o   o  \r  \n   b   a
020   r  ,   2  \n   3   ,  b   a   z  \n   b   l   u   r
040   f   l  ,   3  \n   4   ,   d   ,   4  \n   5   ,   e   ,
060   5  \n
062
[EMAIL PROTECTED] inst]$

Bruce Momjian wrote:
Can I see an example of such a failure line?
---
Andrew Dunstan wrote:
 

Darcy Buskermolen has drawn my attention to unfortunate behaviour of 
COPY CSV with fields containing embedded line end chars if the embedded 
sequence isn't the same as those of the file containing the CSV data. In 
that case we error out when reading the data in. This means there are 
cases where we can produce a CSV data file which we can't read in, which 
is not at all pleasant.

Possible approaches to the problem:
. make it a documented limitation
. have a csv read mode for backend/commands/copy.c:CopyReadLine() that 
relaxes some of the restrictions on inconsistent line endings
. escape embedded line end chars

The last really isn't an option, because the whole point of CSVs is to 
play with other programs, and my understanding is that those that 
understand multiline fields (e.g. Excel) expect them not to be escaped, 
and do not produce them escaped.

So right now I'm tossing up in my head between the first two options. Or 
maybe there's another solution I haven't thought of.

Thoughts?
cheers
andrew
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
   

 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] multiline CSV fields

2004-11-12 Thread Patrick B Kelly
On Nov 12, 2004, at 12:20 AM, Tom Lane wrote:
Patrick B Kelly [EMAIL PROTECTED] writes:
I may not be explaining myself well or I may fundamentally
misunderstand how copy works.
Well, you're definitely ignoring the character-set-conversion issue.
I was not trying to ignore the character set and encoding issues but 
perhaps my assumptions are naive or overly optimistic. I realized that 
quotes are not as consistent as the NL characters but I was assuming 
that some encodings would escape to ASCII or a similar encoding like 
JIS Roman that would simplify recognition of the quote character. 
Unicode files make recognizing other punctuation like the quote fairly 
straightforward and to the naive observer, the code in CopyReadLine as 
it is currently written appears to handle multi-byte encodings such as 
SJIS that may present characters below 127 in trailing bytes.

As I said, perhaps I was oversimplifying. Is there a regression test 
set of input files that I could review to see all of the supported 
encodings?

Patrick B. Kelly
--
  http://patrickbkelly.org
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Patrick B Kelly
On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote:
The last really isn't an option, because the whole point of CSVs is to 
play with other programs, and my understanding is that those that 
understand multiline fields (e.g. Excel) expect them not to be 
escaped, and do not produce them escaped.

Actually, when I try to export a sheet with multi-line cells from 
excel, it tells me that this feature is incompatible with the CSV 
format and will not include them in the CSV file.

Patrick B. Kelly
--
  http://patrickbkelly.org
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Andrew Dunstan

Patrick B Kelly wrote:
On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote:
The last really isn't an option, because the whole point of CSVs is 
to play with other programs, and my understanding is that those that 
understand multiline fields (e.g. Excel) expect them not to be 
escaped, and do not produce them escaped.

Actually, when I try to export a sheet with multi-line cells from 
excel, it tells me that this feature is incompatible with the CSV 
format and will not include them in the CSV file.


It probably depends on the version. I have just tested with Excel 2000 
on a WinXP machine and it both read and wrote these files.

cheers
andrew
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Patrick B Kelly wrote:
 Actually, when I try to export a sheet with multi-line cells from 
 excel, it tells me that this feature is incompatible with the CSV 
 format and will not include them in the CSV file.

 It probably depends on the version. I have just tested with Excel 2000 
 on a WinXP machine and it both read and wrote these files.

I'd be inclined to define Excel 2000 as broken, honestly, if it's
writing unescaped newlines as data.  To support this would mean throwing
away most of our ability to detect incorrectly formatted CSV files.
A simple error like a missing close quote would look to the machine like
the rest of the file is a single long data line where all the newlines
are embedded in data fields.  How likely is it that you'll get a useful
error message out of that?  Most likely the error message would point to
the end of the file, or at least someplace well removed from the actual
mistake.

I would vote in favor of removing the current code that attempts to
support unquoted newlines, and waiting to see if there are complaints.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Andrew Dunstan

Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
 

Patrick B Kelly wrote:
   

Actually, when I try to export a sheet with multi-line cells from 
excel, it tells me that this feature is incompatible with the CSV 
format and will not include them in the CSV file.
 

 

It probably depends on the version. I have just tested with Excel 2000 
on a WinXP machine and it both read and wrote these files.
   

I'd be inclined to define Excel 2000 as broken, honestly, if it's
writing unescaped newlines as data.  To support this would mean throwing
away most of our ability to detect incorrectly formatted CSV files.
A simple error like a missing close quote would look to the machine like
the rest of the file is a single long data line where all the newlines
are embedded in data fields.  How likely is it that you'll get a useful
error message out of that?  Most likely the error message would point to
the end of the file, or at least someplace well removed from the actual
mistake.
I would vote in favor of removing the current code that attempts to
support unquoted newlines, and waiting to see if there are complaints.
 

This feature was specifically requested when we discussed what sort of 
CSVs we would handle.

And it does in fact work as long as the newline style is the same.
I just had an idea. How about if we add a new CSV option MULTILINE. If 
absent, then on output we would not output unescaped LF/CR characters 
and on input we would not allow fields with embedded unescaped LF/CR 
characters. In both cases we could error out for now, with perhaps an 
8.1 TODO to provide some other behaviour.

Or we could drop the whole multiline feature for now and make the 
whole thing an 8.1 item, although it would be a bit of a pity when it 
does work in what will surely be the most common case.

cheers
andrew

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 I would vote in favor of removing the current code that attempts to
 support unquoted newlines, and waiting to see if there are complaints.

Uhm. *raises hand*

I agree with your argument but one way or another I have to load these CSVs
I'm given. And like it or not virtually all the CSVs people get are going to
be coming from Excel. So far with 7.4 I've just opened them up in Emacs and
removed the newlines, but it's a royal pain in the arse.

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread David Fetter
On Thu, Nov 11, 2004 at 03:38:16PM -0500, Greg Stark wrote:
 
 Tom Lane [EMAIL PROTECTED] writes:
 
  I would vote in favor of removing the current code that attempts
  to support unquoted newlines, and waiting to see if there are
  complaints.
 
 Uhm. *raises hand*
 
 I agree with your argument but one way or another I have to load
 these CSVs I'm given. And like it or not virtually all the CSVs
 people get are going to be coming from Excel. So far with 7.4 I've
 just opened them up in Emacs and removed the newlines, but it's a
 royal pain in the arse.

Meanwhile, check out dbi-link.  It lets you query against DBI data
sources including DBD::Excel :)

http://pgfoundry.org/projects/dbi-link/

Bug reports welcome.

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Patrick B Kelly
On Nov 11, 2004, at 2:56 PM, Andrew Dunstan wrote:

Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Patrick B Kelly wrote:
Actually, when I try to export a sheet with multi-line cells from 
excel, it tells me that this feature is incompatible with the CSV 
format and will not include them in the CSV file.


It probably depends on the version. I have just tested with Excel 
2000 on a WinXP machine and it both read and wrote these files.

I'd be inclined to define Excel 2000 as broken, honestly, if it's
writing unescaped newlines as data.  To support this would mean 
throwing
away most of our ability to detect incorrectly formatted CSV files.
A simple error like a missing close quote would look to the machine 
like
the rest of the file is a single long data line where all the newlines
are embedded in data fields.  How likely is it that you'll get a 
useful
error message out of that?  Most likely the error message would point 
to
the end of the file, or at least someplace well removed from the 
actual
mistake.

I would vote in favor of removing the current code that attempts to
support unquoted newlines, and waiting to see if there are complaints.

This feature was specifically requested when we discussed what sort of 
CSVs we would handle.

And it does in fact work as long as the newline style is the same.
I just had an idea. How about if we add a new CSV option MULTILINE. If 
absent, then on output we would not output unescaped LF/CR characters 
and on input we would not allow fields with embedded unescaped LF/CR 
characters. In both cases we could error out for now, with perhaps an 
8.1 TODO to provide some other behaviour.

Or we could drop the whole multiline feature for now and make the 
whole thing an 8.1 item, although it would be a bit of a pity when it 
does work in what will surely be the most common case.

What about just coding a FSM into 
backend/commands/copy.c:CopyReadLine() that does not process any flavor 
of NL characters when it is inside of a data field?

Patrick B. Kelly
--
  http://patrickbkelly.org
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Andrew Dunstan

Patrick B Kelly wrote:

What about just coding a FSM into 
backend/commands/copy.c:CopyReadLine() that does not process any 
flavor of NL characters when it is inside of a data field?


It would be a major change - the routine doesn't read data a field at a 
time, and has no idea if we are even in CSV mode at all. It would be 
rather late in the dev cycle to be making such changes, I suspect.

cheers
andrew
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Tom Lane
Patrick B Kelly [EMAIL PROTECTED] writes:
 What about just coding a FSM into 
 backend/commands/copy.c:CopyReadLine() that does not process any flavor 
 of NL characters when it is inside of a data field?

CopyReadLine has no business tracking that.  One reason why not is that
it is dealing with data not yet converted out of the client's encoding,
which makes matching to user-specified quote/escape characters
difficult.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Patrick B Kelly
On Nov 11, 2004, at 6:16 PM, Tom Lane wrote:
Patrick B Kelly [EMAIL PROTECTED] writes:
What about just coding a FSM into
backend/commands/copy.c:CopyReadLine() that does not process any 
flavor
of NL characters when it is inside of a data field?
CopyReadLine has no business tracking that.  One reason why not is that
it is dealing with data not yet converted out of the client's encoding,
which makes matching to user-specified quote/escape characters
difficult.
regards, tom lane


I appreciate what you are saying about the encoding and you are, of 
course, right but CopyReadLine is already processing the NL characters 
and it is doing it without considering the context in which they 
appear. Unfortunately, the same character(s) are used for two different 
purposes in the files in question. Without considering whether they 
appear inside or outside of data fields, CopyReadLine will mistake one 
for the other and cannot correctly do what it is already trying to do 
which is break the input file into lines.

My suggestion is to simply have CopyReadLine recognize these two states 
(in-field and out-of-field) and execute the current logic only while in 
the second state. It would not be too hard but as you mentioned it is 
non-trivial.

Patrick B. Kelly
--
  http://patrickbkelly.org
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Andrew Dunstan

Patrick B Kelly wrote:

My suggestion is to simply have CopyReadLine recognize these two 
states (in-field and out-of-field) and execute the current logic only 
while in the second state. It would not be too hard but as you 
mentioned it is non-trivial.


We don't know what state we expect the end of line to be in until after 
we have actually read the line. To know how to treat the end of line on 
your scheme we would have to parse as we go rather than after reading 
the line as now. Changing this would not only be non-trivial but 
significantly invasive to the code.

cheers
andrew
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Patrick B Kelly
On Nov 11, 2004, at 10:07 PM, Andrew Dunstan wrote:

Patrick B Kelly wrote:

My suggestion is to simply have CopyReadLine recognize these two 
states (in-field and out-of-field) and execute the current logic only 
while in the second state. It would not be too hard but as you 
mentioned it is non-trivial.


We don't know what state we expect the end of line to be in until 
after we have actually read the line. To know how to treat the end of 
line on your scheme we would have to parse as we go rather than after 
reading the line as now. Changing this would not only be 
non-trivial but significantly invasive to the code.


Perhaps I am misunderstanding the code. As I read it the code currently 
goes through the input character by character looking for NL and EOF 
characters. It appears to be very well structured for what I am 
proposing. The section in question is a small and clearly defined loop 
which reads the input one character at a time and decides when it has 
reached the end of the line or file. Each call of CopyReadLine attempts 
to get one more line. I would propose that each time it starts out in 
the out-of-field state and the state is toggled by each un-escaped 
quote that it encounters in the stream. When in the in-field state, it 
would only look for the next un-escaped quote and while in the 
out-of-field state, it would execute the existing logic as well as 
looking for the next un-escaped quote.

I may not be explaining myself well or I may fundamentally 
misunderstand how copy works. I would be happy to code the change and 
send it to you for review, if you would be interested in looking it 
over and it is felt to be a worthwhile capability.


Patrick B. Kelly
--
  http://patrickbkelly.org
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Bruce Momjian

Can I see an example of such a failure line?

---

Andrew Dunstan wrote:
 
 Darcy Buskermolen has drawn my attention to unfortunate behaviour of 
 COPY CSV with fields containing embedded line end chars if the embedded 
 sequence isn't the same as those of the file containing the CSV data. In 
 that case we error out when reading the data in. This means there are 
 cases where we can produce a CSV data file which we can't read in, which 
 is not at all pleasant.
 
 Possible approaches to the problem:
 . make it a documented limitation
 . have a csv read mode for backend/commands/copy.c:CopyReadLine() that 
 relaxes some of the restrictions on inconsistent line endings
 . escape embedded line end chars
 
 The last really isn't an option, because the whole point of CSVs is to 
 play with other programs, and my understanding is that those that 
 understand multiline fields (e.g. Excel) expect them not to be escaped, 
 and do not produce them escaped.
 
 So right now I'm tossing up in my head between the first two options. Or 
 maybe there's another solution I haven't thought of.
 
 Thoughts?
 
 cheers
 
 andrew
 
 ---(end of broadcast)---
 TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] multiline CSV fields

2004-11-11 Thread Tom Lane
Patrick B Kelly [EMAIL PROTECTED] writes:
 I may not be explaining myself well or I may fundamentally 
 misunderstand how copy works.

Well, you're definitely ignoring the character-set-conversion issue.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend