Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-27 Thread Peter Eisentraut
On mån, 2011-09-26 at 21:49 +0300, Peter Eisentraut wrote: If I store a BOM in row 1, column 1 of my table, because, well, maybe it's an XML document or something, then it needs to be able to survive a copy out and in. The only way we could proceed with this would be if we prohibited BOMs in

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-27 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes: Alternative consideration: We could allow this in CSV format if we made users quote the first value if it starts with a BOM. This might be a reasonable way to get MS compatibility. I don't think we can get away with a retroactive restriction on the

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread David E. Wheeler
On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote: I'd like to support UTF-8 text or CSV files that have a BOM (byte order mark) in the COPY FROM command. The BOM will be automatically detected and ignored if the file encoding is UTF-8. WIP patch attached. By my reading of

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Magnus Hagander
On Mon, Sep 26, 2011 at 06:58, Itagaki Takahiro itagaki.takah...@gmail.com wrote: Hi, I'd like to support UTF-8 text or CSV files that have a BOM (byte order mark) in the COPY FROM command. The BOM will be automatically detected and ignored if the file encoding is UTF-8. WIP patch attached. I'm

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Itagaki Takahiro
On Mon, Sep 26, 2011 at 20:12, Magnus Hagander mag...@hagander.net wrote: I like it in general. But if we're looking at the BOM, shouldn't we also look and *reject* the file if it's a BOM for a non-UTF8 file? Say if the BOM claims it's UTF16? -1 because we're depending on manual configuration

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Magnus Hagander
On Mon, Sep 26, 2011 at 13:36, Itagaki Takahiro itagaki.takah...@gmail.com wrote: On Mon, Sep 26, 2011 at 20:12, Magnus Hagander mag...@hagander.net wrote: I like it in general. But if we're looking at the BOM, shouldn't we also look and *reject* the file if it's a BOM for a non-UTF8 file? Say

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Andrew Dunstan
On 09/26/2011 07:12 AM, Magnus Hagander wrote: On Mon, Sep 26, 2011 at 06:58, Itagaki Takahiro itagaki.takah...@gmail.com wrote: Hi, I'd like to support UTF-8 text or CSV files that have a BOM (byte order mark) in the COPY FROM command. The BOM will be automatically detected and ignored if the file

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes: On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote: I'm thinking about only COPY FROM for reads, but if someone wants to add a BOM in COPY TO, we might also support COPY TO WITH BOM

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Tatsuo Ishii
I'd like to support UTF-8 text or CSV files that have a BOM (byte order mark) in the COPY FROM command. The BOM will be automatically detected and ignored if the file encoding is UTF-8. WIP patch attached. From RFC 3629 (http://tools.ietf.org/html/rfc3629#section-6): o A protocol SHOULD forbid use of

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Tatsuo Ishii
David E. Wheeler da...@kineticode.com writes: On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote: I'm thinking about only COPY FROM for reads, but if someone wants to add a BOM in COPY TO, we might also support COPY TO WITH

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Robert Haas
On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii is...@postgresql.org wrote: David E. Wheeler da...@kineticode.com writes: On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote: I'm thinking about only COPY FROM for reads, but if

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Robert Haas
On Mon, Sep 26, 2011 at 1:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii is...@postgresql.org wrote: Suppose a user uses a brain-dead editor, which does not accept UTF-8 without a BOM. Maybe this needs to be an

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii is...@postgresql.org wrote: Suppose a user uses a brain-dead editor, which does not accept UTF-8 without a BOM. Maybe this needs to be an optional behavior, controlled by some COPY option. I'm not excited

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: The thing that makes me doubt that is this comment from Tatsuo Ishii: TI COPY explicitly specifies the encoding (to be UTF-8 in this case). So TI I think we should not regard U+FEFF as BOM in COPY, rather we should TI regard U+FEFF as ZERO WIDTH

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Robert Haas
On Mon, Sep 26, 2011 at 1:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: The thing that makes me doubt that is this comment from Tatsuo Ishii: TI COPY explicitly specifies the encoding (to be UTF-8 in this case).  So TI I think we should not regard U+FEFF

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: On Mon, Sep 26, 2011 at 1:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: The thing that makes me doubt that is this comment from Tatsuo Ishii: TI COPY explicitly specifies the encoding (to be UTF-8 in this case).  

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Peter Eisentraut
On tis, 2011-09-27 at 00:09 +0900, Tatsuo Ishii wrote: Suppose a user uses a brain-dead editor, which does not accept UTF-8 without a BOM. I would first like to see evidence that such an editor exists. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Peter Eisentraut
On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote: The thing that makes me doubt that is this comment from Tatsuo Ishii: TI COPY explicitly specifies the encoding (to be UTF-8 in this case). So TI I think we should not regard U+FEFF as BOM in COPY, rather we should TI regard U+FEFF as

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Robert Haas
On Mon, Sep 26, 2011 at 2:38 PM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote: The thing that makes me doubt that is this comment from Tatsuo Ishii: TI COPY explicitly specifies the encoding (to be UTF-8 in this case). So TI I think we should

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Andrew Dunstan
On 09/26/2011 02:38 PM, Peter Eisentraut wrote: On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote: The thing that makes me doubt that is this comment from Tatsuo Ishii: TI COPY explicitly specifies the encoding (to be UTF-8 in this case). So TI I think we should not regard U+FEFF as BOM

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Peter Eisentraut
On mån, 2011-09-26 at 14:44 -0400, Robert Haas wrote: We did recently accept a patch for psql -f to skip over a UTF-8 byte-order mark. We had a lot of this same discussion there. But that case is different, because zero-width, non-breaking space has no particular meaning in an SQL script

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Brar Piening
Tom Lane wrote: Putting a BOM into UTF8 data is flat out invalid per spec --- the fact that Microsloth does it does not make it standards-conformant. Could you share a pointer to the spec? All I've ever heard is that a BOM is optional for UTF-8 but not forbidden. The Unicode FAQ

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Brar Piening
Tom Lane wrote: Yeah, that's a reasonable argument for rejecting the patch altogether. I'm not qualified to decide whether it outweighs the "we need to be able to read Notepad output" argument. Actually it's not only Notepad. I quite often find myself doing something like the following when

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Brar Piening
Robert Haas wrote: The thing that makes me doubt that is this comment from Tatsuo Ishii: TI COPY explicitly specifies the encoding (to be UTF-8 in this case). So TI I think we should not regard U+FEFF as BOM in COPY, rather we should TI regard U+FEFF as ZERO WIDTH NO-BREAK SPACE. If a BOM

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Tom Lane
Brar Piening b...@gmx.de writes: Citing from the Unicode FAQ again: Q: Where is a BOM useful? A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format—it can also serve as a hint indicating that the

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Brar Piening
Tom Lane wrote: Note that the reference to byte order betrays the implicit context assumption: that we're talking about UTF16 or UTF32 representation. Note that there is no implicit context assumption in the Unicode FAQ. It's equally covering UTF-8, UTF-16 and UTF-32. Another quote: Q: Can a

Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-26 Thread Brar Piening
Brar Piening wrote: It's a pity that the Unicode standard actually allows something that can cause problems, but blaming the non-platform again doesn't solve the existing issues. To put it in a more humorous but actually correct way: M$ has found a standard-conforming way of preventing users

[HACKERS] Support UTF-8 files with BOM in COPY FROM

2011-09-25 Thread Itagaki Takahiro
Hi, I'd like to support UTF-8 text or CSV files that have a BOM (byte order mark) in the COPY FROM command. The BOM will be automatically detected and ignored if the file encoding is UTF-8. WIP patch attached. I'm thinking about only COPY FROM for reads, but if someone wants to add a BOM in COPY TO, we might
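
For readers following the thread, here is a minimal sketch of the kind of check being discussed: skip a leading UTF-8 BOM, and recognize BOMs that claim a different encoding (the case Magnus suggests rejecting). This is not the WIP patch, which is C code inside the backend; the function names strip_utf8_bom and classify_bom are hypothetical and only illustrate the idea using Python's standard codecs constants.

```python
import codecs

# Known BOM byte sequences. The UTF-32 entries come before UTF-16 because
# the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = (
    ("utf-8", codecs.BOM_UTF8),
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
)


def classify_bom(data: bytes) -> str:
    """Return the encoding a leading BOM claims, or "none" (hypothetical helper)."""
    for name, bom in _BOMS:
        if data.startswith(bom):
            return name
    return "none"


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"id,name\n1,alice\n"
    print(classify_bom(sample))
    print(strip_utf8_bom(sample))
```

Note that stripping is unconditional here: a first field that legitimately begins with U+FEFF is indistinguishable from a BOM, which is exactly the round-trip concern raised elsewhere in the thread.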