Re: [sqlite] papercut wish list : a PRAGMA encoding='UTF-8-SIG'

Scott Robison Wed, 03 Sep 2014 15:50:10 -0700

On Wed, Sep 3, 2014 at 3:20 PM, Stephan Beal <[email protected]> wrote:

> On Wed, Sep 3, 2014 at 11:13 PM, jose isaias cabrera <
> [email protected]> wrote:
> > PHP should handle the encoding whether or not it has the BOM.
> >
>
> As "should" Excel!
>
> Unlike Excel, with PHP the fix is easy - remove the BOM, which is simple
> once you have a program which lets you know it's there (it's hidden in many
> editors).
>
> My point is only - adding a BOM is not a viable solution: it's a
> deprecated/discouraged/worst-practice because so many tools don't deal well
> with them.
>

The problem is that the recommendations have varied over time (do use it,
don't use it). Win32 was supporting good old 2-byte-only Unicode back
before there was a UTF-8 (or indeed before there was a UTF-16). The name
"byte order mark" is a misnomer (though mostly accurate). The BOM is really
a "zero width no-break space" (ZWNBSP, U+FEFF), which is "harmless" at the
beginning of a file and thus useful (if used) as a signature to identify
the encoding of a file when other metadata / encoding information is not
available. But POSIX systems "wimped out" and basically never embraced the
Unicode specification as originally written, writing their own encoding, a
File System Safe Unicode Transformation Format (FSS-UTF), which later
became UTF-8.

Don't get me wrong, I like UTF-8, preferring it to UTF-16. But using
ZWNBSP as a signature is useful for all UTF encodings when parties agree
to use it as such. When the only Unicode encoding was UCS-2, it was only
useful as a signature to determine the byte order of the originating
system. After UTF-8 was introduced, it became useful in its own right to
identify a byte-oriented encoding. Then when UCS-2 was effectively dropped
and UTF-16 replaced it (so that the code point space could be extended
from one 16-bit plane to 17 16-bit planes), and UTF-32 was introduced, it
became even more useful.
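
To make the "signature" idea concrete, here is a minimal sketch in Python
of sniffing a leading U+FEFF to pick an encoding. The names sniff_bom and
decode_with_signature are mine, purely for illustration, and the UTF-8
fallback for unsigned data is an assumption, not something any standard
mandates.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no signature is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_signature(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the signature if present; otherwise use the fallback.

    The fallback is an assumption made by the caller, i.e. a guess.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

Even this is not airtight: UTF-16 LE text that starts with a BOM followed
by U+0000 has the same first four bytes as the UTF-32 LE signature, so the
sniffing above would misidentify it.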

In any case: if we can't get all operating systems to agree on what code
sequence marks the end of a line of text (CR only, LF only, CR LF,
something else entirely), I don't expect we'll get agreement on the
"proper" way to construct Unicode-centric text files any time soon. Both
approaches have pros and cons, though I would maintain that external
metadata that unambiguously identifies the text encoding is by far the best
option, far preferable to guessing, no matter how high the confidence
that the guess is correct.

-- 
Scott Robison
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
