On 26 June 2017 at 19:03, Eric Grange <egra...@glscene.org> wrote:

> No BOM = you have to fire a whole suite of heuristics or present the user
> with choices he/she will not understand.
>

The need for heuristics to determine the text encoding/codepage exists
regardless of whether a BOM is used, since the problem predates Unicode
altogether. Let's consider the scenarios - as Simon enumerated, we're roughly
interested in the cases where the data stream begins with one of these byte
sequences:

(1) 0x00 0x00 0xFE 0xFF
(2) 0xFF 0xFE 0x00 0x00
(3) 0xFE 0xFF
(4) 0xFF 0xFE
(5) 0xEF 0xBB 0xBF
(6) anything else (and the data stream is ASCII or UTF-8)
(7) anything else (and the data stream is in some random codepage)

In case 7 we have little choice but to invoke heuristics or defer to the
user, yes? For the first 5 cases we can immediately deduce some facts:

(1) -> almost certainly UTF-32BE, although if NUL characters may be present,
8-bit codepages are still candidates
(2) -> almost certainly UTF-32LE, although if NUL characters may be present,
UTF-16LE and 8-bit codepages are still candidates
(3) -> likely UTF-16BE, but could be some 8-bit codepage
(4) -> likely UTF-16LE, but could be some 8-bit codepage
(5) -> almost certainly UTF-8, but could be some 8-bit codepage

I observe that a BOM never provides perfect confidence regarding the
encoding, although in practice I'd expect it to fail only on data
specifically designed to fool it.

I also suggest that the checks ought to be performed in the order listed,
to avoid categorising UTF-32LE text as UTF-16LE, and because the first 4
cases rule out [valid] UTF-8 data.
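
To make that order concrete, here is a minimal BOM-sniffing sketch in C
(purely illustrative - the Encoding enum and function name are mine, not
anything from SQLite or another library). It checks the five signatures in
the order listed, so the 4-byte forms are tested before the 2- and 3-byte
ones:

/* Illustrative BOM sniffer: checks signatures in the order (1)-(5) above,
** so UTF-32LE text is never misread as UTF-16LE. */
#include <stddef.h>
#include <string.h>

typedef enum {
  ENC_UNKNOWN,                  /* cases (6)/(7): no BOM found */
  ENC_UTF32BE, ENC_UTF32LE,
  ENC_UTF16BE, ENC_UTF16LE,
  ENC_UTF8
} Encoding;

static Encoding sniff_bom(const unsigned char *p, size_t n){
  if( n>=4 && memcmp(p, "\x00\x00\xFE\xFF", 4)==0 ) return ENC_UTF32BE; /* (1) */
  if( n>=4 && memcmp(p, "\xFF\xFE\x00\x00", 4)==0 ) return ENC_UTF32LE; /* (2) */
  if( n>=2 && memcmp(p, "\xFE\xFF", 2)==0 )         return ENC_UTF16BE; /* (3) */
  if( n>=2 && memcmp(p, "\xFF\xFE", 2)==0 )         return ENC_UTF16LE; /* (4) */
  if( n>=3 && memcmp(p, "\xEF\xBB\xBF", 3)==0 )     return ENC_UTF8;    /* (5) */
  return ENC_UNKNOWN;                                                   /* (6)/(7) */
}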


Now let's say we assume that all text is UTF-8 until proven otherwise. In
case 6 we get lucky and everything works, and in case 7 we find invalid byte
sequences and fall back to heuristics or the user to identify the encoding.
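
The "proven otherwise" part is cheap to check. Here is a rough sketch of a
UTF-8 well-formedness test (again my own illustration; note it does not
reject overlong encodings or surrogate code points, which a strict validator
would):

#include <stddef.h>

/* Returns 1 if the buffer decodes as UTF-8 (plain ASCII qualifies
** trivially), 0 if we hit an invalid sequence and should fall back to
** heuristics or ask the user. */
static int looks_like_utf8(const unsigned char *p, size_t n){
  size_t i = 0;
  while( i<n ){
    unsigned char c = p[i];
    size_t len, k;
    if( c<0x80 )              len = 1;  /* ASCII byte */
    else if( (c&0xE0)==0xC0 ) len = 2;  /* 2-byte sequence */
    else if( (c&0xF0)==0xE0 ) len = 3;  /* 3-byte sequence */
    else if( (c&0xF8)==0xF0 ) len = 4;  /* 4-byte sequence */
    else return 0;                      /* invalid lead byte */
    if( i+len>n ) return 0;             /* truncated at end of buffer */
    for(k=1; k<len; k++){
      if( (p[i+k]&0xC0)!=0x80 ) return 0;  /* bad continuation byte */
    }
    i += len;
  }
  return 1;
}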

In fact, using this assumption, we could dispense with the BOM entirely for
UTF-8 and drop case 5 from the list. So my question is: what advantage does
a BOM offer for UTF-8? What other cases can we identify with the
information it provides?

If you were going to jump straight from case 5 to case 7 in the absence of a
BOM, it seems like you might as well give UTF-8 a try, since it and ASCII are
far and away the most common cases.
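
Putting those two sketches together gives the flow I'm arguing for - trust a
BOM if there is one, otherwise try UTF-8, otherwise punt:

/* Overall flow, reusing the illustrative sniff_bom() and looks_like_utf8()
** sketches above. */
static Encoding detect_encoding(const unsigned char *p, size_t n){
  Encoding e = sniff_bom(p, n);
  if( e!=ENC_UNKNOWN ) return e;                /* cases (1)-(5) */
  if( looks_like_utf8(p, n) ) return ENC_UTF8;  /* case (6): ASCII or UTF-8 */
  return ENC_UNKNOWN;                           /* case (7): heuristics or ask */
}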


I'm sure I've simplified things with this description - have I missed
something crucial? Is the BOM argument about future proofing? Are we
worried about EBCDIC? Is my perspective too anglo-centric?

> After 20 years, the choice is between doing the best in an imperfect world,
> or perpetuating the issue and blaming others.
>

As I see it, by being scalable and general enough to represent all desired
characters, UTF-8 is not perpetuating any issues but rather offering a way
out of the historic codepage woes (by adopting it as the go-to interchange
format).

As Peter da Silva said:
> It’s not the UTF-8 storage that’s the mess, it’s the non-UTF-8 storage.

-Rowan