On 26 June 2017 at 19:03, Eric Grange <egra...@glscene.org> wrote:
> No BOM = you have to fire a whole suite of heuristics or present the user
> with choices he/she will not understand.
Requiring heuristics to determine the text encoding/codepage exists regardless of whether a BOM is used, since the problem predates Unicode altogether.

Let's consider the scenarios. As Simon enumerated, we're roughly interested in the cases where the data stream begins with the byte sequences:

(1) 0x00 0x00 0xFE 0xFF
(2) 0xFF 0xFE 0x00 0x00
(3) 0xFE 0xFF
(4) 0xFF 0xFE
(5) 0xEF 0xBB 0xBF
(6) anything else (and the data stream is ASCII or UTF-8)
(7) anything else (and the data stream is some random codepage)

In case 7 we have little choice but to invoke heuristics or defer to the user, yes? For the first 5 cases we can immediately deduce some facts:

(1) -> almost certainly UTF-32BE, although if NUL characters may be present, 8-bit codepages are still candidates
(2) -> almost certainly UTF-32LE, although if NUL characters may be present, UTF-16LE and 8-bit codepages are still candidates
(3) -> likely UTF-16BE, but could be some other 8-bit codepage
(4) -> likely UTF-16LE, but could be some other 8-bit codepage
(5) -> almost certainly UTF-8, but could be some other 8-bit codepage

I observe that a BOM never provides perfect confidence regarding the encoding, although in practice I expect it would only fail on data specifically designed to fool it. I also suggest that the checks ought to be performed in the order listed, to avoid categorising UTF-32LE text as UTF-16LE, and because the first 4 cases rule out [valid] UTF-8 data. (A rough sketch of this detection order is at the end of this message.)

Now let's say we make the assumption that all text is UTF-8 until proven otherwise. In case 6 we get lucky and everything works, and in case 7 we find invalid byte sequences and fall back to heuristics or the user to identify the encoding. In fact, using this assumption we could dispense with the BOM entirely for UTF-8 and drop case 5 from the list.

So my question is: what advantage does a BOM offer for UTF-8? What other cases can we identify with the information it provides? If you were going to jump straight from case 5 to case 7 in the absence of a BOM, it seems like you might as well give UTF-8 a try first, since it and ASCII are far and away the common case.

I'm sure I've simplified things with this description - have I missed something crucial? Is the BOM argument about future-proofing? Are we worried about EBCDIC? Is my perspective too anglo-centric?

> After 20 years, the choice is between doing the best in an imperfect world,
> or perpetuating the issue and blaming others.

By being scalable and general enough to represent all desired characters, UTF-8 as I see it is not perpetuating any issues but rather offering a way out of the historic codepage woes (by adopting it as the go-to interchange format).

As Peter da Silva said:
> It’s not the UTF-8 storage that’s the mess, it’s the non-UTF-8 storage.

-Rowan
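
P.S. For concreteness, here is a rough sketch of the detection order I have in mind - plain illustrative C, the function and enum names are made up and nothing here is taken from SQLite's actual sources. The point is simply to test the longer UTF-32 signatures before the UTF-16 ones (so 0xFF 0xFE 0x00 0x00 isn't misread as a UTF-16LE BOM), then the UTF-8 BOM, and otherwise assume UTF-8 until proven otherwise:

    /* Hypothetical BOM sniffer for the cases (1)-(7) above. */
    #include <stddef.h>

    typedef enum {
      ENC_UTF32BE,      /* case 1 */
      ENC_UTF32LE,      /* case 2 */
      ENC_UTF16BE,      /* case 3 */
      ENC_UTF16LE,      /* case 4 */
      ENC_UTF8,         /* case 5: UTF-8 BOM present */
      ENC_ASSUME_UTF8   /* cases 6/7: no BOM, try UTF-8 first */
    } Encoding;

    static Encoding sniffEncoding(const unsigned char *p, size_t n){
      /* Check the 4-byte UTF-32 signatures before the 2-byte UTF-16 ones. */
      if( n>=4 && p[0]==0x00 && p[1]==0x00 && p[2]==0xFE && p[3]==0xFF )
        return ENC_UTF32BE;
      if( n>=4 && p[0]==0xFF && p[1]==0xFE && p[2]==0x00 && p[3]==0x00 )
        return ENC_UTF32LE;
      if( n>=2 && p[0]==0xFE && p[1]==0xFF )
        return ENC_UTF16BE;
      if( n>=2 && p[0]==0xFF && p[1]==0xFE )
        return ENC_UTF16LE;
      if( n>=3 && p[0]==0xEF && p[1]==0xBB && p[2]==0xBF )
        return ENC_UTF8;
      /* No BOM: assume UTF-8 and only fall back to heuristics or the
      ** user (case 7) if invalid UTF-8 sequences turn up later. */
      return ENC_ASSUME_UTF8;
    }

A caller getting ENC_ASSUME_UTF8 would simply attempt to decode as UTF-8 and drop to heuristics/user choice on the first invalid sequence, which is the "dispense with the BOM for UTF-8" path described above.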