On 28 Jun 2017 at 14:20, Rowan Worth <row...@dug.com> wrote:

> On 27 June 2017 at 18:42, Eric Grange <egra...@glscene.org> wrote:
>
>> So while in theory all the scenarios you describe are interesting, in
>> practice seeing an utf-8 BOM provides an extremely
>> high likeliness that a file will indeed be utf-8. Not always, but a memory
>> chip could also be hit by a cosmic ray.
>>
>> Conversely the absence of an utf-8 BOM means a high probability of
>> "something undetermined": ANSI or BOMless utf-8,
>> or something more oddball (in which I lump utf-16 btw)... and the need for
>> heuristics to kick in.
>>
>
> I think we are largely in agreement here (esp. wrt utf-16 being an oddball
> interchange format).
>
> It doesn't answer my question though, ie. what advantage the BOM tag
> provides compared to assuming utf-8 from the outset. Yes if you see a utf-8
> BOM you have immediate confidence that the data is utf-8 encoded, but what
> have you lost if you start with [fake] confidence and treat the data as
> utf-8 until proven otherwise?

1) Whether the data contained in a file is to be considered UTF-8 or not
is an item of metadata about the file. As such, it has no business being
part of the file itself. BOMs should therefore be deprecated.

2) If I receive data as part of an email, with headers such as:

   Content-type: text/plain; charset="utf-8"
   Content-Transfer-Encoding: base64

then I interpret that to mean that the attendant data, after decoding
from base64, is expected to be utf-8. The sender, however, could be
lying, and this needs to be considered. Just because a header, or file
metadata, or indeed a BOM, says some data or other is legal utf-8, this
does not mean that it actually is.

--
Cheers  --  Tim
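
To illustrate point 2, here is a minimal sketch in Python (not from the original post) of treating a BOM or a charset label as a hint only and checking the bytes themselves. The function name decode_if_really_utf8 and the sample strings are made up for illustration; the only library calls used are the standard base64 and codecs modules and bytes.decode.

```python
import base64
import codecs


def decode_if_really_utf8(raw: bytes):
    """Return (text, had_bom) when raw is well-formed UTF-8, else None.

    A BOM or a charset label is treated only as a hint; the bytes
    themselves are what gets checked.
    """
    had_bom = raw.startswith(codecs.BOM_UTF8)
    if had_bom:
        # Drop the three BOM bytes so they do not end up in the text.
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        # Strict decoding rejects any ill-formed byte sequence.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    return text, had_bom


# A body labelled charset="utf-8" with base64 transfer encoding.
honest = base64.b64encode("café".encode("utf-8"))
# Same label, but the sender actually encoded the text as Latin-1.
lying = base64.b64encode("café".encode("latin-1"))

print(decode_if_really_utf8(base64.b64decode(honest)))
print(decode_if_really_utf8(base64.b64decode(lying)))
```

The first call returns the decoded text, the second returns None because the Latin-1 bytes are not well-formed utf-8. Note that passing this check only shows the bytes are legal utf-8, not that the sender meant them as such: pure ASCII data, for instance, passes regardless of the intended encoding.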