Re: [sqlite] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Tim Streater Wed, 28 Jun 2017 13:41:36 -0700

On 28 Jun 2017 at 14:20, Rowan Worth <row...@dug.com> wrote:

> On 27 June 2017 at 18:42, Eric Grange <egra...@glscene.org> wrote:
>
>> So while in theory all the scenarios you describe are interesting, in
>> practice seeing an utf-8 BOM provides an extremely
>> high likeliness that a file will indeed be utf-8. Not always, but a memory
>> chip could also be hit by a cosmic ray.
>>
>> Conversely the absence of an utf-8 BOM means a high probability of
>> "something undetermined": ANSI or BOMless utf-8,
>> or something more oddball (in which I lump utf-16 btw)... and the need for
>> heuristics to kick in.
>>
>
> I think we are largely in agreement here (esp. wrt utf-16 being an oddball
> interchange format).
>
> It doesn't answer my question though, ie. what advantage the BOM tag
> provides compared to assuming utf-8 from the outset. Yes if you see a utf-8
> BOM you have immediate confidence that the data is utf-8 encoded, but what
> have you lost if you start with [fake] confidence and treat the data as
> utf-8 until proven otherwise?


1) Whether the data contained in a file is to be considered UTF-8 or not is an 
item of metadata about the file. As such, it has no business being part of the 
file itself. BOMs should therefore be deprecated.

2) I may receive data as part of an email, with a header such as:

           Content-type: text/plain; charset="utf-8"
           Content-Transfer-Encoding: base64

then I interpret that to mean that the attendant data, after decoding from 
base64, is expected to be utf-8. The sender, however, could be lying, 
and this needs to be considered. Just because a header, or file metadata, or 
indeed a BOM, says some data or other is legal utf-8, this does not mean that 
it actually is.
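
To make point 2 concrete, here is a minimal sketch in Python of treating the declared charset as a claim to be checked rather than a fact. The function name decode_declared_utf8 and the Latin-1 fallback are my own assumptions for illustration; a real mail client might prefer a proper charset-detection heuristic at that step.

```python
import base64
import codecs


def decode_declared_utf8(payload_b64):
    """Decode a base64 body that is labelled utf-8 and verify the label.

    Returns (text, ok). ok is False when the bytes are not legal UTF-8.
    Hypothetical helper, not part of any library.
    """
    raw = base64.b64decode(payload_b64)

    # A leading UTF-8 BOM is only one more claim; drop it and check anyway.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]

    try:
        # Strict decoding raises on any byte sequence that is not legal UTF-8.
        return raw.decode("utf-8", errors="strict"), True
    except UnicodeDecodeError:
        # The label lied. Latin-1 maps every byte to a character, so this
        # fallback never fails (assumed policy, chosen only for the sketch).
        return raw.decode("latin-1"), False
```

The caller gets the text either way, plus a flag saying whether the header (or BOM) told the truth, so the decision about what to do with mislabelled data stays with the application.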


--
Cheers  --  Tim
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
