Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Robert Hairgrove Tue, 27 Jun 2017 01:31:30 -0700

On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> On Jun 27, 2017 12:13 AM, "Rowan Worth" <row...@dug.com> wrote:
> 
> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?


Thanks, Scott -- nothing crucial, it is already quite good enough for
99% of use cases.

The Wikipedia page on "Byte Order Marks" appears to be quite
comprehensive and lists about a dozen possible BOM sequences:

https://en.wikipedia.org/wiki/Byte_order_mark

Lacking a BOM, I would certainly try to rule out UTF-8 right away by
searching for invalid UTF-8 characters within a reasonably large
portion of the input (maybe 100-300KB?) before then looking for any
NULL bytes (which are also invalid UTF-8 except as a delimiter) or
other random control characters.

As to having the user specify an encoding when dealing with something
which should be text (CSV files, for example) and processing files
which the user has specified, there is always the possibility that the
encoding is different than what the user says, mainly because they
probably clicked on a spreadsheet file with a similar name instead of
the desired text file. If the user specifies an 8-bit encoding aside
from Unicode, it gets very difficult to trap wrong input unless you
write routines to search for invalid characters (e.g. distinguishing
between true ISO-8859-x and CP1252).
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Reply via email to