On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> On Jun 27, 2017 12:13 AM, "Rowan Worth" <row...@dug.com> wrote:
>
> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?

Thanks, Scott -- nothing crucial, it is already quite good enough for
99% of use cases.

The Wikipedia page on "Byte Order Marks" appears to be quite
comprehensive and lists about a dozen possible BOM sequences:

https://en.wikipedia.org/wiki/Byte_order_mark

Lacking a BOM, I would certainly try to rule out UTF-8 right away by
searching for invalid UTF-8 byte sequences within a reasonably large
portion of the input (maybe 100-300 KB?) before then looking for any
NUL bytes (technically valid UTF-8 as U+0000, but not expected in text
except as a delimiter) or other stray control characters.

As to having the user specify an encoding when dealing with something
which should be text (CSV files, for example): even when processing
files the user has selected, there is always the possibility that the
encoding is different from what the user says, mainly because they
probably clicked on a spreadsheet file with a similar name instead of
the desired text file.

If the user specifies an 8-bit encoding rather than a Unicode one, it
gets very difficult to trap wrong input unless you write routines to
search for invalid characters (e.g. distinguishing between true
ISO-8859-x, where bytes 0x80-0x9F are C1 control codes, and CP1252,
where most of them are printable characters).

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
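For illustration, the detection order discussed in the reply (BOM first, then a strict UTF-8 check over a bounded sample, then a scan for NUL bytes) could be sketched in Python roughly as follows. The function name `guess_encoding`, the 256 KB `SAMPLE_SIZE`, and the labels "binary-or-wide" and "unknown-8bit" are assumptions made for this sketch; none of them come from SQLite or from the thread. Only the common UTF-8/16/32 BOMs are covered, not the full list on the Wikipedia page.

```python
import codecs

# Hypothetical sample size, chosen inside the 100-300 KB range suggested above.
SAMPLE_SIZE = 256 * 1024

# UTF-32 BOMs are tested before UTF-16 ones because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
# The names on the right are Python codecs that consume the BOM when decoding.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best-guess label for the encoding of `data` (a heuristic only)."""
    sample = data[:SAMPLE_SIZE]

    # Step 1: a BOM, if present, settles the question.
    for bom, name in BOMS:
        if sample.startswith(bom):
            return name

    # Step 2: strict UTF-8 validation of the sample. If the sample was cut
    # short, final=False tolerates a multi-byte sequence split at the end.
    truncated = len(data) > SAMPLE_SIZE
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(sample, final=not truncated)
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    # Step 3: NUL bytes pass UTF-8 validation but are suspicious in text;
    # they often indicate BOM-less UTF-16/UTF-32 or a binary file.
    if b"\x00" in sample:
        return "binary-or-wide"

    return "utf-8" if valid_utf8 else "unknown-8bit"
```

A result of "unknown-8bit" only says the input is not valid UTF-8; telling ISO-8859-x from CP1252 would need a further check, such as looking for bytes in the 0x80-0x9F range.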