Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On 29 Jun 2017, at 20:19, Peter da Silva wrote:

> The DECsystem 10 guys also referred to the other subdivisions of their 36-bit words as bytes; sometimes they could be 6, 7, 8, or 9 bits long. I think they had special instructions for operating on them, but they weren’t directly addressable.

A byte could be 1..36 bits long. The special instructions used a data structure called a byte pointer to reference the field within a word where the byte was to be placed or retrieved. Four different formats of byte pointer existed, not all supporting the full range of possible byte sizes. One of these days, when I really have too much free time, I must run up a VM with the Panda TOPS-20 distro and find some examples of interesting byte sizes which were actually used for something. 8-)

/Niall

___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
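The load/deposit behaviour of those byte instructions can be sketched as a plain (position, size) field extraction within a 36-bit word. This is a simplified illustrative model, not the actual PDP-10 byte-pointer encoding or instruction semantics; the function names are made up:

```python
WORD_BITS = 36

def load_byte(word, pos, size):
    """Extract a 'byte' of `size` bits whose low end sits `pos` bits
    from the right of a 36-bit word (simplified byte-pointer model)."""
    return (word >> pos) & ((1 << size) - 1)

def deposit_byte(word, pos, size, value):
    """Store `value` into the same field, leaving the rest of the word intact."""
    mask = ((1 << size) - 1) << pos
    return (word & ~mask) | ((value << pos) & mask)

# A 36-bit word, written in octal as a PDP-10 programmer would:
w = 0o123456701234
assert 0 <= w < (1 << WORD_BITS)
# Extracting a field and depositing it back leaves the word unchanged.
assert deposit_byte(w, 6, 6, load_byte(w, 6, 6)) == w
```

Any size from 1 to 36 works here, which matches the "1..36 bits" range described above; the real machine's four byte-pointer formats constrained which combinations were expressible.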
I always saw byte as something that was relevant for systems that could address objects smaller than words... “byte addressed” machines. The term was mnemonic for something bigger than a bit and smaller than a word. It was usually 8 bits, but there were 36-bit machines that were byte-addressable 9 bits at a time.

The DECsystem 10 guys also referred to the other subdivisions of their 36-bit words as bytes; sometimes they could be 6, 7, 8, or 9 bits long. I think they had special instructions for operating on them, but they weren’t directly addressable.

There was also a “nibble”, smaller than a “byte”, which was always 4 bits (one hex digit). I don’t think any of the octal people used the word for their three-bit digits.
On 29 Jun 2017, at 8:01pm, Warren Young wrote:

> We wouldn’t have needed the term “octet” if “byte” always meant “8 bits”.

The terms "octet" and "decade" were prevalent across Europe in the 1970s. Certainly we used them both when I was first learning about computers. My impression back then was that the term "byte" was the American word for "octet". The web in general seems to agree with you, not me. It seems that a word was made of bytes, and bytes were made of nybbles, and nybbles were made of bits, and that how many of which went into what depended on which platform you were talking about. This contradicts my computing teacher, who taught me the "by eight" definition of "byte".

Simon.
On Jun 29, 2017, at 11:18 AM, Simon Slavin wrote:

> On 29 Jun 2017, at 5:39pm, Warren Young wrote:
>
>> Before roughly the mid 1970s, the size of a byte was whatever the computer or communications system designer said it was.
>
> You mean the size of a word.

That, too. Again I give the example of a 12-bit PDP-8 storing 6-bit packed ASCII text. The word size is 12, and the byte size is 6. The same machine could instead store 7-bit ASCII from the ASR-33 in its 12-bit words, and we could then speak of 7-bit bytes and 12-bit words. This, too, was a thing in the PDP-8 world, though rarer, since the core memory field size was 4k words, and the base machine config only had the one field, so 5 wasted bits per character was a painful hit.

> The word "byte" means "by eight”.

I failed to find that in an English corpus search.[1] A search for “by eight” turns up hundreds of results (apparently limited to 600 by the search engine), but none of the matches is near “byte.” A search for “by-eight” turns up only one result, also irrelevant. I suspect the earliest print reference to that definition would be much later than the actual coinage of the word in 1956 by Werner Buchholz, making it a back-formation. I’d expect to find that definition in print only after the microcomputer revolution that nailed the 8-bit byte into place. Further counter-citations:

https://stackoverflow.com/questions/13615764/
https://en.wikipedia.org/wiki/Byte#History
https://en.wikipedia.org/wiki/Talk:Byte#Byte_.3D_By-Eight.3F
https://english.stackexchange.com/questions/121127/etymology-of-byte

I wish I could find a copy of Buchholz, W., January 1981: "Origin of the Word 'Byte,'" IEEE Annals of the History of Computing, 3, 1: p. 72, that is not behind a paywall, as Buchholz is the man who coined the word for the IBM 7030 “Stretch,” which had a variable byte size. It used 8-bit bytes for I/O, but it had variable-width bytes internally. We wouldn’t have needed the term “octet” if “byte” always meant “8 bits”.

[1]: http://corpus.byu.edu/coca/

> With each bit of storage costing around 100,000 times what they do now

A bit of trivia I dropped during editing from the prior post: a 5 MB RK05 disk drive cost about the same as a luxury car. (About US $40,000 today after CPI adjustment.) Cadillac with all the options or RK05? Let me think… RK05!
On Thu, Jun 29, 2017 at 12:18 PM, Simon Slavin wrote:

> A couple of minor comments.
>
> On 29 Jun 2017, at 5:39pm, Warren Young wrote:
>
>> Before roughly the mid 1970s, the size of a byte was whatever the computer or communications system designer said it was.
>
> You mean the size of a word. The word "byte" means "by eight". It did not always mean 7 bits of data and one parity bit, but it was always 8 bits in total.
>
>> A common example would be a Teletype Model 33 ASR hardwired by DEC for transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity
>
> Thank you for mentioning that. First computer terminal I ever used. I think I still have some of the paper tape somewhere.
>
>> The 8-bit byte standard — and its even multiples — is relatively recent in computing history. You can point to early examples like the 32-bit IBM 360 and later ones like the 16-bit Data General Nova and DEC PDP-11, but I believe it was the flood of 8-bit microcomputers in the mid to late 1970s that finally and firmly associated “byte” with “8 bits”.
>
> Again, the word you want is "word". There were architectures with all sorts of weird word sizes. "Byte" always meant "by eight" and was a synonym for "octet".
>
> As Warren wrote, words did not always encode text as 8 bits per character. Computers with 16-bit word sizes might encode ASCII as three 5-bit characters plus a parity bit, or use two 16-bit words for five 6-bit characters plus 2 meta-bits. With each bit of storage costing around 100,000 times what it does now, and taking 10,000 times as long to move across your communications network, there was a wide variety of ingenious ways to save a bit here and a bit there.
>
> Simon.

In today's world, you are completely correct. However, according to Wikipedia (https://en.wikipedia.org/wiki/Byte_addressing), there was at least one machine (Honeywell) which had a 36-bit word which was divided into 9-bit "bytes" (i.e. an address pointed to a 9-bit "byte").

-- Veni, Vidi, VISA: I came, I saw, I did a little shopping. Maranatha! <>< John McKown
On 29 Jun 2017, at 6:18pm, Simon Slavin wrote:

> Computers with 16-bit word sizes might encode ASCII as three 5-bit characters

Where I wrote "ASCII" I should have written "text".

Simon.
A couple of minor comments.

On 29 Jun 2017, at 5:39pm, Warren Young wrote:

> Before roughly the mid 1970s, the size of a byte was whatever the computer or communications system designer said it was.

You mean the size of a word. The word "byte" means "by eight". It did not always mean 7 bits of data and one parity bit, but it was always 8 bits in total.

> A common example would be a Teletype Model 33 ASR hardwired by DEC for transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity

Thank you for mentioning that. First computer terminal I ever used. I think I still have some of the paper tape somewhere.

> The 8-bit byte standard — and its even multiples — is relatively recent in computing history. You can point to early examples like the 32-bit IBM 360 and later ones like the 16-bit Data General Nova and DEC PDP-11, but I believe it was the flood of 8-bit microcomputers in the mid to late 1970s that finally and firmly associated “byte” with “8 bits”.

Again, the word you want is "word". There were architectures with all sorts of weird word sizes. "Byte" always meant "by eight" and was a synonym for "octet".

As Warren wrote, words did not always encode text as 8 bits per character. Computers with 16-bit word sizes might encode ASCII as three 5-bit characters plus a parity bit, or use two 16-bit words for five 6-bit characters plus 2 meta-bits. With each bit of storage costing around 100,000 times what it does now, and taking 10,000 times as long to move across your communications network, there was a wide variety of ingenious ways to save a bit here and a bit there.

Simon.
On Jun 27, 2017, at 3:02 PM, Keith Medcalf wrote:

>> The whole point of specifying a format as 7 bits is that the 8th bit is ignored, or perhaps used in an implementation-defined manner, regardless of whether the 8th bit in a char is available or not.
>
> ASCII was designed back in the days of low reliability serial communications -- you know, back when data was sent using 7 bit data + 1 parity bit + 2 stop bits -- to increase the reliability of the communications. A "byte" was also 9 bits: 8 bits of data and a parity bit.

Before roughly the mid 1970s, the size of a byte was whatever the computer or communications system designer said it was. Even within a single computer + serial comm system, the definitions could differ. For this reason, we also have the term “octet,” which unambiguously means an 8-bit unit of data.

The 9-bit byte is largely a DEC-ism, since their pre-PDP-11 machines used a word size that was an integer multiple of 6 or 12. DEC had 12-bit machines, 18-bit machines, and 36-bit machines. There was even a plan for a 24-bit design at one point.

A common example would be a Teletype Model 33 ASR hardwired by DEC for transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity, fed by a 12-bit PDP-8 pulling that text off an RK05 cartridge disk from a file encoded in a 6-bit packed ASCII format. 6-bit packed ASCII schemes were common at the time: to efficiently store plain text in the native 12-, 18-, or 36-bit words, programmers would drop most of the control characters and punctuation, as well as either dropping or shift-encoding lowercase. That isn’t an innovation from the DEC world, either: Émile Baudot came up with basically the same idea in his eponymous 5-bit telegraph code in 1870. You could well say that Baudot code uses 5-bit bytes. (This is also where the data communications unit “baud” comes from.)

The 8-bit byte standard — and its even multiples — is relatively recent in computing history. You can point to early examples like the 32-bit IBM 360 and later ones like the 16-bit Data General Nova and DEC PDP-11, but I believe it was the flood of 8-bit microcomputers in the mid to late 1970s that finally and firmly associated “byte” with “8 bits”.

> Nowadays we use 8 bits for data with no parity

True parity bits (as opposed to mark/space parity) can only detect a 1-bit error. We dropped parity checks when the data rates rose and SNR levels fell to the point that single-bit errors were a frequent occurrence, making parity checks practically useless.

> no error correction

The wonder, to my mind, is that it’s still an argument whether to use ECC RAM in any but the lowest-end machines. You should have the option to put ECC RAM into any machine down to about the $500 level by simply paying a ~25% premium over the cost of non-ECC RAM, but artificial market segmentation has kept ECC a feature of the server and high-end PC worlds only. This sort of penny-pinching should have gone out of style in the 1990s, for the same reason Ethernet and USB use smarter error correction than RS-232 did. We should have flowed from parity RAM at the high end, to ECC RAM at the high end, to ECC everywhere by now.

> and no timing bits.

Timing bits aren’t needed when you have clock recovery hardware, which, like ECC, is a superior technology that should be universal once transistors become sufficiently cheap. Clock recovery becomes necessary once SNR levels get to the point they are now, where separate clock lines don’t really help any more. You’d have to apply clock-recovery-type techniques to the clock line if you had it, so you might as well apply them to the data and leave the clock line out.

> Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.

Thus the move toward strongly checksummed filesystems like ZFS, btrfs, HAMMER, APFS, and ReFS. Like ECC, this is a battle that should be over by now, but we’re going to see HFS+, NTFS, and extfs hang on for a long time yet because $REASONS.
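The 6-bit packing idea can be sketched roughly as follows. Assumptions not taken from the thread: uppercase-only text, codes obtained by keeping the low 6 bits of the ASCII value, and two characters per 12-bit PDP-8 word; the real DEC formats varied in their exact mapping:

```python
def pack_sixbit(text):
    """Pack two 6-bit characters into each 12-bit word.
    Keeping only the low 6 bits of the ASCII code folds the
    printable range 0x20-0x5F down into 6 bits."""
    codes = [ord(c) & 0o77 for c in text.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad odd-length text with a zero filler
    return [(codes[i] << 6) | codes[i + 1] for i in range(0, len(codes), 2)]

def unpack_sixbit(words):
    """Recover the characters, undoing the fold (codes below 0o40
    came from letters/symbols in the 0x40-0x5F range)."""
    out = []
    for w in words:
        for code in ((w >> 6) & 0o77, w & 0o77):
            out.append(chr(code + 64) if code < 0o40 else chr(code))
    return "".join(out)

words = pack_sixbit("HELLO!")
assert all(0 <= w < 4096 for w in words)   # everything fits in 12-bit words
assert unpack_sixbit(words) == "HELLO!"
```

Six characters in three words instead of six: exactly the kind of saving that mattered when a 4k-word core field was all you had.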
On 27 June 2017 at 18:42, Eric Grange wrote:

> So while in theory all the scenarios you describe are interesting, in practice seeing an utf-8 BOM provides an extremely high likeliness that a file will indeed be utf-8. Not always, but a memory chip could also be hit by a cosmic ray.
>
> Conversely the absence of an utf-8 BOM means a high probability of "something undetermined": ANSI or BOMless utf-8, or something more oddball (in which I lump utf-16 btw)... and the need for heuristics to kick in.

I think we are largely in agreement here (esp. wrt utf-16 being an oddball interchange format). It doesn't answer my question though, i.e. what advantage the BOM provides compared to assuming utf-8 from the outset. Yes, if you see a utf-8 BOM you have immediate confidence that the data is utf-8 encoded, but what have you lost if you start with [fake] confidence and treat the data as utf-8 until proven otherwise? Either the data is utf-8, or ASCII, or ANSI with no high-bit characters, and everything works; or you find an invalid byte sequence, which gives you high confidence that this is not actually utf-8 data. Granted, it requires more than three bytes of lookahead, but we're going to be reading that data anyway.

I guess the one clear advantage I see of a utf-8 BOM is that it can simplify some code, and reduce some duplicate work when interfacing with APIs which both require a text encoding specified up-front and don't offer a convenient error path when decoding fails. But adding utf-8-with-BOM as yet another text encoding configuration to the landscape seems like a high price to pay, and certainly not an overall simplification.

> Outside of source code and Linux config files, BOMless utf-8 are certainly not the most frequent text files, ANSI and other various encodings dominate, because most non-ASCII text files were (are) produced under DOS or Windows, where notepad and friends use ANSI by default f.i.

Notepad barely counts as a text editor (newlines are always two bytes long, yeah? :P), but I take your point that ANSI is common (especially CP1251?). I've honestly never seen a utf-8 file *with* a BOM though, so perhaps I've lived a sheltered life.

I'm not sure what you were going for here:

> the overwhelming majority of text content are likely to involve ASCII at the beginning (from various markups, think html, xml, json, source code... even csv

HTML's encoding is generally specified in the HTTP header or metadata. XML's encoding must be specified on the first line (unless the default utf-8 is used or a BOM is present). JSON's encoding must be either utf-8, utf-16 or utf-32. Source code encoding is generally defined by the language in question.

> That may not be a desirable or happy situation, but that is the situation we have to deal with.

True, we're stuck with decisions of the past. I guess (and maybe I've finally understood your position?) if a BOM had been mandated for _all_ utf-8 data from the outset, to clearly distinguish it from pre-existing ANSI codepages, then I could see its value. Although I remain a little revulsed by having those three little bytes at the front of all my files to solve what is predominantly a transport issue ;)

-Rowan
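Rowan's "treat it as utf-8 until proven otherwise" approach can be sketched in a few lines using Python's built-in strict decoder; the function name is illustrative, not anything from SQLite:

```python
def sniff_utf8(data: bytes) -> str:
    """Optimistically decode as UTF-8; fall back only on hard evidence.

    If every byte sequence is valid UTF-8, accept the optimistic guess
    (this also covers pure ASCII, which is a subset of UTF-8).
    An invalid sequence is the "proven otherwise" signal that sends us
    to codepage heuristics or to the user.
    """
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"  # fall back to heuristics / ask the user

assert sniff_utf8(b"plain ASCII") == "utf-8"
assert sniff_utf8("héllo".encode("utf-8")) == "utf-8"
assert sniff_utf8(b"\xe9llo") == "unknown"  # lone Latin-1 0xE9 is invalid UTF-8
```

This is the "more than three bytes of lookahead" trade-off in concrete form: the check consumes the whole stream, but as Rowan notes, the data was going to be read anyway.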
On 6/27/17, 4:02 PM, "sqlite-users on behalf of Keith Medcalf" wrote:

> Nowadays we use 8 bits for data with no parity, no error correction, and no timing bits. Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.

Nowadays we use packet checksums and retransmission of corrupted or missing packets.
On Tue, Jun 27, 2017 at 4:02 PM, Keith Medcalf wrote:

>> If an implementation "uses" 8 bits for ASCII text (as opposed to hardware storage which is never less than 8 bits for a single C char, AFAIK), then it is not a valid ASCII implementation, i.e. does not interpret ASCII according to its definition. The whole point of specifying a format as 7 bits is that the 8th bit is ignored, or perhaps used in an implementation-defined manner, regardless of whether the 8th bit in a char is available or not.
>
> ASCII was designed back in the days of low reliability serial communications -- you know, back when data was sent using 7 bit data + 1 parity bit + 2 stop bits -- to increase the reliability of the communications. A "byte" was also 9 bits: 8 bits of data and a parity bit.
>
> Nowadays we use 8 bits for data with no parity, no error correction, and no timing bits. Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.

Actually, most _enterprise_ level storage & transmission facilities have error detection and correction codes which are "transparent" to the programmer. Almost everybody knows about RAID arrays, which (other than JBOD) either have "parity" (RAID5 is an example) or are "mirrored" (RAID1). Most have also heard of ECC RAM. But I'll bet that few have heard of RAIM memory, which is used on the IBM z series of computers: Redundant Array of Independent Memory. This is basically "RAID 5" memory. In addition to the RAID-ness, it still uses ECC as well. Also, unlike with an Intel machine, if an IBM z suffers a "memory failure", there is usually the ability for the _hardware_ to recover all the data in the memory module ("block") and transparently copy it to a "phantom" block of memory, which then takes the place of the block which contains the error. All without host software intervention.

https://www.ibm.com/developerworks/community/blogs/e0c474f8-3aad-4f01-8bca-f2c12b576ac9/entry/IBM_zEnterprise_redundant_array_of_independent_memory_subsystem
> If an implementation "uses" 8 bits for ASCII text (as opposed to hardware storage which is never less than 8 bits for a single C char, AFAIK), then it is not a valid ASCII implementation, i.e. does not interpret ASCII according to its definition. The whole point of specifying a format as 7 bits is that the 8th bit is ignored, or perhaps used in an implementation-defined manner, regardless of whether the 8th bit in a char is available or not.

ASCII was designed back in the days of low reliability serial communications -- you know, back when data was sent using 7 bit data + 1 parity bit + 2 stop bits -- to increase the reliability of the communications. A "byte" was also 9 bits: 8 bits of data and a parity bit.

Nowadays we use 8 bits for data with no parity, no error correction, and no timing bits. Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.
On Tue, 2017-06-27 at 16:38 +0200, Eric Grange wrote:

>> ASCII / ANSI is a 7-bit format.
>
> ASCII is a 7 bit encoding, but uses 8 bits in just about any implementation out there. I do not think there is any 7 bit implementation still alive outside of legacy mode for low-level wire protocols (RS232 etc.). I personally have never encountered a 7 bit ASCII file (as in bitpacked); I am curious if any exists?

If an implementation "uses" 8 bits for ASCII text (as opposed to hardware storage, which is never less than 8 bits for a single C char, AFAIK), then it is not a valid ASCII implementation, i.e. it does not interpret ASCII according to its definition. The whole point of specifying a format as 7 bits is that the 8th bit is ignored, or perhaps used in an implementation-defined manner, regardless of whether the 8th bit in a char is available or not. Once an encoding embraces 8 bits, it will be something like CP1252, ISO-8859-x, KOI8-R, etc. Just not ASCII.
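The "8th bit is ignored" reading can be illustrated in a line or two: mask (or test) the high bit before treating a byte as ASCII. A minimal sketch with made-up function names:

```python
def to_ascii(data: bytes) -> bytes:
    """Interpret bytes as 7-bit ASCII: the 8th (high) bit is ignored,
    so e.g. mark-parity bytes fold back onto their ASCII values."""
    return bytes(b & 0x7F for b in data)

def is_strict_ascii(data: bytes) -> bool:
    """True only if no byte actually uses the 8th bit."""
    return all(b < 0x80 for b in data)

# 0xC1 is 'A' (0x41) with the parity bit set, as mark parity would send it:
assert to_ascii(b"\xc1\xc2\xc3") == b"ABC"
assert is_strict_ascii(b"ABC") and not is_strict_ascii(b"\xc1BC")
```

The same mask is exactly what an 8-bit codepage must *not* apply, which is why a CP1252 or ISO-8859-x stream run through a "valid ASCII implementation" loses information.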
> ASCII / ANSI is a 7-bit format.

ASCII is a 7 bit encoding, but uses 8 bits in just about any implementation out there. I do not think there is any 7 bit implementation still alive outside of legacy mode for low-level wire protocols (RS232 etc.). I personally have never encountered a 7 bit ASCII file (as in bitpacked); I am curious if any exists?

ANSI has no precise definition; it's used to lump together all the <= 8 bit legacy encodings (cf. https://en.wikipedia.org/wiki/ANSI_character_set).

On Tue, Jun 27, 2017 at 1:53 PM, Simon Slavin wrote:

> On 27 Jun 2017, at 7:12am, Rowan Worth wrote:
>
>> In fact using this assumption we could dispense with the BOM entirely for UTF-8 and drop case 5 from the list.
>
> If you do that, you will try to process the BOM at the beginning of a UTF-8 stream as if it is characters.
>
>> So my question is, what advantage does a BOM offer for UTF-8? What other cases can we identify with the information it provides?
>
> Suppose your software processes only UTF-8 files, but someone feeds it a file which begins with FE FF. Your software should recognise this and reject the file, telling the user/programmer that it can’t process it because it’s in the wrong encoding.
>
> Processing BOMs is part of the work you have to do to make your software Unicode-aware. Without it, your documentation should state that your software handles the one flavour of Unicode it handles, not Unicode in general. There’s nothing wrong with this, if it’s all the programmer/user needs, as long as it’s correctly documented.
>
> Simon.
On 27 Jun 2017, at 7:12am, Rowan Worth wrote:

> In fact using this assumption we could dispense with the BOM entirely for UTF-8 and drop case 5 from the list.

If you do that, you will try to process the BOM at the beginning of a UTF-8 stream as if it is characters.

> So my question is, what advantage does a BOM offer for UTF-8? What other cases can we identify with the information it provides?

Suppose your software processes only UTF-8 files, but someone feeds it a file which begins with FE FF. Your software should recognise this and reject the file, telling the user/programmer that it can’t process it because it’s in the wrong encoding.

Processing BOMs is part of the work you have to do to make your software Unicode-aware. Without it, your documentation should state that your software handles the one flavour of Unicode it handles, not Unicode in general. There’s nothing wrong with this, if it’s all the programmer/user needs, as long as it’s correctly documented.

Simon.
On Tue, 2017-06-27 at 12:42 +0200, Eric Grange wrote:

> In the real world, text files are heavily skewed towards 8 bit formats, meaning just three cases dominate the debate:
> - ASCII / ANSI
> - utf-8 with BOM
> - utf-8 without BOM

ASCII / ANSI is a 7-bit format.
> In case 7 we have little choice but to invoke heuristics or defer to the user, yes?

Yes in theory, but "no" in the real world, or rather "not in any way that matters".

In the real world, text files are heavily skewed towards 8 bit formats, meaning just three cases dominate the debate:
- ASCII / ANSI
- utf-8 with BOM
- utf-8 without BOM

And further, the overwhelming majority of text content is likely to involve ASCII at the beginning (from various markups, think html, xml, json, source code... even csv, because of explicit separator specification or 1st column name).

So while in theory all the scenarios you describe are interesting, in practice seeing an utf-8 BOM provides an extremely high likeliness that a file will indeed be utf-8. Not always, but a memory chip could also be hit by a cosmic ray.

Conversely, the absence of an utf-8 BOM means a high probability of "something undetermined": ANSI or BOMless utf-8, or something more oddball (in which I lump utf-16, btw)... and the need for heuristics to kick in.

Outside of source code and Linux config files, BOMless utf-8 files are certainly not the most frequent text files; ANSI and various other encodings dominate, because most non-ASCII text files were (are) produced under DOS or Windows, where notepad and friends use ANSI by default, f.i.

That may not be a desirable or happy situation, but that is the situation we have to deal with. It is also the reason why 20 years later the utf-8 BOM is still in use: it is explicit and has a practical success rate higher than any of the heuristics, while collisions of the BOM with actual ANSI (or other) text starts are unheard of.

On Tue, Jun 27, 2017 at 10:34 AM, Robert Hairgrove wrote:

> On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
>
>> The original issue was two of the largest companies in the world output the Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8 encoded text streams, and it would be friendly for the SQLite3 shell to skip it or use it for encoding identification in at least some cases.
>
> I would suggest adding a command-line argument to the shell indicating whether to ignore a BOM or not, possibly requiring specification of a certain encoding or list of encodings to consider.
>
> Certainly this should not be a requirement for the library per se, but a responsibility of the client to provide data in the proper encoding.
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:

> The original issue was two of the largest companies in the world output the Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8 encoded text streams, and it would be friendly for the SQLite3 shell to skip it or use it for encoding identification in at least some cases.

I would suggest adding a command-line argument to the shell indicating whether to ignore a BOM or not, possibly requiring specification of a certain encoding or list of encodings to consider.

Certainly this should not be a requirement for the library per se, but a responsibility of the client to provide data in the proper encoding.
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:

> On Jun 27, 2017 12:13 AM, "Rowan Worth" wrote:
>
> I'm sure I've simplified things with this description - have I missed something crucial? Is the BOM argument about future proofing? Are we worried about EBCDIC? Is my perspective too anglo-centric?

Thanks, Scott -- nothing crucial, it is already quite good enough for 99% of use cases. The Wikipedia page on "Byte Order Marks" appears to be quite comprehensive and lists about a dozen possible BOM sequences: https://en.wikipedia.org/wiki/Byte_order_mark

Lacking a BOM, I would certainly try to rule out UTF-8 right away by searching for invalid UTF-8 characters within a reasonably large portion of the input (maybe 100-300KB?) before then looking for any NULL bytes (which are also invalid UTF-8 except as a delimiter) or other random control characters.

As to having the user specify an encoding when dealing with something which should be text (CSV files, for example) and processing files which the user has specified, there is always the possibility that the encoding is different from what the user says, mainly because they probably clicked on a spreadsheet file with a similar name instead of the desired text file. If the user specifies an 8-bit encoding other than Unicode, it gets very difficult to trap wrong input unless you write routines to search for invalid characters (e.g. distinguishing between true ISO-8859-x and CP1252).
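The ISO-8859-1 vs. CP1252 distinction mentioned above mostly comes down to the 0x80-0x9F range: those bytes are C1 control codes in ISO-8859-1, which almost never appear in real text, but are printable characters (curly quotes, dashes, the euro sign) in CP1252. A minimal heuristic sketch with a made-up function name:

```python
def looks_like_cp1252(data: bytes) -> bool:
    """Heuristic: bytes in 0x80-0x9F are C1 control codes in ISO-8859-1,
    which real documents rarely contain, but map to printable characters
    (curly quotes, en/em dashes, the euro sign) in CP1252."""
    return any(0x80 <= b <= 0x9F for b in data)

# "Smart quotes" as produced by Windows tools are 0x93/0x94 in CP1252:
assert looks_like_cp1252(b"\x93quoted\x94")
# Plain Latin-1 accented text stays out of the C1 range:
assert not looks_like_cp1252("caf\u00e9".encode("latin-1"))
```

This is only a one-way signal, of course: text that avoids the 0x80-0x9F range decodes identically under both encodings, so in that case the distinction does not matter.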
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On Jun 27, 2017 12:13 AM, "Rowan Worth" wrote:
> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?

The original issue was that two of the largest companies in the world output the Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8 encoded text streams, and it would be friendly for the SQLite3 shell to skip it or use it for encoding identification in at least some cases.
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On 26 June 2017 at 19:03, Eric Grange wrote:
> No BOM = you have to fire a whole suite of heuristics or present the user
> with choices he/she will not understand.

Requiring heuristics to determine text encoding/codepage exists regardless of whether a BOM is used, since the problem predates Unicode altogether. Let's consider the scenarios - as Simon enumerated, we're roughly interested in cases where the data stream begins with these byte sequences:

(1) 0x00 0x00 0xFE 0xFF
(2) 0xFF 0xFE 0x00 0x00
(3) 0xFE 0xFF
(4) 0xFF 0xFE
(5) 0xEF 0xBB 0xBF
(6) anything else (and the data stream is ASCII or UTF-8)
(7) anything else (and the data stream is some random codepage)

In case 7 we have little choice but to invoke heuristics or defer to the user, yes? For the first 5 cases we can immediately deduce some facts:

(1) -> almost certainly UTF-32BE, although if NUL characters may be present, 8-bit codepages are still a candidate
(2) -> almost certainly UTF-32LE, although if NUL characters may be present, UTF-16LE and 8-bit codepages are still candidates
(3) -> likely UTF-16BE, but could be some other 8-bit codepage
(4) -> likely UTF-16LE, but could be some other 8-bit codepage
(5) -> almost certainly UTF-8, but could be some other 8-bit codepage

I observe that a BOM never provides perfect confidence regarding the encoding, although in practice I expect it would only fail on data specifically designed to fool it. I also suggest that the checks ought to be performed in the order listed, to avoid categorising UTF-32LE text as UTF-16LE, and because the first 4 cases rule out [valid] UTF-8 data.

Now let's say we make an assumption that all text is UTF-8 until proven otherwise. In case 6 we get lucky and everything works, and in case 7 we find invalid characters and fall back to heuristics or the user to identify the encoding. In fact, using this assumption we could dispense with the BOM entirely for UTF-8 and drop case 5 from the list.

So my question is: what advantage does a BOM offer for UTF-8? What other cases can we identify with the information it provides? If you were going to jump straight from case 5 to case 7 in the absence of a BOM, it seems like you might as well give UTF-8 a try, since it and ASCII are far and away the common case.

I'm sure I've simplified things with this description - have I missed something crucial? Is the BOM argument about future proofing? Are we worried about EBCDIC? Is my perspective too anglo-centric?

> After 20 years, the choice is between doing the best in an imperfect world,
> or perpetuating the issue and blaming others.

By being scalable and general enough to represent all desired characters, as I see it UTF-8 is not perpetuating any issues but rather offering an out from historic codepage woes (by adopting it as the go-to interchange format). As Peter da Silva said:

> It's not the UTF-8 storage that's the mess, it's the non-UTF-8 storage.

-Rowan
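[Editor's note: the ordered check Rowan describes can be sketched as a small sniffing function. The ordering matters: the 4-byte UTF-32 marks must be tested before the 2-byte UTF-16 ones, because FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE). Names and encoding labels are illustrative.]

```python
# BOMs listed longest-prefix-first so UTF-32LE (FF FE 00 00) is not
# misidentified as UTF-16LE (FF FE). Encoding labels are Python codec names.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if a known BOM prefixes the data,
    else (None, 0) -- i.e. cases 6 and 7, where heuristics take over."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

As the message notes, a match is strong evidence but not proof: the same byte sequences are legal in various 8-bit codepages, so a cautious importer would still validate the rest of the stream against the sniffed encoding.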