Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-30 Thread Niall O'Reilly

On 29 Jun 2017, at 20:19, Peter da Silva wrote:

The DECsystem 10 guys also referred to the other subdivisions of their 
36-bit words as bytes; sometimes they could be 6, 7, 8, or 9 bits 
long. I think they had special instructions for operating on them, but 
they weren’t directly addressable.


  A byte could be 1..36 bits long.

  The special instructions used a data structure called a byte pointer
  to reference the field within a word where the byte was to be placed
  or retrieved.  Four different formats of byte pointer existed, not all
  supporting the full range of possible byte sizes.
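
  For illustration, here is a C sketch of the kind of extraction such an
  instruction performed.  The position/size encoding below is invented
  for the example and is not the machine's actual byte-pointer layout:

      #include <stdint.h>
      #include <stdio.h>

      /* Illustrative only: pull an arbitrary-width "byte" of 'size' bits,
         starting 'pos' bits from the right, out of a 36-bit word held in
         the low bits of a uint64_t. */
      static uint64_t extract_byte(uint64_t word, unsigned pos, unsigned size)
      {
          uint64_t mask = (size >= 64) ? ~0ULL : ((1ULL << size) - 1);
          return (word >> pos) & mask;
      }

      int main(void)
      {
          uint64_t word = 0765432101234ULL;   /* a 36-bit word, in octal */
          /* six 6-bit bytes, left to right */
          for (int i = 5; i >= 0; i--)
              printf("%02llo ",
                     (unsigned long long)extract_byte(word, 6u * (unsigned)i, 6));
          printf("\n");
          return 0;
      }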

  One of these days, when I really have too much free time, I must run
  up a VM with the Panda TOPS-20 distro and find some examples of
  interesting byte sizes which were actually used for something. 8-)

  /Niall


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-29 Thread Peter da Silva
I always saw byte as something that was relevant for systems that could address 
objects smaller than words... “byte addressed” machines. The term was mnemonic 
for something bigger than a bit and smaller than a word. It was usually 8 bits, 
but there were 36-bit machines that were byte addressable 9 bits at a time. 
The DECsystem 10 guys also referred to the other subdivisions of their 36-bit 
words as bytes; sometimes they could be 6, 7, 8, or 9 bits long. I think they 
had special instructions for operating on them, but they weren’t directly 
addressable.

There was also a “nibble”, smaller than a “byte”, which was always 4 bits (one 
hex digit). I don’t think any of the octal people used the word for their 
three-bit digits.
 



Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-29 Thread Simon Slavin


On 29 Jun 2017, at 8:01pm, Warren Young  wrote:

> We wouldn’t have needed the term “octet” if “byte” always meant “8 bits”.

The terms "octet" and "decade" were prevalent across Europe in the 1970s.  
Certainly we used them both when I was first learning about computers.  My 
impression back then was that the term "byte" was the American word for "octet".

The web in general seems to agree with you, not me.  It seems that a word was 
made of bytes, and bytes were made of nybbles, and nybbles were made of bits, 
and that how many of which went into what depended on which platform you were 
talking about.  This contradicts my computing teacher who taught me the "by 
eight" definition of "byte".

Simon.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-29 Thread Warren Young
On Jun 29, 2017, at 11:18 AM, Simon Slavin  wrote:
> 
> On 29 Jun 2017, at 5:39pm, Warren Young  wrote:
> 
>> Before roughly the mid 1970s, the size of a byte was whatever the computer 
>> or communications system designer said it was.
> 
> You mean the size of a word.

That, too.  Again I give the example of a 12-bit PDP-8 storing 6-bit packed 
ASCII text.  The word size is 12, and the byte size is 6.

The same machine could instead store 7-bit ASCII from the ASR-33 in its 12-bit 
words, and we could then speak of 7-bit bytes and 12-bit words.  This, too, was 
a thing in the PDP-8 world, though rarer, since the core memory field size was 
4k words, and the base machine config only had the one field, so 5 wasted bits 
per character was a painful hit.

> The word "byte" means "by eight”.

I failed to find that in an English corpus search.[1]  A search for “by eight” 
turns up hundreds of results (apparently limited to 600 by the search engine) 
but none of the matches is near “byte.”  A search for “by-eight” turns up only 
one result, also irrelevant.

I suspect the earliest print reference to that definition would be much later 
than the actual coinage of the word in 1956 by Werner Buchholz, making it a 
back-formation.  I’d expect to find that definition in print only after the 
microcomputer revolution that nailed the 8-bit byte into place.

Further counter-citations:

   https://stackoverflow.com/questions/13615764/
   https://en.wikipedia.org/wiki/Byte#History
   https://en.wikipedia.org/wiki/Talk:Byte#Byte_.3D_By-Eight.3F
   https://english.stackexchange.com/questions/121127/etymology-of-byte

I wish I could find a copy of 

   Buchholz, W., January 1981:
   "Origin of the Word 'Byte.'" 
   IEEE Annals of the History of Computing, 3, 1: p. 72 

that is not behind a paywall, as Buchholz is the man who coined the word for 
the IBM 7030 “Stretch,” which had a variable byte size.  It used 8-bit bytes 
for I/O, but it had variable-width bytes internally.

We wouldn’t have needed the term “octet” if “byte” always meant “8 bits”.


[1]: http://corpus.byu.edu/coca/

> With each bit of storage costing around 100,000 times what it does now

A bit of trivia I dropped during editing from the prior post: a 5 MB RK05 disk 
drive cost about the same as a luxury car.  (About US $40,000 today after CPI 
adjustment.)

Cadillac with all the options or RK05?  Let me think…RK05!


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-29 Thread John McKown
On Thu, Jun 29, 2017 at 12:18 PM, Simon Slavin  wrote:

> A couple of minor comments.
>
> On 29 Jun 2017, at 5:39pm, Warren Young  wrote:
>
> > Before roughly the mid 1970s, the size of a byte was whatever the
> computer or communications system designer said it was.
>
> You mean the size of a word.  The word "byte" means "by eight".  It did
> not always mean 7 bits of data and one parity bit, but it was always 8 bits
> in total.
>
> > A common example would be a Teletype Model 33 ASR hardwired by DEC for
> transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity
>
> Thank you for mentioning that.  First computer terminal I ever used.  I
> think I still have some of the paper tape somewhere.
>
> > The 8-bit byte standard — and its even multiples — is relatively recent
> in computing history.  You can point to early examples like the 32-bit IBM
> 360 and later ones like the 16-bit Data General Nova and DEC PDP-11, but I
> believe it was the flood of 8-bit microcomputers in the mid to late 1970s
> that finally and firmly associated “byte” with “8 bits”.
>
> Again, the word you want is "word".  There were architectures with all
> sorts of weird word sizes.  "byte" always meant "by eight" and was a
> synonym for "octet".
>
> As Warren wrote, words did not always encode text as 8 bits per
> character.  Computers with 16-bit word sizes might encode ASCII as three
> 5-bit characters plus a parity bit, or use two 16-bit words for five 6-bit
> characters plus 2 meta-bits.  With each bit of storage costing around
> 100,000 times what it does now, and taking 10,000 times the time to move
> across your communications network, there was a wide variety of ingenious
> ways to save a bit here and a bit there.
>
> Simon.
>
>
In today's world, you are completely correct. However, according to
Wikipedia (https://en.wikipedia.org/wiki/Byte_addressing), there was at
least one machine (Honeywell) which had a 36-bit word which was divided
into 9-bit "bytes" (i.e. an address pointed to a 9-bit "byte").


-- 
Veni, Vidi, VISA: I came, I saw, I did a little shopping.

Maranatha! <><
John McKown


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-29 Thread Simon Slavin


On 29 Jun 2017, at 6:18pm, Simon Slavin  wrote:

> Computers with 16-bit word sizes might encode ASCII as three 5-bit characters

Where I wrote "ASCII" I should have written "text".

Simon.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-29 Thread Simon Slavin
A couple of minor comments.

On 29 Jun 2017, at 5:39pm, Warren Young  wrote:

> Before roughly the mid 1970s, the size of a byte was whatever the computer or 
> communications system designer said it was.

You mean the size of a word.  The word "byte" means "by eight".  It did not 
always mean 7 bits of data and one parity bit, but it was always 8 bits in 
total.

> A common example would be a Teletype Model 33 ASR hardwired by DEC for 
> transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity

Thank you for mentioning that.  First computer terminal I ever used.  I think I 
still have some of the paper tape somewhere.

> The 8-bit byte standard — and its even multiples — is relatively recent in 
> computing history.  You can point to early examples like the 32-bit IBM 360 
> and later ones like the 16-bit Data General Nova and DEC PDP-11, but I 
> believe it was the flood of 8-bit microcomputers in the mid to late 1970s 
> that finally and firmly associated “byte” with “8 bits”.

Again, the word you want is "word".  There were architectures with all sorts of 
weird word sizes.  "byte" always meant "by eight" and was a synonym for "octet".

As Warren wrote, words did not always encode text as 8 bits per character.  
Computers with 16-bit word sizes might encode ASCII as three 5-bit characters 
plus a parity bit, or use two 16-bit words for five 6-bit characters plus 2 
meta-bits.  With each bit of storage costing around 100,000 times what it does 
now, and taking 10,000 times the time to move across your communications 
network, there was a wide variety of ingenious ways to save a bit here and a 
bit there.

Simon.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-29 Thread Warren Young
On Jun 27, 2017, at 3:02 PM, Keith Medcalf  wrote:
> 
>> The whole point of
>> specifying a format as 7 bits is that the 8th bit is ignored, or
>> perhaps used in an implementation-defined manner, regardless of whether
>> the 8th bit in a char is available or not.
> 
> ASCII was designed back in the days of low reliability serial communications 
> -- you know, back when data was sent using 7 bit data + 1 parity bit + 2 
> stop bits -- to increase the reliability of the communications.  A "byte" was 
> also 9 bits.  8 bits of data and a parity bit.

Before roughly the mid 1970s, the size of a byte was whatever the computer or 
communications system designer said it was.  Even within a single computer + 
serial comm system, the definitions could differ.  For this reason, we also 
have the term “octet,” which unambiguously means an 8-bit unit of data.

The 9-bit byte is largely a DEC-ism, since their pre-PDP-11 machines used a 
word size that was an integer multiple of 6 or 12.  DEC had 12-bit machines, 
18-bit machines, and 36-bit machines.  There was even a plan for a 24-bit 
design at one point.

A common example would be a Teletype Model 33 ASR hardwired by DEC for 
transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity, fed by a 
12-bit PDP-8 pulling that text off an RK05 cartridge disk from a file encoded 
in a 6-bit packed ASCII format.

6-bit packed ASCII schemes were common at the time: to efficiently store plain 
text in the native 12-, 18-, or 36-bit words, programmers would drop most of 
the control characters and punctuation, as well as either dropping or 
shift-encoding lowercase.
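
As a concrete sketch of one such scheme (DEC-style SIXBIT, where printable
ASCII 0x20..0x5F maps to 0..0x3F; the helper name is mine, not DEC's), two
characters pack into one 12-bit PDP-8 word:

    #include <ctype.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Fold to uppercase, subtract 0x20, and pack two 6-bit codes into
       the low 12 bits of a uint16_t, high character first. */
    static uint16_t pack_sixbit_pair(char a, char b)
    {
        uint16_t hi = (uint16_t)(toupper((unsigned char)a) - 0x20) & 0x3F;
        uint16_t lo = (uint16_t)(toupper((unsigned char)b) - 0x20) & 0x3F;
        return (uint16_t)((hi << 6) | lo);
    }

    int main(void)
    {
        printf("%04o\n", pack_sixbit_pair('h', 'i'));  /* prints 5051 */
        return 0;
    }

The lossiness is visible right in the fold: lowercase, most control
characters, and anything above 0x5F simply have no code to land on.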

That isn’t an innovation from the DEC world, either: Émile Baudot came up with 
basically the same idea in his eponymous 5-bit telegraph code in 1870.  You 
could well say that Baudot code uses 5-bit bytes.  (This is also where the data 
communications unit “baud” comes from.)

The 8-bit byte standard — and its even multiples — is relatively recent in 
computing history.  You can point to early examples like the 32-bit IBM 360 and 
later ones like the 16-bit Data General Nova and DEC PDP-11, but I believe it 
was the flood of 8-bit microcomputers in the mid to late 1970s that finally and 
firmly associated “byte” with “8 bits”.

> Nowadays we use 8 bits for data with no parity

True parity bits (as opposed to mark/space parity) can only detect a 1-bit 
error.  We dropped parity checks when the data rates rose and SNR levels fell 
to the point that single-bit errors were a frequent occurrence, making parity 
checks practically useless.
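
To make that limitation concrete, here is a minimal C sketch of both
framings: a computed even-parity bit versus the fixed mark-parity bit from
the ASR-33 example earlier (illustrative only; real UARTs do this in
hardware):

    #include <stdint.h>
    #include <stdio.h>

    /* Even parity: set bit 7 so the total number of 1 bits is even. */
    static uint8_t even_parity(uint8_t c7)
    {
        uint8_t p = c7;
        p ^= p >> 4; p ^= p >> 2; p ^= p >> 1;   /* XOR of the data bits */
        return (uint8_t)(c7 | ((p & 1u) << 7));
    }

    /* Mark parity: bit 7 is always 1, no error detection at all. */
    static uint8_t mark_parity(uint8_t c7)
    {
        return (uint8_t)(c7 | 0x80);
    }

    int main(void)
    {
        uint8_t c = 'A' & 0x7F;   /* 0x41 has two 1 bits, already even */
        printf("even: %02X  mark: %02X\n", even_parity(c), mark_parity(c));
        return 0;
    }

One flipped bit changes the XOR of the data bits, so the receiver notices;
two flipped bits cancel out, which is exactly the failure mode above.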

> no error correction

The wonder, to my mind, is that it’s still an argument whether to use ECC RAM 
in any but the lowest-end machines.  You should have the option to put ECC RAM 
into any machine down to about the $500 level by simply paying a ~25% premium 
on the option cost for non-ECC RAM, but artificial market segmentation has kept 
ECC a feature of the server and high-end PC worlds only.

This sort of penny-pinching should have gone out of style in the 1990s, for the 
same reason Ethernet and USB use smarter error correction than did RS-232.

We should have flowed from parity RAM at the high end to ECC RAM at the high 
end to ECC everywhere by now.

> and no timing bits.

Timing bits aren’t needed when you have clock recovery hardware, which, like 
ECC, is a superior technology that should be universal once transistors become 
sufficiently cheap.

Clock recovery becomes necessary once SNR levels get to the point they are now, 
where separate clock lines don’t really help any more.  You’d have to apply 
clock recovery type techniques to the clock line if you had it, so you might as 
well apply it to the data and leave the clock line out.

> Cuz when things screw up we want them to REALLY screw up ... and remain 
> undetectable.

Thus the move toward strongly checksummed filesystems like ZFS, btrfs, HAMMER, 
APFS, and ReFS.

Like ECC, this is a battle that should be over by now, but we’re going to see 
HFS+, NTFS, and extfs hang on for a long time yet because $REASONS.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-28 Thread Rowan Worth
On 27 June 2017 at 18:42, Eric Grange  wrote:

> So while in theory all the scenarios you describe are interesting, in
> practice seeing an utf-8 BOM provides an extremely
> high likeliness that a file will indeed be utf-8. Not always, but a memory
> chip could also be hit by a cosmic ray.
>
> Conversely the absence of an utf-8 BOM means a high probability of
> "something undetermined": ANSI or BOMless utf-8,
> or something more oddball (in which I lump utf-16 btw)... and the need for
> heuristics to kick in.
>

I think we are largely in agreement here (esp. wrt utf-16 being an oddball
interchange format).

It doesn't answer my question though, i.e. what advantage the BOM tag
provides compared to assuming utf-8 from the outset. Yes, if you see a utf-8
BOM you have immediate confidence that the data is utf-8 encoded, but what
have you lost if you start with [fake] confidence and treat the data as
utf-8 until proven otherwise?

Either the data is utf-8, or ASCII, or ANSI but with no high-bit characters
and everything works, or you find an invalid byte sequence which gives you
high confidence that this is not actually utf-8 data. Granted it requires
more than three bytes of lookahead, but we're going to be using that data
anyway.
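
For concreteness, a rough C sketch of that check: scan until the first byte
sequence that cannot be UTF-8 (the helper name and the choice to treat a
truncated trailing sequence as invalid are my own):

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the offset of the first invalid UTF-8 sequence, or -1 if
       the whole buffer decodes cleanly.  Rejects overlong forms,
       surrogates, and code points above U+10FFFF. */
    static long first_invalid_utf8(const uint8_t *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            uint8_t b = buf[i];
            size_t n;            /* continuation bytes expected */
            uint32_t cp, min;
            if (b < 0x80) { i++; continue; }
            else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
            else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
            else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
            else return (long)i;
            if (i + n >= len) return (long)i;    /* truncated at end */
            for (size_t k = 1; k <= n; k++) {
                if ((buf[i + k] & 0xC0) != 0x80) return (long)i;
                cp = (cp << 6) | (buf[i + k] & 0x3F);
            }
            if (cp < min || cp > 0x10FFFF ||
                (cp >= 0xD800 && cp <= 0xDFFF)) return (long)i;
            i += n + 1;
        }
        return -1;
    }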

I guess the one clear advantage I see of a utf-8 BOM is that it can
simplify some code, and reduce some duplicate work when interfacing with
APIs which both require a text encoding specified up-front and don't offer
a convenient error path when decoding fails. But adding utf-8 with BOM as
yet another text encoding configuration to the landscape seems like a high
price to pay, and certainly not an overall simplification.

> Outside of source code and Linux config files, BOMless utf-8 are certainly
> not the most frequent text files, ANSI and
> other various encodings dominate, because most non-ASCII text files were
> (are) produced under DOS or Windows,
> where notepad and friends use ANSI by default f.i.
>

Notepad barely counts as a text editor (newlines are always two bytes long
yeah? :P), but I take your point that ANSI is common (especially CP1251?).
I've honestly never seen a utf-8 file *with* a BOM though, so perhaps I've
lived a sheltered life.

I'm not sure what you were going for here:

> the overwhelming majority of text content are likely to involve ASCII at the
> beginning (from various markups, think html, xml, json, source code... even
> csv


Since HTML's encoding is generally specified in the HTTP header or <meta>
metadata.
XML's encoding must be specified on the first line (unless the default
utf-8 is used or a BOM is present).
JSON's encoding must be either utf-8, utf-16 or utf-32.
Source code encoding is generally defined by the language in question.

That may not be a desirable or happy situation, but that is the situation
> we have to deal with.
>

True, we're stuck with decisions of the past. I guess (and maybe I've
finally understood your position?) if a BOM was mandated for _all_ utf-8
data from the outset to clearly distinguish it from pre-existing ANSI
codepages then I could see its value. Although I remain a little repulsed
by having those three little bytes at the front of all my files to solve
what is predominantly a transport issue ;)

-Rowan


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-28 Thread Peter da Silva
On 6/27/17, 4:02 PM, "sqlite-users on behalf of Keith Medcalf" wrote:
> Nowadays we use 8 bits for data with no parity, no error correction, and no 
> timing bits.  Cuz when things screw up we want them to REALLY screw up ... 
> and remain undetectable.

Nowadays we use packet checksums and retransmission of corrupted or missing 
packets. 
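
For example, a sketch of an RFC 1071-style ones'-complement sum, the kind
of per-packet checksum the TCP/IP family pairs with retransmission (a
generic illustration, not any particular stack's code):

    #include <stddef.h>
    #include <stdint.h>

    static uint16_t inet_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {
            sum += ((uint32_t)data[0] << 8) | data[1];  /* 16-bit words */
            data += 2;
            len -= 2;
        }
        if (len)
            sum += (uint32_t)data[0] << 8;       /* odd byte, zero-padded */
        while (sum >> 16)
            sum = (sum & 0xFFFF) + (sum >> 16);  /* fold carries back in */
        return (uint16_t)~sum;
    }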



Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread John McKown
On Tue, Jun 27, 2017 at 4:02 PM, Keith Medcalf  wrote:

>
> > If an implementation "uses" 8 bits for ASCII text (as opposed to
> > hardware storage which is never less than 8 bits for a single C char,
> > AFAIK), then it is not a valid ASCII implementation, i.e. does not
> > interpret ASCII according to its definition. The whole point of
> > specifying a format as 7 bits is that the 8th bit is ignored, or
> > perhaps used in an implementation-defined manner, regardless of whether
> > the 8th bit in a char is available or not.
>
> ASCII was designed back in the days of low reliability serial
> communications -- you know, back when data was sent using 7 bit data + 1
> parity bit + 2 stop bits -- to increase the reliability of the
> communications.  A "byte" was also 9 bits.  8 bits of data and a parity bit.
>
> Nowadays we use 8 bits for data with no parity, no error correction, and
> no timing bits.  Cuz when things screw up we want them to REALLY screw up
> ... and remain undetectable.
>

Actually, most _enterprise_ level storage & transmission facilities have
error detection and correction codes which are "transparent" to the
programmer. Almost everybody knows about RAID arrays which (other than
JBOD) are either "parity" protected (RAID5 is an example) or "mirrored"
(RAID1). Most have also heard of ECC RAM memory. But I'll bet that few have
heard of RAIM memory, which is used on the IBM z series of computers:
Redundant Array of Independent Memory. This is basically "RAID 5" memory.
In addition to the RAID-ness, it still uses ECC as well. Also, unlike with
an Intel machine, if an IBM z suffers a "memory failure", there is usually
the ability for the _hardware_ to recover all the data in the memory module
("block") and transparently copy it to a "phantom" block of memory, which
then takes the place of the block which contains the error. All without
host software intervention.

https://www.ibm.com/developerworks/community/blogs/e0c474f8-3aad-4f01-8bca-f2c12b576ac9/entry/IBM_zEnterprise_redundant_array_of_independent_memory_subsystem


-- 
Veni, Vidi, VISA: I came, I saw, I did a little shopping.

Maranatha! <><
John McKown


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Keith Medcalf
 
> If an implementation "uses" 8 bits for ASCII text (as opposed to
> hardware storage which is never less than 8 bits for a single C char,
> AFAIK), then it is not a valid ASCII implementation, i.e. does not
> interpret ASCII according to its definition. The whole point of
> specifying a format as 7 bits is that the 8th bit is ignored, or
> perhaps used in an implementation-defined manner, regardless of whether
> the 8th bit in a char is available or not.

ASCII was designed back in the days of low reliability serial communications -- 
you know, back when data was sent using 7 bit data + 1 parity bit + 2 stop 
bits -- to increase the reliability of the communications.  A "byte" was also 9 
bits.  8 bits of data and a parity bit.

Nowadays we use 8 bits for data with no parity, no error correction, and no 
timing bits.  Cuz when things screw up we want them to REALLY screw up ... and 
remain undetectable.







Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 16:38 +0200, Eric Grange wrote:
> > ASCII / ANSI is a 7-bit format.
>
> ASCII is a 7 bit encoding, but uses 8 bits in just about any
> implementation out there. I do not think there is any 7 bit
> implementation still alive outside of legacy mode for low-level wire
> protocols (RS232 etc.). I personally have never encountered a 7 bit
> ASCII file (as in bitpacked); I am curious if any exists?

If an implementation "uses" 8 bits for ASCII text (as opposed to
hardware storage which is never less than 8 bits for a single C char,
AFAIK), then it is not a valid ASCII implementation, i.e. does not
interpret ASCII according to its definition. The whole point of
specifying a format as 7 bits is that the 8th bit is ignored, or
perhaps used in an implementation-defined manner, regardless of whether
the 8th bit in a char is available or not.

Once an encoding embraces 8 bits, it will be something like CP1252,
ISO-8859-x, KOI8-R, etc. Just not ASCII.




Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Eric Grange
> ASCII / ANSI is a 7-bit format.

ASCII is a 7 bit encoding, but uses 8 bits in just about any implementation
out there. I do not think there is any 7 bit implementation still alive
outside of legacy mode for low-level wire protocols (RS232 etc.). I
personally have never encountered a 7 bit ASCII file (as in bitpacked); I
am curious if any exists?

ANSI has no precise definition; it's used to lump together all the <= 8 bit
legacy encodings (cf. https://en.wikipedia.org/wiki/ANSI_character_set)

On Tue, Jun 27, 2017 at 1:53 PM, Simon Slavin  wrote:

>
>
> On 27 Jun 2017, at 7:12am, Rowan Worth  wrote:
>
> > In fact using this assumption we could dispense with the BOM entirely for
> > UTF-8 and drop case 5 from the list.
>
> If you do that, you will try to process the BOM at the beginning of a
> UTF-8 stream as if it is characters.
>
> > So my question is, what advantage does
> > a BOM offer for UTF-8? What other cases can we identify with the
> > information it provides?
>
> Suppose your software processes only UTF-8 files, but someone feeds it a
> file which begins with FE FF.  Your software should recognise this and
> reject the file, telling the user/programmer that it can’t process it
> because it’s in the wrong encoding.
>
> Processing BOMs is part of the work you have to do to make your software
> Unicode-aware.  Without it, your documentation should state that your
> software handles the one flavour of Unicode it handles, not Unicode in
> general.  There’s nothing wrong with this, if it’s all the programmer/user
> needs, as long as it’s correctly documented.
>
> Simon.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Simon Slavin


On 27 Jun 2017, at 7:12am, Rowan Worth  wrote:

> In fact using this assumption we could dispense with the BOM entirely for
> UTF-8 and drop case 5 from the list.

If you do that, you will try to process the BOM at the beginning of a UTF-8 
stream as if it is characters.

> So my question is, what advantage does
> a BOM offer for UTF-8? What other cases can we identify with the
> information it provides?

Suppose your software processes only UTF-8 files, but someone feeds it a file 
which begins with FE FF.  Your software should recognise this and reject the 
file, telling the user/programmer that it can’t process it because it’s in the 
wrong encoding.

Processing BOMs is part of the work you have to do to make your software 
Unicode-aware.  Without it, your documentation should state that your software 
handles the one flavour of Unicode it handles, not Unicode in general.  There’s 
nothing wrong with this, if it’s all the programmer/user needs, as long as it’s 
correctly documented.

Simon.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 12:42 +0200, Eric Grange wrote:
> In the real world, text files are heavily skewed towards 8 bit
> formats,
> meaning just three cases dominate the debate:
> - ASCII / ANSI
> - utf-8 with BOM
> - utf-8 without BOM

ASCII / ANSI is a 7-bit format.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Eric Grange
> In case 7 we have little choice but to invoke heuristics or defer to the
> user, yes?

Yes in theory, but "no" in the real world, or rather "not in any way that
matters".

In the real world, text files are heavily skewed towards 8 bit formats,
meaning just three cases dominate the debate:
- ASCII / ANSI
- utf-8 with BOM
- utf-8 without BOM

And further, the overwhelming majority of text content is likely to
involve ASCII at the beginning (from various markups,
think html, xml, json, source code... even csv, because of explicit
separator specification or 1st column name).

So while in theory all the scenarios you describe are interesting, in
practice seeing a utf-8 BOM provides an extremely
high likelihood that a file will indeed be utf-8. Not always, but a memory
chip could also be hit by a cosmic ray.

Conversely the absence of a utf-8 BOM means a high probability of
"something undetermined": ANSI or BOMless utf-8,
or something more oddball (in which I lump utf-16 btw)... and the need for
heuristics to kick in.

Outside of source code and Linux config files, BOMless utf-8 files are
certainly not the most frequent text files; ANSI and
other various encodings dominate, because most non-ASCII text files were
(are) produced under DOS or Windows,
where notepad and friends use ANSI by default, for instance.

That may not be a desirable or happy situation, but that is the situation
we have to deal with.

It is also the reason why 20 years later the utf-8 BOM is still in use: it
is explicit and has a practical success rate higher
than any of the heuristics, while collisions of the BOM with actual
ANSI (or other) text starts are unheard of.


On Tue, Jun 27, 2017 at 10:34 AM, Robert Hairgrove 
wrote:

> On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> > The original issue was two of the largest companies in the world
> > output the
> > Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of
> > UTF-8
> > encoded text streams, and it would be friendly for the SQLite3 shell
> > to
> > skip it or use it for encoding identification in at least some cases.
>
> I would suggest adding a command-line argument to the shell indicating
> whether to ignore a BOM or not, possibly requiring specification of a
> certain encoding or list of encodings to consider.
>
> Certainly this should not be a requirement for the library per se, but
> a responsibility of the client to provide data in the proper encoding.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> The original issue was two of the largest companies in the world
> output the
> Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of
> UTF-8
> encoded text streams, and it would be friendly for the SQLite3 shell
> to
> skip it or use it for encoding identification in at least some cases.

I would suggest adding a command-line argument to the shell indicating
whether to ignore a BOM or not, possibly requiring specification of a
certain encoding or list of encodings to consider.

Certainly this should not be a requirement for the library per se, but
a responsibility of the client to provide data in the proper encoding.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> On Jun 27, 2017 12:13 AM, "Rowan Worth"  wrote:
> 
> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?

Thanks, Scott -- nothing crucial, it is already quite good enough for
99% of use cases.

The Wikipedia page on "Byte Order Marks" appears to be quite
comprehensive and lists about a dozen possible BOM sequences:

https://en.wikipedia.org/wiki/Byte_order_mark

Lacking a BOM, I would certainly try to rule out UTF-8 right away by
searching for invalid UTF-8 characters within a reasonably large
portion of the input (maybe 100-300KB?) before then looking for any
NULL bytes (which are also invalid UTF-8 except as a delimiter) or
other random control characters.

As to having the user specify an encoding when dealing with something
which should be text (CSV files, for example) and processing files
which the user has specified, there is always the possibility that the
encoding is different than what the user says, mainly because they
probably clicked on a spreadsheet file with a similar name instead of
the desired text file. If the user specifies an 8-bit encoding aside
from Unicode, it gets very difficult to trap wrong input unless you
write routines to search for invalid characters (e.g. distinguishing
between true ISO-8859-x and CP1252).
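
One cheap check, sketched in C below: bytes 0x80..0x9F are C1 control codes
in ISO-8859-x and rarely appear in genuine text, while CP1252 assigns most
of them printable characters (curly quotes, the euro sign, etc.).  The
helper is an invented example, not a complete detector:

    #include <stddef.h>
    #include <stdint.h>

    /* Heuristic sketch: any byte in the C1 range suggests CP1252
       rather than ISO-8859-x.  Absence proves nothing. */
    static int looks_like_cp1252(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (buf[i] >= 0x80 && buf[i] <= 0x9F)
                return 1;   /* C1 range in use: probably CP1252 */
        return 0;           /* inconclusive: could be either */
    }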


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Scott Robison
On Jun 27, 2017 12:13 AM, "Rowan Worth"  wrote:

I'm sure I've simplified things with this description - have I missed
something crucial? Is the BOM argument about future proofing? Are we
worried about EBCDIC? Is my perspective too anglo-centric?


The original issue was two of the largest companies in the world output the
Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8
encoded text streams, and it would be friendly for the SQLite3 shell to
skip it or use it for encoding identification in at least some cases.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Rowan Worth
On 26 June 2017 at 19:03, Eric Grange  wrote:

> No BOM = you have to fire a whole suite of heuristics or present the user
> with choices he/she will not understand.
>

Requiring heuristics to determine text encoding/codepage exists regardless
of whether BOM is used since the problem predates unicode altogether. Let's
consider the scenarios - as Simon enumerated, we're roughly interested in
cases where the data stream begins with byte sequences:

(1) 0x00 0x00 0xFE 0xFF
(2) 0xFF 0xFE 0x00 0x00
(3) 0xFE 0xFF
(4) 0xFF 0xFE
(5) 0xEF 0xBB 0xBF
(6) anything else (and datastream is ASCII or UTF-8)
(7) anything else (and datastream is some random codepage)

In case 7 we have little choice but to invoke heuristics or defer to the
user, yes? For the first 5 cases we can immediately deduce some facts:

(1) -> almost certainly UTF-32BE, although if NUL characters may be present
8-bit codepages are still candidates
(2) -> almost certainly UTF-32LE, although if NUL characters may be present
UTF-16LE and 8-bit codepages are still candidates
(3) -> likely UTF16-BE, but could be some other 8-bit codepage
(4) -> likely UTF16-LE, but could be some other 8-bit codepage
(5) -> almost certainly UTF-8, but could be some other 8-bit codepage

I observe that BOM never provides perfect confidence regarding the
encoding, although in practice I expect it would only fail on data
specifically designed to fool it.

I also suggest that the checks ought to be performed in the order listed,
to avoid categorising UTF-32LE text as UTF-16LE, and because the first 4
cases rule out [valid] UTF-8 data.
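
A quick C sketch of sniffing in that order (the enum names are invented for
the example):

    #include <stddef.h>
    #include <stdint.h>

    typedef enum { ENC_UNKNOWN, ENC_UTF32BE, ENC_UTF32LE,
                   ENC_UTF16BE, ENC_UTF16LE, ENC_UTF8 } bom_guess;

    /* The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
       otherwise FF FE 00 00 would be misread as a UTF-16LE BOM. */
    static bom_guess sniff_bom(const uint8_t *b, size_t len)
    {
        if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 &&
                        b[2] == 0xFE && b[3] == 0xFF)
            return ENC_UTF32BE;
        if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE &&
                        b[2] == 0x00 && b[3] == 0x00)
            return ENC_UTF32LE;
        if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF) return ENC_UTF16BE;
        if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE) return ENC_UTF16LE;
        if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return ENC_UTF8;
        return ENC_UNKNOWN;   /* cases 6/7: no BOM, fall back to heuristics */
    }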


Now lets say we make an assumption that all text is UTF-8 until proven
otherwise. In case 6 we get lucky and everything works, and in case 7 we
find invalid characters and fall back to heuristics or the user to identify
the encoding.

In fact using this assumption we could dispense with the BOM entirely for
UTF-8 and drop case 5 from the list. So my question is, what advantage does
a BOM offer for UTF-8? What other cases can we identify with the
information it provides?

If you were going to jump straight from case 5 to case 7 in the absence of
a BOM it seems like you might as well give UTF-8 a try since it and ASCII
are far and away the common case.


I'm sure I've simplified things with this description - have I missed
something crucial? Is the BOM argument about future proofing? Are we
worried about EBCDIC? Is my perspective too anglo-centric?

After 20 years, the choice is between doing the best in an imperfect world,
> or perpetuating the issue and blaming others.
>

By being scalable and general enough to represent all desired characters,
as I see it UTF-8 is not perpetuating any issues but rather offering an out
from historic codepage woes (by adopting it as the go-to interchange
format).

As Peter da Silva said:
> It’s not the UTF-8 storage that’s the mess, it’s the non-UTF-8 storage.

-Rowan