Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread John H. Jenkins
On Dec 10, 2004, at 1:25 PM, Tim Greenwood wrote: Is that like the 'Please RSVP' that I see all too often? Or should that not be excused? Or -- my own personal favorite -- "in the year AD 2004."

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread Mark Davis
<[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Monday, December 13, 2004 08:21 Subject: Re: US-ASCII (was: Re: Invalid UTF-8 sequences) > Mark Davis schrieb: > > This is just a confusion among the hoi polloi. > > And here we have yet another example: "hoi" is Greek

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread Otto Stolz
Mark Davis schrieb: This is just a confusion among the hoi polloi. And here we have yet another example: "hoi" is Greek for "the" ("hoi polloi" = "the many"). Best wishes, Otto Stolz

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Philippe Verdy wrote: > This is a known caveat even for Unix, when you look at the > tricky details of > the support of Windows file sharing through Samba, when the > client requests > a file with a "sho

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Doug Ewell
Michael Everson wrote: > In Ireland sometime in the early nineties, the Allied Irish Bank > became AIB Bank, the Allied Irish Bank Bank. Israel Discount Bank of New York regularly refers to itself as "IDB Bank." -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: > Further, as it turns out that Lars is actually asking for > "standardizing" corrupt UTF-8, a notion that isn't going to > fly even two feet, I think the whole idea is going to be

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Peter Kirk
On 11/12/2004 02:29, Mark Davis wrote: This is just a confusion among the hoi polloi. -Mark But such things happen not just among the German and Swedish polloi, but even in the crowning heights of the English language. The word "cherubims" is used many times in the King James Bible and at leas

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: > Lars responded: > > > > ... Whatever the solutions > > > for representation of corrupt data bytes or uninterpreted data > > > bytes on conversion to Unicode may be, that is ir

Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > Lars Kristan <[EMAIL PROTECTED]> writes: > > > Quite close. Except for the fact that: > > * U+EE93 is represented in UTF-32 as 0xEE93 > > * U

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Johannes Bergerhausen
On 11.12.2004 at 04:32, Clark Cox wrote: There are always the classics: "ATM Machine" and "PIN Number" Here in Germany, they say "ASCII-Code". :-) Johannes

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
John Cowan wrote: > However, although they are *technically* octet sequences, they > are *functionally* character strings.  That's the issue. Nicely put! But UTC does not seem to care. > > > The point I'm

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Michael Everson
At 17:38 -0800 2004-12-10, Asmus Freytag wrote: Other examples of apparent redundancy, are Cakes -> Keks (German), plural Kekse Baby -> bebis (Swedish), plural bebissar and there are many more such examples. In Ireland sometime in the early nineties, the Allied Irish Bank became AIB Bank, the Alli

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Arcane Jill responded: > >> Windows filesystems do know what encoding they use. > >Err, not really. MS-DOS *need to know* the encoding to use, > >a bit like a > >*nix application that displays filenames need

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Clark Cox
On Fri, 10 Dec 2004 13:28:59 -0800, Michael (michka) Kaplan <[EMAIL PROTECTED]> wrote: > From: "Kenneth Whistler" <[EMAIL PROTECTED]> > > > On the other hand, for many English speakers, "RSVP" is simply > > learned as an unanalyzed verb, pronounced "aressveepee", meaning > > "send a response to th

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Asmus Freytag
At 12:50 PM 12/10/2004, Kenneth Whistler wrote: Tim Greenwood asked: > > ... a perfectly normal linguistic process of > > attributive disambiguation of a term which had grown ambiguous > > in usage. > > Is that like the 'Please RSVP' that I see all too often? Or should > that not be excused? *grins

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Mark Davis
This is just a confusion among the hoi polloi. -Mark - Original Message - From: "Asmus Freytag" <[EMAIL PROTECTED]> To: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 17:38 S

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Michael \(michka\) Kaplan
From: "Kenneth Whistler" <[EMAIL PROTECTED]> > On the other hand, for many English speakers, "RSVP" is simply > learned as an unanalyzed verb, pronounced "aressveepee", meaning > "send a response to this message". And to castigate such speakers > for politely prepending a "please" to that verb is

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread John Cowan
Kenneth Whistler scripsit: > On the other hand, for many English speakers, "RSVP" is simply > learned as an unanalyzed verb, pronounced "aressveepee", meaning > "send a response to this message". And to castigate such speakers > for politely prepending a "please" to that verb is a little > too muc

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Kenneth Whistler
Tim Greenwood asked: > > ... a perfectly normal linguistic process of > > attributive disambiguation of a term which had grown ambiguous > > in usage. > > Is that like the 'Please RSVP' that I see all too often? Or should > that not be excused? *grins* Well, technically, that is not a case of at

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Tim Greenwood
On Fri, 10 Dec 2004 12:06:12 -0800 (PST), Kenneth Whistler <[EMAIL PROTECTED]> wrote: > In addition to Doug's historical clarification, you need to > understand this as a perfectly normal linguistic process of > attributive disambiguation of a term which had grown ambiguous > in usage. Is that li

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread Kenneth Whistler
> If any > criticism was present, it referred to the redundant "US-" prefix in > "US-ASCII", not to Unicode, and even that wasn't really criticism, just my > lack of understanding /why/. In addition to Doug's historical clarification, you need to understand this as a perfectly normal linguistic

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-09 Thread Arcane Jill
- Original Message - From: "Arcane Jill" <[EMAIL PROTECTED]> To: "Unicode" <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 7:17 AM Subject: RE: US-ASCII (was: Re: Invalid UTF-8 sequences) Yes, of course it was a joke. Rest assured, if I perceive any k

RE: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-09 Thread Arcane Jill
next time. :-) Oh, and thanks for the interesting historical character set info. Jill -Original Message- From: Doug Ewell [mailto:[EMAIL PROTECTED] Sent: 09 December 2004 16:28 To: Unicode Mailing List Cc: Arcane Jill Subject: US-ASCII (was: Re: Invalid UTF-8 sequences) I hope that's j

US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-09 Thread Doug Ewell
Arcane Jill wrote: > [OFF TOPIC] Why do so many people call it "US ASCII" anyway? Since > "ASCII" comprises that subset of Unicode from U+0000 to U+007F, it is > not clear to me in what way "US-ASCII" is different from ASCII. It's > bad enough for us non-Americans that the A in ASCII already stan

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Philippe Verdy
From: "Antoine Leca" <[EMAIL PROTECTED]> Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constraints are much heavier.) Also Windows NT Unicode applications know it,

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Arcane Jill
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Antoine Leca Sent: 09 December 2004 11:29 To: Unicode Mailing List Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF) Windows filesystems do know what encoding they use. Err, not really. MS-DOS *need to

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Antoine Leca
On Monday, December 6th, 2004 20:52Z John Cowan va escriure: > Doug Ewell scripsit: > >>> Now suppose you have a UNIX filesystem, containing filenames in a >>> legacy encoding (possibly even more than one). If one wants to >>> switch to UTF-8 filenames, what is one supposed to do? Convert all >>>

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Kenneth Whistler
Lars responded: > > ... Whatever the solutions > > for representation of corrupt data bytes or uninterpreted data > > bytes on conversion to Unicode may be, that is irrelevant to the > > concerns on whether an application is using UTF-8 or UTF-16 > > or UTF-32. > The important fact is that if you

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread John Cowan
Kenneth Whistler scripsit: > A Sybase ASE database has the same behavior running on Windows as > running on Sun Solaris or Linux, for that matter. Fair enough. > UNIX filenames are just one instance of this. However, although they are *technically* octet sequences, they are *functionally* char

Re: Invalid UTF-8 sequences

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > Quite close. Except for the fact that: > * U+EE93 is represented in UTF-32 as 0xEE93 > * U+EE93 is represented in UTF-16 as 0xEE93 > * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93) Then it would be impossible to represent sequences li
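
For reference, the byte values quoted in this exchange can be checked against a standard encoder. The sketch below is not from the thread; it uses Python's built-in codecs, and the helper name `decodes_as_utf8` is made up for illustration. It confirms that conformant UTF-8 encodes U+EE93 as EE BA 93, and that a bare 0x93 byte is rejected by a strict UTF-8 decoder.

```python
# Encode the Private Use Area code point U+EE93 with Python's standard codecs.
ch = "\uee93"

utf8 = ch.encode("utf-8")       # three bytes in conformant UTF-8
utf16 = ch.encode("utf-16-be")  # one 16-bit code unit, big-endian, no BOM
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit, big-endian, no BOM


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is accepted by a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8.hex(" "), "|", utf16.hex(" "), "|", utf32.hex(" "))
print("bare 0x93 is valid UTF-8:", decodes_as_utf8(b"\x93"))
```

A lone 0x93 is a continuation byte with no lead byte, which is why a strict decoder refuses it.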

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Kenneth Whistler
John Cowan responded: > > Storage of UNIX filenames on Windows databases, for example, O.k., I just quoted this back from the original email, but it really is a complete misconception of the issue for databases. "Windows databases" is a misn

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
Kenneth Whistler wrote: > I'm going to step in here, because this argument seems to > be generating more heat than light. I agree, and I thank you for that. > First, I'm going to summarize what I think Lars Kristan

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
> Needless to say, these systems were badly designed at their > origin, and > newer filesystems (and OS APIs) offer much better > alternative, by either > storing explicitly on volumes which encoding it uses, or by

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
Doug Ewell wrote: > How do file names work when the user changes from one SBCS to another > (let's ignore UTF-8 for now) where the interpretation is > different?  For > example, byte C3 is U+00C3, A with tilde (Ã) in I

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Lars Kristan wrote: > I never said it doesn't violate any existing rules. Stating that it > does, doesn't help a bit. Rules can be changed. Assuming we understand > the consequences. And that is what we should be discussing

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
ripping > of byte values uninterpretable as characters to be converted, and > is asking for standard Unicode values for this purpose, instead. If I understand correctly, he is using these PUA values when the data is in UTF-16, and using bare high-bit bytes (i.e. invalid UTF-8 sequences) when

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Philippe Verdy wrote: > An alternative can then be a mixed encoding selection: > - choose a legacy encoding that will most often be able to represent > valid filenames without loss of information (for example ISO-8859-1, > or Cp1252). > - encode the filename with it. > - try to decode it with a *

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread John Cowan
Kenneth Whistler scripsit: > Storage of UNIX filenames on Windows databases, for example, > can be done with BINARY fields, which correctly capture the > identity of them as what they are: an unconvertible array of > byte values, not a convertible string in some particular > code page. This solut

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Kenneth Whistler
Lars, I'm going to step in here, because this argument seems to be generating more heat than light. > I never said it doesn't violate any existing rules. Stating that it does, > doesn't help a bit. Rules can be changed. > I ask you to step back and try to see the big picture. First, I'm going

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Philippe Verdy
I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded and, depending on local user preferences when that user created that

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Lars Kristan
Doug Ewell replied: > Actually the Unicode Technical Committee.  But you are > correct: it is up > to the UTC to decide whether they want to redefine UTF-8 to permit > invalid sequences, which are to be interprete

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Lars Kristan
Doug Ewell wrote: > John Cowan wrote: > > > Windows filesystems do know what encoding they use.  But a > filename on > > a Unix(oid) file system is a mere sequence of octets, of > which only 00 &g

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread Doug Ewell
John Cowan wrote: > Windows filesystems do know what encoding they use. But a filename on > a Unix(oid) file system is a mere sequence of octets, of which only 00 > and 2F are interpreted. (Filenames containing 20, and especially 0A, > are annoying to handle with standard tools, but not illegal
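
The "mere sequence of octets" point can be made concrete with a small sketch. This is not from the thread: the function names are hypothetical, and the `surrogateescape` error handler is a Python mechanism that postdates this discussion, shown only as one way of carrying undecodable filename bytes through a text API and back. The special names `.` and `..` are ignored here.

```python
def is_legal_component(name: bytes) -> bool:
    """A Unix path component may hold any bytes except 0x00 (NUL) and 0x2F ('/')."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


def to_text(name: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte to a lone surrogate."""
    return name.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: lone surrogates become the original bytes again."""
    return text.encode("utf-8", errors="surrogateescape")


legacy = b"caf\xe9.txt"          # 0xE9 is 'e acute' in ISO 8859-1, invalid as UTF-8
text = to_text(legacy)
print(is_legal_component(legacy), ascii(text), to_bytes(text) == legacy)
```

The round trip works only because the escaped form is kept inside the program. A string containing lone surrogates is not valid Unicode text for interchange.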

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread John Cowan
Doug Ewell scripsit: > > Now suppose you have a UNIX filesystem, containing filenames in a > > legacy encoding (possibly even more than one). If one wants to switch > > to UTF-8 filenames, what is one supposed to do? Convert all filenames > > to UTF-8? > > Well, yes. Doesn't the file system dict

Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread Doug Ewell
same questions I need to ask you for the process of > converting the data. You are saying that data that cannot be converted > should not be converted. Go back and read that last sentence again, please. Done? Yes, that is what I am saying. > Then what if THIS process is not interactive?

Invalid UTF-8 sequences (was RE: Unicode Search Engines)

2002-01-29 Thread Lars Kristan
ASCII, but the neighboring text could be in an unknown codeset. Simply including a portion of that file into an html file marked as UTF-8 can obviously result in invalid UTF-8 sequences. That is somewhat bad in itself, and it gets even worse if for some reason that file is converted from UTF-8 to U
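
The situation described here, text in another codeset pasted into a file labelled UTF-8, can be reproduced in a few lines. This is an illustrative sketch only: the sample markup, the choice of ISO 8859-1 for the neighboring text, and the helper name `first_invalid_offset` are assumptions, not details from the message.

```python
from typing import Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc.start


ascii_part = b"<p>menu: "
legacy_part = "caf\u00e9".encode("latin-1")   # assumed legacy codeset: ISO 8859-1
page = ascii_part + legacy_part + b"</p>"     # served as if it were UTF-8

offset = first_invalid_offset(page)
lossy = page.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
print(offset, ascii(lossy))
```

Decoding with `errors="replace"` keeps the page readable but discards the original byte, so a later conversion back to the legacy codeset cannot recover it.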