Fw: AW: Fwd: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-13 Thread Mark Davis
- Original Message - From: Mark Davis [EMAIL PROTECTED] To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Wednesday, June 06, 2001 07:51 Subject: Re: AW: Fwd: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) I agree with Peter that Mori's argument on this point doesn't hold water. We can

Fw: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-13 Thread Mark Davis
] To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Tuesday, June 05, 2001 14:12 Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) I am not an advocate of UTF-8s -- I am just trying to dispel some of the noise here. I have some specific answers below, but in general: 1. Strict means according

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-06 Thread Carl W. Brown
PROTECTED]]On Behalf Of Carl W. Brown Sent: Tuesday, June 05, 2001 11:09 AM To: [EMAIL PROTECTED] Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8) Mark, Now I understand. If they implement a UTF-16 strcmp function that is a case sensitive version of a UTF-16 strcasecmp(stricmp) you will get the same

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-06 Thread Mark Davis
Thanks. That's Markus's invention. Mark - Original Message - From: Carl W. Brown [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, June 06, 2001 11:08 Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8) Mark, I like the clever ICU technique for sorting in code point order

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Mark Davis
] To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Monday, June 04, 2001 09:44 Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) From: Mark Davis [EMAIL PROTECTED] 2. Auto-detection does not particularly favor one side or the other. UTF-8

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Misha . Wolf
On 05/06/2001 13:03:03 Marco Cimarosti wrote: [...] But how should this 6-byte sequence be interpreted by a standard UTF-8 decoder? Does it become one or two code points? That depends on where the decoder is. If it's inside an XML parser, then it becomes neither of the above, but rather a

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Mark Davis
One. I put samples on: http://www.macchiato.com/utc/samples_of_utf8.htm Mark - Original Message - From: Marco Cimarosti [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: 'Mark Davis' [EMAIL PROTECTED] Sent: Tuesday, June 05, 2001 05:03 Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Mark Davis
: 'Mark Davis' [EMAIL PROTECTED] Sent: Tuesday, June 05, 2001 05:03 Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8) Mark Davis wrote: - I am well aware that one can accept 6-byte supplementary characters on input in UTF-8. (Did you really think I wasn't?) (O, no, I know you knew

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Mark Davis
Sorry, didn't mean to put words in your mouth. Mark - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Monday, June 04, 2001 21:51 Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) On 06/04/2001 10:58:22 PM Mark Davis wrote: - Peter was saying that you

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Peter_Constable
No problem. I just wanted to give credit where it was due. Peter Sorry, didn't mean to put words in your mouth. Mark - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Monday, June 04, 2001 21:51 Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) On 06/04

UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread Marco Cimarosti
Mark Davis wrote: It is either one code point (lenient parser) or an error (strict parser). It is never two. I am a little bit confused. I re-read conformance rules and the UTF-8 Corrigendum, and I could find these two things: 1) The difference between lenient vs. strict parsers. 2) The
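
A minimal Python sketch of the "one code point or an error, never two" point quoted above. It assumes the 6-byte sequence under discussion is a surrogate pair with each half encoded as three bytes; the names `SIX_BYTES`, `decode_strict` and `decode_lenient` are hypothetical and do not come from the thread. Python's built-in decoder stands in for the strict parser, and the lenient path is only one possible way to pair the halves.

```python
# Hypothetical sample: U+10000 written as the surrogates D800 DC00,
# each encoded as a 3-byte sequence (the 6-byte form discussed above).
SIX_BYTES = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])


def decode_strict(data: bytes) -> str:
    # Python's standard UTF-8 decoder rejects encoded surrogates,
    # so this raises UnicodeDecodeError for SIX_BYTES.
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    # Let the surrogate halves through, then pair them up by
    # round-tripping through UTF-16. The result is one code point.
    halves = data.decode("utf-8", errors="surrogatepass")
    return halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    text = decode_lenient(SIX_BYTES)
    print(len(text), hex(ord(text)))
    try:
        decode_strict(SIX_BYTES)
    except UnicodeDecodeError as exc:
        print("strict decoder rejected the input:", exc.reason)
```

Neither path yields two separate code points: the lenient one returns a single supplementary character and the strict one fails.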

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Carl W. Brown
Message- From: Mark Davis [mailto:[EMAIL PROTECTED]] Sent: Monday, June 04, 2001 9:23 PM To: Carl W. Brown; [EMAIL PROTECTED] Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) Nobody has ever proposed binary compares between UTF-8 and UTF-16 strings. The scenario is: Client software uses

Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread Peter_Constable
I am a little bit confused. I re-read conformance rules and the UTF-8 Corrigendum, and I could find these two things: 1) The difference between lenient vs. strict parsers. That has to do with XML conformance, not Unicode. You were looking in the wrong spec. 2) The rule that a UTF-8 sequence

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Peter_Constable
Note that there has already been rather violent negative reaction from the W3C side against the idea of supporting any such change here, whether the UTC accepted a change or not. If it is the eventual goal of these people to submit a UTC-approved UTF-8 variant then they should consider this fact

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Peter_Constable
On 06/05/2001 09:30:00 AM Mark Davis wrote: I put samples on: http://www.macchiato.com/utc/samples_of_utf8.htm One thing doesn't make sense here: you have strict under UTF-8s. Strict in relation to what? Strict in relation to UTF-8 had to do with XML. Under XML, *all* the entries under the

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Carl W. Brown
- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Marco Cimarosti Sent: Tuesday, June 05, 2001 4:46 AM To: '[EMAIL PROTECTED]' Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8) Mark Davis wrote: The scenario is: Client software uses UTF-16. Database software uses UTF-8s

Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread John Cowan
[EMAIL PROTECTED] scripsit: XML requires (recommends?) data to be normalised in normal form C. That imposes private (well, open actually, but private in the sense of limited to that protocol) constraints against otherwise legal Unicode character sequences. Currently it neither requires nor

Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread Mark Davis
Subject: Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)) I am a little bit confused. I re-read conformance rules and the UTF-8 Corrigendum, and I could find these two things: 1) The difference between lenient vs. strict parsers. That has to do with XML conformance, not Unicode. You

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Mark Davis
PROTECTED] Sent: Tuesday, June 05, 2001 11:05 Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) On 06/05/2001 09:30:00 AM Mark Davis wrote: I put samples on: http://www.macchiato.com/utc/samples_of_utf8.htm One thing doesn't make sense here: you have strict under UTF-8s. Strict

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Markus Scherer
Personally, I find it interesting to see which and how many characters are affected by the difference in binary ordering between UTF-8 and UTF-16. Affected are all code points in two ranges: U+e000..U+ffff U+10000..U+10ffff The second range contains assignments for characters that are
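
The ordering difference can be seen with two sample characters. This is an illustrative sketch only: the characters chosen and the helper name `utf16_code_units` are assumptions, not taken from the message. It compares a BMP character above the surrogate block with a supplementary character, first as UTF-8 bytes and then as UTF-16 code units.

```python
def utf16_code_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s as integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\uff5e"            # BMP code point above the surrogate block
supplementary = "\U00010000"   # first supplementary code point

# UTF-8 bytes: EF BD 9E versus F0 90 80 80, so the BMP character sorts first.
utf8_bmp_first = bmp_char.encode("utf-8") < supplementary.encode("utf-8")

# UTF-16 code units: FF5E versus D800 DC00, so the BMP character sorts last.
utf16_bmp_first = utf16_code_units(bmp_char) < utf16_code_units(supplementary)

if __name__ == "__main__":
    print("UTF-8 puts the BMP character first:", utf8_bmp_first)
    print("UTF-16 puts the BMP character first:", utf16_bmp_first)
```

Binary order on UTF-8 matches code point order, while binary order on UTF-16 code units places the surrogate pair (starting with D800) below any code unit from E000 upward.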

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Peter_Constable
On 06/05/2001 04:12:59 PM Mark Davis wrote: I am not an advocate of UTF-8s -- I am just trying to dispel some of the noise here. I realise that, and wasn't meaning to suggest that I think *you* are taking the wrong position. I do appreciate the comments you have made, which have been, it

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread DougEwell2
In a message dated 2001-06-05 14:24:38 Pacific Daylight Time, [EMAIL PROTECTED] writes: For me, that would be the one positive for defining UTF-8S: we could then tighten up the definition of UTF-8 to require it to exclude 6-byte forms on input. You could then have: UTF-8: only emits

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread DougEwell2
In a message dated 2001-06-03 18:04:17 Pacific Daylight Time, [EMAIL PROTECTED] writes: It would seem to me that there's another issue that has to be taken into consideration here: normalisation. You can't just do a simple sort using raw binary comparison; you have to normalise strings

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Peter_Constable
On 06/04/2001 02:10:35 AM Doug Ewell wrote: While we are at it, here's another argument against the existence of both UTF-8 and this new UTF-8s. Recently there was a discussion about the use of the U+FEFF signature in UTF-8 files, with a fair number of Unicode experts arguing against its

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Michael \(michka\) Kaplan
From: Mark Davis [EMAIL PROTECTED] 2. Auto-detection does not particularly favor one side or the other. UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a supplementary character expressed with two 3-byte values, you know you do not have pure UTF-8. If you ever encounter
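
A rough Python sketch of the auto-detection idea quoted above, under stated assumptions: the input is otherwise well formed, a supplementary character in UTF-8s appears as 3-byte sequences starting with ED followed by A0..BF, and in UTF-8 it appears as a 4-byte sequence with a lead byte F0..F4. The function name `classify` and its return labels are hypothetical.

```python
def classify(data: bytes) -> str:
    """Guess which form a well-formed byte string uses for supplementary characters.

    Returns "utf-8", "utf-8s", "either" (no supplementary characters seen),
    or "mixed" (both forms present).
    """
    # 4-byte lead bytes occur only in the standard UTF-8 form.
    has_four_byte = any(0xF0 <= b <= 0xF4 for b in data)
    # ED A0..BF is the 3-byte encoding of a surrogate code unit.
    has_surrogate_form = any(
        data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF
        for i in range(len(data) - 1)
    )
    if has_four_byte and has_surrogate_form:
        return "mixed"
    if has_four_byte:
        return "utf-8"
    if has_surrogate_form:
        return "utf-8s"
    return "either"


if __name__ == "__main__":
    standard = "\U00010000".encode("utf-8")
    surrogate_form = "\ud800\udc00".encode("utf-8", errors="surrogatepass")
    print(classify(standard), classify(surrogate_form), classify(b"plain ASCII"))
```

Text with no supplementary characters is byte-for-byte identical in both forms, so the check can only return "either" for it.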

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Michael \(michka\) Kaplan
From: [EMAIL PROTECTED] On 06/04/2001 02:10:35 AM Doug Ewell wrote: While we are at it, here's another argument against the existence of both UTF-8 and this new UTF-8s. Recently there was a discussion about the use of the U+FEFF signature in UTF-8 files, with a fair number of Unicode

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Mark Davis
: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Monday, June 04, 2001 00:10 Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) In a message dated 2001-06-03 18:04:17 Pacific Daylight Time, [EMAIL PROTECTED] writes: It would seem to me that there's another issue

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Carl W. Brown
PROTECTED]]On Behalf Of Mark Davis Sent: Monday, June 04, 2001 8:47 AM To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8) I am not, myself, in favor of UTF-8s. However, I do want to point out a few things. 1) Normalization does

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Peter_Constable
On 06/04/2001 10:47:20 AM Mark Davis wrote: The best practice for that case is to enforce normalization on data fields *when the text is inserted in the field* . If one does, then canonical equivalents will compare as equal, whether they are encoded in UTF-8, UTF-8s, or UTF-16 (or, for that
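
A small Python sketch of normalizing on insertion, as described in the quoted passage. It assumes NFC as the chosen normalization form, which the snippet does not specify; the function name `store_value` is hypothetical and stands in for whatever writes a value into a data field.

```python
import unicodedata


def store_value(text: str) -> str:
    # Normalize once, at the point where text enters the field.
    # NFC is an assumption here; the quoted message names no form.
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"       # e with acute as a single code point
decomposed = "e\u0301"    # e followed by a combining acute accent

stored_a = store_value(composed)
stored_b = store_value(decomposed)

if __name__ == "__main__":
    # After normalization the stored values are identical, so a binary
    # comparison agrees in any encoding of the stored text.
    for encoding in ("utf-8", "utf-16-be"):
        print(encoding, stored_a.encode(encoding) == stored_b.encode(encoding))
```

Without the normalization step the two inputs differ in every encoding, which is why the comparison cannot be fixed by the choice of encoding form alone.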

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Mark Davis
PROTECTED] To: [EMAIL PROTECTED] Sent: Monday, June 04, 2001 12:55 Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8) Mark, I think that I am missing some point. From what I hear the issue is that they want a way to support identical compares. This order is not important. What is important

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Michael \(michka\) Kaplan
From: Misha Wolf [EMAIL PROTECTED] Let's be careful with the word legal. The strange (per-)version of UTF-8 which re-encodes UTF-16 is legal input as far as The Unicode Standard is concerned. It is, however, totally illegal as far as the IETF, the Internet, the W3C, the WWW, XML, and HTML

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-04 Thread Michael \(michka\) Kaplan
From: Marco Cimarosti [EMAIL PROTECTED] No, please, let's not make waters more muddied than they already are. Let's keep on calling Oracle's proposal UTF-8S, as there is no point in finding a cuter name for it. Fair enough. Wrong point! Perhaps it will not hurt applications which read text

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-03 Thread Peter_Constable
One more thought on this topic: the issue has to do with comparing the results of sorting two data sources. It would seem to me that there's another issue that has to be taken into consideration here: normalisation. You can't just do a simple sort using raw binary comparison; you have to

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-01 Thread Bill Kurmey
Kenneth Whistler wrote: Plane 14 PUA usage description tags? Naaah, nobody would suggest such a bizarre thing, would they? Marco Cimarosti wrote: The three words PUA usage description are redundant, methinks. Removing them leaves a more concise and dramatic example of a weird proposal.

Silliness (was RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-01 Thread Edward Cherlin
At 4:44 AM -0600 6/1/01, Bill Kurmey wrote: Kenneth Whistler wrote: Plane 14 PUA usage description tags? Naaah, nobody would suggest such a bizarre thing, would they? Marco Cimarosti wrote: The three words PUA usage description are redundant, methinks. Removing them leaves a more concise and

RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-05-31 Thread Marco Cimarosti
Kenneth Whistler wrote: Plane 14 PUA usage description tags? Naaah, nobody would suggest such a bizarre thing, would they? The three words PUA usage description are redundant, methinks. Removing them leaves a more concise and dramatic example of a weird proposal. _ Marco

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-05-30 Thread DougEwell2
In a message dated 2001-05-29 12:42:38 Pacific Daylight Time, [EMAIL PROTECTED] writes: How long is the past? I remember reading about these surrogates the first time I put my hands on a draft copy of ISO 10646. It was nearly six years ago. Surrogate range was defined there but no

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-05-30 Thread DougEwell2
In a message dated 2001-05-29 11:20:48 Pacific Daylight Time, [EMAIL PROTECTED] writes: The point is that while the UTC did not endorse this proposal as of May 23, 2001, the pressure to create a UTF-8S is rising, and there is no guarantee that the UTC will not sway to such support in

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-05-30 Thread Peter_Constable
On 05/30/2001 01:31:17 AM Doug Ewell wrote: I hate to say it, but this is really damaging my faith in the standardization process... I don't (and shouldn't) have the ability to pressure the UTC to approve a new encoding form to make up for my inability to conform to the existing ones, and

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-05-30 Thread Kenneth Whistler
Doug Ewell wrote: The proponents of UTF-8S are vigorously and actively campaigning for their proposal. In standardization committees, proposals that have committed, active proponents who can aim for the long haul, often have a way of getting adopted in one form or another, unless

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-05-28 Thread DougEwell2
In a message dated 2001-05-28 13:56:50 Pacific Daylight Time, [EMAIL PROTECTED] writes: The problem with databases is that you have to have a locale independent sorting sequence. If you store a record with a key built with one locale, you might not be able to retrieve it using another