- Original Message -
From: Mark Davis [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, June 06, 2001 07:51
Subject: Re: AW: Fwd: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
I agree with Peter that Mori's argument on this point doesn't hold water.
We can
]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, June 05, 2001 14:12
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
I am not an advocate of UTF-8s -- I am just trying to dispel some of the
noise here. I have some specific answers below, but in general:
1. Strict means according
PROTECTED]]On
Behalf Of Carl W. Brown
Sent: Tuesday, June 05, 2001 11:09 AM
To: [EMAIL PROTECTED]
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
Mark,
Now I understand.
If they implement a UTF-16 strcmp function that is a case sensitive version
of a UTF-16 strcasecmp(stricmp) you will get the same
Thanks. That's Markus's invention.
Mark
- Original Message -
From: Carl W. Brown [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, June 06, 2001 11:08
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
Mark,
I like the clever ICU technique for sorting in code point order
]
To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, June 04, 2001 09:44
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
From: Mark Davis [EMAIL PROTECTED]
2. Auto-detection does not particularly favor one side or the other.
UTF-8
On 05/06/2001 13:03:03 Marco Cimarosti wrote:
[...]
But how should this 6-byte sequence be interpreted by a standard UTF-8
decoder? Does it become one or two code points?
That depends on where the decoder is. If it's inside an XML
parser, then it becomes neither of the above, but rather a
One.
I put samples on:
http://www.macchiato.com/utc/samples_of_utf8.htm
Mark
- Original Message -
From: Marco Cimarosti [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: 'Mark Davis' [EMAIL PROTECTED]
Sent: Tuesday, June 05, 2001 05:03
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8
: 'Mark Davis' [EMAIL PROTECTED]
Sent: Tuesday, June 05, 2001 05:03
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
Mark Davis wrote:
- I am well aware that one can accept 6-byte supplementary
characters on
input in UTF-8. (Did you really think I wasn't?)
(O, no, I know you knew
Sorry, didn't mean to put words in your mouth.
Mark
- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, June 04, 2001 21:51
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
On 06/04/2001 10:58:22 PM Mark Davis wrote:
- Peter was saying that you
No problem. I just wanted to give credit where it was due.
Peter
Sorry, didn't mean to put words in your mouth.
Mark
- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, June 04, 2001 21:51
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
On 06/04
Mark Davis wrote:
It is either one code point (lenient parser) or an error
(strict parser). It is never two.
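To make the "one code point or an error, never two" point concrete, here is a minimal Python sketch. The names decode_strict and decode_lenient are hypothetical and exist only for this illustration; the lenient path is just one possible way to fuse a surrogate pair, not a reference implementation of any particular parser.

```python
def decode_strict(data: bytes) -> str:
    # Python's built-in UTF-8 decoder rejects 3-byte encodings of surrogates,
    # so the 6-byte form raises UnicodeDecodeError here.
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    # Hypothetical lenient decoder: let encoded surrogates through, then
    # round-trip via UTF-16 so each high/low pair fuses into one code point.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    return with_surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# U+D800 and U+DC00, each written as its own 3-byte sequence (6 bytes total).
six_bytes = b"\xed\xa0\x80\xed\xb0\x80"

lenient_result = decode_lenient(six_bytes)  # a single code point, U+10000

try:
    decode_strict(six_bytes)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Either outcome is possible depending on the decoder; what this sketch never returns is a two-code-point string.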
I am a little bit confused. I re-read conformance rules and the UTF-8
Corrigendum, and I could find these two things:
1) The difference between lenient vs. strict parsers.
2) The
Message-
From: Mark Davis [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 04, 2001 9:23 PM
To: Carl W. Brown; [EMAIL PROTECTED]
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
Nobody has ever proposed binary compares between UTF-8 and UTF-16 strings.
The scenario is:
Client software uses
I am a little bit confused. I re-read conformance rules and the UTF-8
Corrigendum, and I could find these two things:
1) The difference between lenient vs. strict parsers.
That has to do with XML conformance, not Unicode. You were looking in the
wrong spec.
2) The rule that an UTF-8 sequence
Note that there has already been rather violent negative reaction from the
W3C side against the idea of supporting any such change here, whether the
UTC accepted a change or not. If it is the eventual goal of these people to
submit a UTC-approved UTF-8 variant then they should consider this fact
On 06/05/2001 09:30:00 AM Mark Davis wrote:
I put samples on:
http://www.macchiato.com/utc/samples_of_utf8.htm
One thing doesn't make sense here: you have strict under UTF-8s. Strict
in relation to what? Strict in relation to UTF-8 had to do with XML. Under
XML, *all* the entries under the
-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Marco Cimarosti
Sent: Tuesday, June 05, 2001 4:46 AM
To: '[EMAIL PROTECTED]'
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
Mark Davis wrote:
The scenario is:
Client software uses UTF-16.
Database software uses UTF-8s
[EMAIL PROTECTED] scripsit:
XML requires (recommends?) data to be
normalised in normal form C. That imposes private (well, open actually, but
private in the sense of limited to that protocol) constraints against
otherwise legal Unicode character sequences.
Currently it neither requires nor
Subject: Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))
I am a little bit confused. I re-read conformance rules and the UTF-8
Corrigendum, and I could find these two things:
1) The difference between lenient vs. strict parsers.
That has to do with XML conformance, not Unicode. You
PROTECTED]
Sent: Tuesday, June 05, 2001 11:05
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
On 06/05/2001 09:30:00 AM Mark Davis wrote:
I put samples on:
http://www.macchiato.com/utc/samples_of_utf8.htm
One thing doesn't make sense here: you have strict under UTF-8s. Strict
Personally, I find it interesting to see which and how many characters are affected by
the difference in binary ordering between UTF-8 and UTF-16.
Affected are all code points in two ranges:
U+e000..U+
U+1..U+10
The second range contains assignments for characters that are
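The ordering difference can be seen with two sample strings. This is a small Python sketch; the two code points (U+E000 and U+10000) are chosen here purely for illustration and are not taken from the message above, and the helper names utf8_key and utf16_key are hypothetical.

```python
def utf8_key(s: str) -> bytes:
    # Binary sort key: the UTF-8 bytes of the string.
    return s.encode("utf-8")


def utf16_key(s: str) -> bytes:
    # Binary sort key: the big-endian UTF-16 code units of the string.
    return s.encode("utf-16-be")


bmp_high = "\ue000"            # BMP code point above the surrogate range
supplementary = "\U00010000"   # first supplementary code point

samples = [supplementary, bmp_high]

by_utf8 = sorted(samples, key=utf8_key)    # EE 80 80 sorts before F0 90 80 80
by_utf16 = sorted(samples, key=utf16_key)  # D800 DC00 sorts before E000
```

With these inputs the two sorts disagree: UTF-8 byte order follows code point order, while UTF-16 code unit order places the surrogate pair first.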
On 06/05/2001 04:12:59 PM Mark Davis wrote:
I am not an advocate of UTF-8s -- I am just trying to dispel some of the
noise here.
I realise that, and wasn't meaning to suggest that I think *you* are taking
the wrong position. I do appreciate the comments you have made, which have
been, it
In a message dated 2001-06-05 14:24:38 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
For me, that would be the one positive for defining UTF-8S: we could then
tighten up the definition of UTF-8 to require it to exclude 6-byte forms on
input. You could then have:
UTF-8: only emits
In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
It would seem to me that there's
another issue that has to be taken into consideration here: normalisation.
You can't just do a simple sort using raw binary comparison; you have to
normalise strings
On 06/04/2001 02:10:35 AM Doug Ewell wrote:
While we are at it, here's another argument against the existence of both
UTF-8 and this new UTF-8s. Recently there was a discussion about the use of
the U+FEFF signature in UTF-8 files, with a fair number of Unicode experts
arguing against its
From: Mark Davis [EMAIL PROTECTED]
2. Auto-detection does not particularly favor one side or the other.
UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
supplementary character expressed with two 3-byte values, you know you do
not have pure UTF-8. If you ever encounter
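A rough Python sketch of the detection idea: scan for either a 4-byte lead byte or a 3-byte sequence that encodes a surrogate. The function name classify and its return labels are hypothetical, and the sketch assumes the input is otherwise well-formed, so it is an illustration rather than a validator.

```python
def classify(data: bytes) -> str:
    # Assumes otherwise well-formed input, where 0xED and 0xF0..0xF4 can only
    # appear as lead bytes (continuation bytes are always 0x80..0xBF).
    has_four_byte = any(0xF0 <= b <= 0xF4 for b in data)
    # A surrogate (U+D800..U+DFFF) in 3-byte form is ED followed by A0..BF.
    has_surrogate = any(
        data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF
        for i in range(len(data) - 1)
    )
    if has_four_byte and has_surrogate:
        return "mixed"
    if has_four_byte:
        return "utf-8"
    if has_surrogate:
        return "utf-8s"
    # No supplementary characters at all: the two forms are byte-identical.
    return "either"


four_byte_form = "\U00010000".encode("utf-8")   # F0 90 80 80
six_byte_form = b"\xed\xa0\x80\xed\xb0\x80"      # surrogate pair, 3 bytes each
```

Text with no supplementary characters gives no signal either way, which is why the sketch returns "either" for it.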
From: [EMAIL PROTECTED]
On 06/04/2001 02:10:35 AM Doug Ewell wrote:
While we are at it, here's another argument against the existence of both
UTF-8 and this new UTF-8s. Recently there was a discussion about the use of
the U+FEFF signature in UTF-8 files, with a fair number of Unicode
: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, June 04, 2001 00:10
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
It would seem to me that there's
another issue
PROTECTED]]On
Behalf Of Mark Davis
Sent: Monday, June 04, 2001 8:47 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
I am not, myself, in favor of UTF-8s. However, I do want to point out a few
things.
1) Normalization does
On 06/04/2001 10:47:20 AM Mark Davis wrote:
The best practice for that case is to enforce normalization on data fields
*when the text is inserted in the field* . If one does, then canonical
equivalents will compare as equal, whether they are encoded in UTF-8,
UTF-8s, or UTF-16 (or, for that
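A small Python sketch of normalizing at insertion time, using the standard unicodedata module. The function name store_field is hypothetical, and NFC is picked here only as an example form; the message above does not say which normalization form to use.

```python
import unicodedata


def store_field(text: str) -> str:
    # Hypothetical insert hook: normalize once, when the text enters the field.
    # NFC is an assumption for this example.
    return unicodedata.normalize("NFC", text)


precomposed = "\u00e9"   # e with acute as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

stored_a = store_field(precomposed)
stored_b = store_field(decomposed)

# After normalization the canonical equivalents are identical strings,
# so their bytes match in whichever encoding form is used for storage.
same_in_utf8 = stored_a.encode("utf-8") == stored_b.encode("utf-8")
same_in_utf16 = stored_a.encode("utf-16-be") == stored_b.encode("utf-16-be")
```

Without the normalization step, the two raw inputs differ in every encoding, so a binary compare would treat them as different keys.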
PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, June 04, 2001 12:55
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
Mark,
I think that I am missing some point. From what I hear the issue is that
they want a way to support identical compares. This order is not
important.
What is important
From: Misha Wolf [EMAIL PROTECTED]
Let's be careful with the word legal. The strange (per-)version of
UTF-8 which re-encodes UTF-16 is legal input as far as The Unicode
Standard is concerned. It is, however, totally illegal as far as the
IETF, the Internet, the W3C, the WWW, XML, and HTML
From: Marco Cimarosti [EMAIL PROTECTED]
No, please, let's not make waters more muddied than they already are. Let's
keep on calling Oracle's proposal UTF-8S, as there is no point in finding
a cuter name for it.
Fair enough.
Wrong point! Perhaps it will not hurt applications which read text
One more thought on this topic: the issue has to do with comparing the
results of sorting two data sources. It would seem to me that there's
another issue that has to be taken into consideration here: normalisation.
You can't just do a simple sort using raw binary comparison; you have to
Kenneth Whistler wrote:
Plane 14 PUA usage description tags? Naaah, nobody would suggest such
a bizarre thing, would they?
Marco Cimarosti wrote:
The three words PUA usage description are redundant, methinks. Removing
them leaves a more concise and dramatic example of a weird proposal.
At 4:44 AM -0600 6/1/01, Bill Kurmey wrote:
Kenneth Whistler wrote:
Plane 14 PUA usage description tags? Naaah, nobody would suggest such
a bizarre thing, would they?
Marco Cimarosti wrote:
The three words PUA usage description are redundant, methinks. Removing
them leaves a more concise and
Kenneth Whistler wrote:
Plane 14 PUA usage description tags? Naaah, nobody would suggest such
a bizarre thing, would they?
The three words PUA usage description are redundant, methinks. Removing
them leaves a more concise and dramatic example of a weird proposal.
_ Marco
In a message dated 2001-05-29 12:42:38 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
How long is the past? I remember reading about these surrogates the first
time I put my hands on a draft copy of ISO 10646. It was nearly six years ago.
Surrogate range was defined there but no
In a message dated 2001-05-29 11:20:48 Pacific Daylight Time, [EMAIL PROTECTED]
writes:
The point is that while the UTC did not endorse this proposal as
of May 23, 2001, the pressure to create a UTF-8S is rising, and there
is no guarantee that the UTC will not sway to such support in
On 05/30/2001 01:31:17 AM Doug Ewell wrote:
I hate to say it, but this is really damaging my faith in the standardization
process...
I don't (and shouldn't) have the ability to pressure the UTC to approve a new
encoding form to make up for my inability to conform to the existing ones, and
Doug Ewell wrote:
The proponents of UTF-8S are
vigorously and actively campaigning for their proposal. In
standardization committees, proposals that have committed, active
proponents who can aim for the long haul, often have a way of getting
adopted in one form or another, unless
In a message dated 2001-05-28 13:56:50 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
The problem with databases is that you have to have a locale independent
sorting sequence. If you store a record with a key built with one locale,
you might not be able to retrieve it using another