- Original Message -
From: "Mark Davis" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, June 11, 2001 08:06
Subject: Re: UTF-8 Syntax
> I would rank the situations, from an interpreter's point of view, in the
> followi
Marco Cimarosti <[EMAIL PROTECTED]> wrote:
>My assumption was that, in the first case (no sort order requested by the
>client), a server could in theory provide a result set randomly shuffled. Of
>course, I know that this won't normally happen but, however, the server is
>allowed to provide whate
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of [EMAIL PROTECTED]
Sent: Monday, June 11, 2001 2:33 AM
To: [EMAIL PROTECTED]
Subject: RE: UTF-8 Syntax
>Carl W. Brown <[EMAIL PROTECTED]> wrote:
>>In the case of strcmp the problem is
[EMAIL PROTECTED] wrote:
>
> Carl W. Brown <[EMAIL PROTECTED]> wrote:
> >In the case of strcmp the problem is that this won't even work on UCS-2.
> >It detects the end of string with a single byte 0x00. You have to use a
> >special Unicode compare routine; this routine needs to be fixed to produc
Jianping Yang wrote:
>
> [UTF-8S] will fix the following problem for example:
> For a search engine searching for the character U-0001 in a UTF-8 string, it
> could not find it. But when the UTF-8 is converted into UTF-16, it can find it there
> because and are converted into U-0001000 in UTF-
>Yes, all these uses may be internal to each
>vendor, but as Uma has stated, internal representation leaks out. If any
>significant number of vendors are going to be using this encoding
>internally in their systems, wouldn't it make sense to have a UTR
>describing what this representation is, wh
Toby Phipps wrote:
> This is incredibly inefficient, not only because significant amounts of
> temporary space need to be allocated and freed, but also because the entire
> result set of the query has to be processed and sorted before the first row
> is returned. With result sets
> involvi
Toby Phipps wrote:
> [...] There's been a lot of talk about the UTF-8S proposal on both the
> unicode and unicore list, so please forgive me (and notify me if you feel
> the need) if I have missed any of the salient points that require a response.
You made a very clever summary of such a lo
Carl W. Brown <[EMAIL PROTECTED]> wrote:
>In the case of strcmp the problem is that this won't even work on UCS-2. It
>detects the end of string with a single byte 0x00. You have to use a
>special Unicode compare routine; this routine needs to be fixed to produce
>proper compares. Most likely yo
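
The point about the single 0x00 byte can be illustrated with a small sketch. It is not a real strcmp; `c_strlen` is a hypothetical helper that only mimics how a byte-oriented C routine finds the end of a string, and UTF-16LE stands in for UCS-2 here (the two agree for BMP characters):

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen: count bytes up to, not including, the first 0x00 byte."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx


text = "AB"
ucs2 = text.encode("utf-16-le")  # bytes 41 00 42 00
utf8 = text.encode("utf-8")      # bytes 41 42

# A byte-oriented routine stops at the zero byte inside the first 16-bit unit.
print(c_strlen(ucs2))  # 1
print(c_strlen(utf8))  # 2
```

Any comparison built on that terminator convention would see only the first byte of the 16-bit text, which is why a routine that works on 16-bit code units is needed.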
o:[EMAIL PROTECTED]]On
Behalf Of [EMAIL PROTECTED]
Sent: Friday, June 08, 2001 4:11 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: UTF-8 Syntax
As one of the proponents of the UTF-8S proposal, I feel compelled to
respond to some of the recent comments regarding the proposal on the
u
* Lars Marius Garshol
|
| It seems to me that a lenient search engine, since it searches in an
| index it has built for itself, would turn the UTF-8 it indexes into
| a canonical form (say ). It would then canonicalize any
| strings it is asked to search for into the same form, regardless of
| wh
Lars M. responded:
> |
> | A *lenient* search engine could also search for the irregular
> | pattern, i.e., it could consider and
> | 80> both to be matches for U-0001, but that would slow it down.
>
> It seems to me that a lenient search engine, since it searches in an
> index it has built
On 06/08/2001 07:26:54 PM Jianping Yang wrote:
>The issue comes from unpaired surrogates as and can be
>in UTF-8 and your search for (which is Unicode scalar value
>U-0001) cannot find it. However, when the UTF-8 string is converted into
>UTF-16, and will become
>, and you can find t
Ken,
Thanks, your comment could close this argument against UTF-8S syntax as the attack
here is groundless now, because there is no need to encode and as separate *paired* surrogates in UTF-8S and they will always be converted
into 0x1 in UTF-32 or in UTF-8. So there is no ambiguity an
Jianping said:
> The issue comes from unpaired surrogates as and
These are not *unpaired* surrogates -- they are *paired* surrogates.
Else your equating them to or U-0001 would make no sense.
> can be
> in UTF-8
They cannot be in well-formed UTF-8. They can only be in ill-formed
UTF-8
* Kenneth Whistler
|
| The problem comes when someone, contrary to the conformance
| requirements of the standard, has emitted irregular UTF-8 for the
| character in question, so that instead of , the string
| has in it.
|
| A *lenient* search engine could also search for the irregular
| patte
> From: Jianping Yang [mailto:[EMAIL PROTECTED]]
> The issue comes from unpaired surrogates as and can be
> in UTF-8 and your search for (which is Unicode scalar value
> U-0001) cannot find it.
This is good, because is U-d800 and is
U-dc00, so they should not mat
The issue comes from unpaired surrogates as and can be
in UTF-8 and your search for (which is Unicode scalar value
U-0001) cannot find it. However, when the UTF-8 string is converted into
UTF-16, and will become
, and you can find the same character by searching in
UTF-16.
Unless this
Toby, I think you forgot to comment on these objections that have also
been coming up from time to time:
* Introduction of UTF-8S would merely add to the myriad forms people would
already have to support, and it is insufficiently distinguishable from
UTF-8.
* encoding ambiguities in the s
As one of the proponents of the UTF-8S proposal, I feel compelled to
respond to some of the recent comments regarding the proposal on the
unicode and unicore lists. Although there have been some good comments
about how the goals of the proposal could be accomplished without a new
encoding form, t
On 06/08/2001 01:33:16 PM Jianping Yang wrote:
10 lines of new text, and 141 lines of quoted text without any comments
interspersed. Please edit your responses.
- Peter
---
Peter Constable
Non-Roman Script Initiative,
>Peter,
>
>There is a standard Unicode sort order, the code point sort order. This
>proposal calls for establishing an alternate code point order by
>establishing a new set of encoding schemes.
Yes, I'm very aware of what it is calling for. (You've missed my voluminous
comments against it?)
M
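
The two orders being contrasted can be compared with a short sketch. It uses only Python's built-in codecs; the two sample characters (U+FFFD and U+10000) are chosen for illustration and do not come from the thread:

```python
bmp = "\uFFFD"       # U+FFFD, a BMP character above the surrogate range
supp = "\U00010000"  # U+10000, the first supplementary character

# Binary comparison of the UTF-8 bytes follows code point order.
utf8_order = sorted([supp, bmp], key=lambda s: s.encode("utf-8"))

# Binary comparison of UTF-16 code units puts the supplementary character
# first, because its surrogates (D800..DFFF) sort below FFFD.
utf16_order = sorted([supp, bmp], key=lambda s: s.encode("utf-16-be"))

print([f"U+{ord(c):04X}" for c in utf8_order])   # ['U+FFFD', 'U+10000']
print([f"U+{ord(c):04X}" for c in utf16_order])  # ['U+10000', 'U+FFFD']
```

The disagreement only arises between supplementary characters and BMP characters in the range U+E000..U+FFFF; everywhere else the two binary orders coincide.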
Jianping wrote:
> From your analysis, it makes me believe even more that we need a UTF-8S not only for the
> binary order but also for this ambiguity applying to both UTF-8S and UTF-16. As
> proposed UTF-8S encoding is logically equivalent to the UTF-16, they share the same
> property which is differen
>This will fix the following problem for example:
>For a search engine searching for the character U-0001 in a UTF-8 string, it
>could not find it. But when the UTF-8 is converted into UTF-16, it can find it
>there because and are converted into U-0001000 in UTF-16.
Eh? Whatever on earth are
> From: Jianping Yang [mailto:[EMAIL PROTECTED]]
> This will fix the following problem for example:
> For a search engine searching for the character U-0001 in a UTF-8 string, it
> could not find it. But when the UTF-8 is converted into UTF-16, it can find it there
> because and are con
ideas like Microsoft's version of Java have died. Let's hope that UTF-8s
also dies.
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of [EMAIL PROTECTED]
Sent: Thursday, June 07, 2001 10:52 PM
To: [EMAIL PROTECTED]
Subject: Re: UTF-8 syntax
On
t;
> -Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Jianping Yang
> Sent: Thursday, June 07, 2001 6:51 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: UTF-8 syntax
>
> I don't get point from this arg
Ken,
From your analysis, it makes me believe even more that we need a UTF-8S not only for the
binary order but also for this ambiguity applying to both UTF-8S and UTF-16. As
proposed UTF-8S encoding is logically equivalent to the UTF-16, they share the same
property which is different from UTF-8 and U
---
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Jianping Yang
Sent: Thursday, June 07, 2001 6:51 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: UTF-8 syntax
I don't get the point of this argument, as UTF-8S is exactly mapped to UTF-16 in
UTF-16 code units, which means on
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> The definitions have problems that need to be fixed, though, and they're
> less clear for UTF-16 than they are for UTF-8. I'm becoming inclined to say
> that any argumentation for or against UTF-8s on the basis of whether it
> runs i
On 08/06/2001 04:22:15 Kenneth Whistler wrote:
[...]
| I think the reason you are not following the argument that Doug and Peter
| have been presenting is that you are thinking in terms of a UTF-8s to
| UTF-16 converter, instead of thinking of the UTF's as they are defined
| in relation to scal
Peter,
I haven't been able to follow all of the postings
today.* I disagree with the conclusion you draw -
The UTF-8 Corrigendum
(http://www.unicode.org/unicode/reports/tr27/#conformance)
is pretty explicit on that (D36), and is the
result of long debate within the UTC. After all,
if there were
On 06/07/2001 09:37:45 PM Peter Constable wrote:
>>So if you are saying there is ambiguity in
>>UTF-8S, it should also apply to UTF-16, which does not make sense to me.
>
>You know what? After all my harping, you're absolutely right on that point.
I'm starting to wonder if I wasn't thinking thi
Jianping,
> I don't get the point of this argument, as UTF-8S is exactly mapped to UTF-16 in
> UTF-16 code units, which means one UTF-16 code unit will be mapped to either one,
> two, or three bytes in UTF-8S. So if you are saying there is ambiguity in
> UTF-8S, it should also apply to UTF-16, which d
I don't get the point of this argument, as UTF-8S is exactly mapped to UTF-16 in
UTF-16 code units, which means one UTF-16 code unit will be mapped to either one,
two, or three bytes in UTF-8S. So if you are saying there is ambiguity in
UTF-8S, it should also apply to UTF-16, which does not make sense
On 06/07/2001 08:50:37 PM Jianping Yang wrote:
>I don't get the point of this argument, as UTF-8S is exactly mapped to UTF-16 in
>UTF-16 code units, which means one UTF-16 code unit will be mapped to either one,
>two, or three bytes in UTF-8S. So if you are saying there is ambiguity in
>UTF-8S, it sh
On 06/07/2001 10:38:15 AM DougEwell2 wrote:
>The ambiguity comes from the fact that, if I am using UTF-8s and I want to
>represent the sequence of (invalid) scalar values , I must use the
>UTF-8s sequence , and if I want to represent the (valid)
>scalar value <1>, I must *also* use the UTF-8
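
The ambiguity described here can be reproduced with a rough sketch. Everything in it is an assumption made for illustration: the helper names are hypothetical, "UTF-8s" is modelled as "encode each UTF-16 code unit with the ordinary one- to three-byte UTF-8 bit pattern" (which is how the proposal is described elsewhere in this thread), and the sample values D800, DC00 and 10000 (hex) are assumed, since the values in the message did not survive:

```python
def utf16_units(scalars):
    """Turn a list of code point values into UTF-16 code units."""
    units = []
    for s in scalars:
        if s >= 0x10000:
            s -= 0x10000
            units += [0xD800 + (s >> 10), 0xDC00 + (s & 0x3FF)]
        else:
            units.append(s)  # includes lone surrogate values, passed through
    return units


def utf8s_from_units(units):
    """Encode each 16-bit unit on its own with the 1-3 byte UTF-8 pattern."""
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)


two_surrogates = utf8s_from_units(utf16_units([0xD800, 0xDC00]))
one_supplementary = utf8s_from_units(utf16_units([0x10000]))

print(two_surrogates.hex(" "))     # ed a0 80 ed b0 80
print(one_supplementary.hex(" "))  # ed a0 80 ed b0 80
```

Both inputs produce the same six bytes, so a decoder cannot tell from the bytes alone which of the two was meant. The same collapse already happens at the `utf16_units` step, which is the observation made elsewhere in the thread that the issue applies to UTF-16 as well.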
In a message dated 2001-06-07 1:03:04 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
> >But definition D29 says that a UTF must round-trip these invalid code points,
> >so we have no choice but to interpret them as . That is why
> >UTF-8s is ambiguous. The sequence could be mapped as
EMAIL PROTECTED]'
Subject: RE: UTF-8 syntax
Peter Constable wrote:
> As I mentioned in an earlier message, the definitions in
> Unicode are less explicit when it comes to interpretation
> than they are with regard to encoding.
Perhaps we see interpretation rules unclear NOW, because
On 06/07/2001 05:17:37 AM Marco Cimarosti wrote:
>So, is NOT a six-byte sequence: it is two adjacent
>THREE-byte sequences: and , and the meaning of these
>sequences is already clear enough by the rules (Table 3.1B): the first one
>means U+D800 and the second one means U+DC00.
Yes, and those
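
The mechanical reading of a three-byte sequence can be sketched as follows. The byte values in the quoted message did not survive, so the example assumes ED A0 80 and ED B0 80, which are the three-byte bit patterns for U+D800 and U+DC00. The function name is hypothetical, and the sketch deliberately applies only the bit layout; it does not check whether a conformant decoder would accept the result:

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Apply the 1110xxxx 10xxxxxx 10xxxxxx bit layout and nothing else."""
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 bit pattern")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


first = decode_three_byte(0xED, 0xA0, 0x80)
second = decode_three_byte(0xED, 0xB0, 0x80)

print(f"U+{first:04X}")   # U+D800
print(f"U+{second:04X}")  # U+DC00
```

Read this way, the six bytes are two independent three-byte units, each yielding a surrogate value; whether those values may then be combined into one supplementary character is exactly what the thread is arguing about.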
On 06/07/2001 02:32:45 AM Peter Constable wrote:
>We are left to infer that "mapped back" means the exact inverse of the mapping
>defined (in the case of UTF-8) in D36. But note: making that inference
>assumes that the mapping in D36 is invertible. That requires that the
>mapping in D36 is inje
Peter Constable wrote:
> As I mentioned in an earlier message, the definitions in
> Unicode are less explicit when it comes to interpretation
> than they are with regard to encoding.
Perhaps we see interpretation rules unclear NOW, because this discussion has
been mixing up UTF-8 and UTF-8s.
Th
[ copying to unicoRe as I think there are concerns relevant regarding poor
handling of the definitions in TUS, and more importantly some problems with
the definitions ]
On 06/07/2001 12:34:49 AM DougEwell2 wrote:
>But definition D29 says that a UTF must round-trip these invalid code points,
>
In a message dated 2001-06-06 9:35:45 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
> we see that Unicode does not *exclude* D800 and DC00 from the
> codespace for the CCS, and therefore it would seem that that UTF-8 sequence
> would have to be interpreted (in the encoding form level of in
On 06/06/2001 17:20:50 Peter Constable wrote:
> >Peter Constable replied:
> >> That has to do with XML conformance, not Unicode. You were
> >> looking in the wrong spec.
> >
> >I did not grasp that Mark was talking about XML
>
> I made a wrong assumption about what Mark was meaning. He used "str
> U = (C(subscript: H) ? D800(subscript: 16)) * 400(subscript: 16) + (C
> (subscript: L) ? DC00(subscript: 16)) + 1(subscript: 16)
That didn't survive very well did it. I think you probably got the gist.
The "?" are supposed to be minus signs. I didn't explain that CL and CH are
low- and h
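
Reading the "?" as minus signs, as the message says, the garbled formula matches the usual UTF-16 surrogate-pair arithmetic. A minimal sketch, assuming C_H and C_L are the high and low surrogate code units and that the final constant is the standard offset 10000 (hex), which the quoted text shows only in shortened form; the function name is illustrative:

```python
def surrogates_to_scalar(high: int, low: int) -> int:
    """U = (C_H - D800) * 400 + (C_L - DC00) + 10000, all constants in hex."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("high surrogate out of range")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("low surrogate out of range")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


print(hex(surrogates_to_scalar(0xD800, 0xDC00)))  # 0x10000
print(hex(surrogates_to_scalar(0xDBFF, 0xDFFF)))  # 0x10ffff
```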
>Peter Constable replied:
>> That has to do with XML conformance, not Unicode. You were
>> looking in the wrong spec.
>
>I did not grasp that Mark was talking about XML
I made a wrong assumption about what Mark was meaning. He used "strict" in
a way that I don't really see supported in the defin
I (Marco Cimarosti) asked:
> 2) According to the Unicode Standard (with no higher-level protocols in
> action), what code point(s) correspond(s) to the sequence of UTF-32BE octets
> <00 00 00 00 D8 00 00 00 00 00 DC 00>:
> A) ?
> B) ?
>
Ooops! Too many octets. That would rather
I (Marco Cimarosti) asked:
> >1) The difference between "lenient" vs. "strict" parsers.
Mark Davis replied:
> 1. By strict, I meant "excludes irregular sequences"
Peter Constable replied:
> That has to do with XML conformance, not Unicode. You were
> looking in the wrong spec.
I did not grasp
y, June 05, 2001 10:31
Subject: Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))
>
> >I am a little bit confused. I re-read conformance rules and the UTF-8
> >Corrigendum, and I could find these two things:
> >
> >1) The difference between "leni
[EMAIL PROTECTED] scripsit:
> XML requires (recommends?) data to be
> normalised in normal form C. That imposes private (well, open actually, but
> private in the sense of limited to that protocol) constraints against
> otherwise legal Unicode character sequences.
Currently it neither requires n
>I am a little bit confused. I re-read conformance rules and the UTF-8
>Corrigendum, and I could find these two things:
>
>1) The difference between "lenient" vs. "strict" parsers.
That has to do with XML conformance, not Unicode. You were looking in the
wrong spec.
>2) The rule that an UTF-8
Mark Davis wrote:
> It is either one code point (lenient parser) or an error
> (strict parser). It is never two.
I am a little bit confused. I re-read conformance rules and the UTF-8
Corrigendum, and I could find these two things:
1) The difference between "lenient" vs. "strict" parsers.
2) Th
53 matches