Fw: UTF-8 Syntax

2001-06-13 Thread Mark Davis
- Original Message - From: "Mark Davis" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Monday, June 11, 2001 08:06 Subject: Re: UTF-8 Syntax > I would rank the situations, from an interpreter's point of view, in the > followi

RE: UTF-8 Syntax

2001-06-12 Thread toby_phipps
Marco Cimarosti <[EMAIL PROTECTED]> wrote: >My assumption was that, in the first case (no sort order requested by the >client), a server could in theory provide a result set randomly shuffled. Of >course, I know that this won't normally happen but, however, the server is >allowed to provide whate

RE: UTF-8 Syntax

2001-06-11 Thread Carl W. Brown
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED] Sent: Monday, June 11, 2001 2:33 AM To: [EMAIL PROTECTED] Subject: RE: UTF-8 Syntax >Carl W. Brown <[EMAIL PROTECTED]> wrote: >>In the case of strcmp the problem is

Re: UTF-8 Syntax

2001-06-11 Thread Antoine Leca
[EMAIL PROTECTED] wrote: > > Carl W. Brown <[EMAIL PROTECTED]> wrote: > >In the case of strcmp the problem is that this won't even work on UCS-2. > >It detects the end of string with a single byte 0x00. You have to use a > >special Unicode compare routine this routine needs to be fixed to produc

Re: UTF-8 syntax

2001-06-11 Thread Antoine Leca
Jianping Yang wrote: > > [UTF-8S] will fix the following problem for example: > For a searching engine to search the character U-0001 in UTF-8 string, and it > could not find. But when UTF-8 is converted into UTF-16, it can found it there > because and are converted into U-0001000 in UTF-

[OT] RE: UTF-8 Syntax

2001-06-11 Thread Peter_Constable
>Yes, all these uses may be internal to each >vendor, but as Uma has stated, internal representation leaks out. If any >significant number of vendors are going to be using this encoding >internally in their systems, wouldn't it make sense to have a UTR >describing what this representation is, wh

RE: UTF-8 Syntax

2001-06-11 Thread Marco Cimarosti
Toby Phipps wrote: > This is incredibly inefficient, not only > because significant amounts of temporary space needs to be > allocated and > freed, but also because the entire result set of the query has to be > processed and sorted before the first row is returned. With > result sets > involvi

RE: UTF-8 Syntax

2001-06-11 Thread Marco Cimarosti
Toby Phipps wrote: > [...] There's been a lot of talk about the UTF-8S > proposal on both the unicode and unicore list, so please > forgive me (and > notify me if you feel the need) if I have missed any of the > salient points that require a response. You made a very clever summary of such a lo

RE: UTF-8 Syntax

2001-06-11 Thread toby_phipps
Carl W. Brown <[EMAIL PROTECTED]> wrote: >In the case of strcmp the problem is that this won't even work on UCS-2. It >detects the end of string with a single byte 0x00. You have to use a >special Unicode compare routine this routine needs to be fixed to produce >proper compares. Most likely yo
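
The point about the single 0x00 byte can be illustrated with a small sketch. It assumes a hypothetical helper, `c_strlen`, that mimics how a byte-oriented C routine such as strcmp finds the end of a string; it is not taken from any message in the thread.

```python
def c_strlen(buf: bytes) -> int:
    """Count bytes up to the first 0x00, as a byte-oriented C string routine would."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


text = "AB"

# UCS-2 / UTF-16 big-endian: 00 41 00 42. The very first byte is 0x00.
ucs2_be_len = c_strlen(text.encode("utf-16-be"))

# UCS-2 / UTF-16 little-endian: 41 00 42 00. The scan stops after one byte.
ucs2_le_len = c_strlen(text.encode("utf-16-le"))

# UTF-8: 41 42. No embedded zero bytes for non-NUL characters.
utf8_len = c_strlen(text.encode("utf-8"))
```

Under these assumptions the byte-oriented scan sees none or only one byte of the two-character UCS-2 string, while it sees both bytes of the UTF-8 form, which is why a code-unit-aware compare routine is needed for 16-bit data.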

RE: UTF-8 Syntax

2001-06-10 Thread Carl W. Brown
o:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED] Sent: Friday, June 08, 2001 4:11 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: UTF-8 Syntax As one of the proponents of the UTF-8S proposal, I feel compelled to respond to some of the recent comments regarding the proposal on the u

Re: UTF-8 syntax

2001-06-10 Thread Lars Marius Garshol
* Lars Marius Garshol | | It seems to me that a lenient search engine, since it searches in an | index it has built for itself, would turn the UTF-8 it indexes into | a canonical form (say ). It would then canonicalize any | strings it is asked to search for into the same form, regardless of | wh

Re: UTF-8 syntax

2001-06-09 Thread Kenneth Whistler
Lars M. responded: > | > | A *lenient* search engine could also search for the irregular > | pattern, i.e., it could consider and | 80> both to be matches for U-0001, but that would slow it down. > > It seems to me that a lenient search engine, since it searches in an > index it has built

Re: UTF-8 syntax

2001-06-08 Thread Peter_Constable
On 06/08/2001 07:26:54 PM Jianping Yang wrote: >The issue comes from unpaired surrogates as and can be >in UTF-8 and your search for (which is Unicode scalar value >U-0001) cannot find it. But however, when the UTF-8 string converted into >UTF-16, and will become >, and you can find t

Re: UTF-8 syntax

2001-06-08 Thread Jianping Yang
Ken, Thanks, your comment could close this argument against UTF-8S syntax as the attack here is groundless now, because there is no need to encoding and as separate *paired* surrogates in UTF-8S and they will always be converted into 0x1 in UTF-32 or in UTF-8. So there is no ambiguity an

Re: UTF-8 syntax

2001-06-08 Thread Kenneth Whistler
Jianping said: > The issue comes from unpaired surrogates as and These are not *unpaired* surrogates -- they are *paired* surrogates. Else your equating them to or U-0001 would make no sense. > can be > in UTF-8 They cannot be in well-formed UTF-8. They can only be in ill-formed UTF-8

Re: UTF-8 syntax

2001-06-08 Thread Lars Marius Garshol
* Kenneth Whistler | | The problem comes when someone, contrary to the conformance | requirements of the standard, has emitted irregular UTF-8 for the | character in question, so that instead of , the string | has in it. | | A *lenient* search engine could also search for the irregular | patte

RE: UTF-8 syntax

2001-06-08 Thread Ayers, Mike
> From: Jianping Yang [mailto:[EMAIL PROTECTED]] > The issue comes from unpaired surrogates as and > can be > in UTF-8 and your search for (which is Unicode > scalar value > U-0001) cannot find it. This is good, because is U-d800 and is U-dc00, so they should not mat

Re: UTF-8 syntax

2001-06-08 Thread Jianping Yang
The issue comes from unpaired surrogates as and can be in UTF-8 and your search for (which is Unicode scalar value U-0001) cannot find it. But however, when the UTF-8 string converted into UTF-16, and will become , and you can find the same character by searching in UTF-16. Unless this

Re: UTF-8 Syntax

2001-06-08 Thread Rick McGowan
Toby, I think you forgot to comment on these objections that have also been coming up from time to time: * Introduction of UTF-8S would merely add to the myriad forms people would already have to support, and it is insufficiently distinguishable from UTF-8. * encoding ambiguities in the s

Re: UTF-8 Syntax

2001-06-08 Thread toby_phipps
As one of the proponents of the UTF-8S proposal, I feel compelled to respond to some of the recent comments regarding the proposal on the unicode and unicore lists. Although there have been some good comments about how the goals of the proposal could be accomplished without a new encoding form, t

Re: UTF-8 syntax

2001-06-08 Thread Peter_Constable
On 06/08/2001 01:33:16 PM Jianping Yang wrote: 10 lines of new text, and 141 lines of quoted text without any comments interspersed. Please edit your responses. - Peter --- Peter Constable Non-Roman Script Initiative,

RE: UTF-8 syntax

2001-06-08 Thread Peter_Constable
>Peter, > >There is a standard Unicode sort order, the code point sort order. This >proposal calls for establishing and alternate code point order by >establishing a new set of encoding schemes. Yes, I'm very aware of what it is calling for. (You've missed my voluminous comments against it?) M
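
A short sketch of the ordering difference behind this exchange, as the thread describes it: binary comparison of UTF-8 bytes follows code point order, while binary comparison of UTF-16 code units does not once supplementary characters are involved. The sample characters U+FF5E and U+10000 and the key-function names are chosen here for illustration only.

```python
def utf8_key(s: str) -> bytes:
    """Binary (byte-wise) sort key for the UTF-8 form."""
    return s.encode("utf-8")


def utf16_key(s: str) -> bytes:
    """Binary sort key for UTF-16; big-endian bytes compare like 16-bit code units."""
    return s.encode("utf-16-be")


def code_point_key(s: str) -> list:
    """Sort key that compares strings by code point."""
    return [ord(c) for c in s]


bmp = "\uff5e"       # a BMP character above the surrogate range
supp = "\U00010000"  # first supplementary character (D800 DC00 in UTF-16)

by_code_point = sorted([supp, bmp], key=code_point_key)
by_utf8 = sorted([supp, bmp], key=utf8_key)
by_utf16 = sorted([supp, bmp], key=utf16_key)
```

Here `by_utf8` agrees with `by_code_point`, and `by_utf16` puts the supplementary character first because its leading code unit D800 is smaller than FF5E.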

Re: UTF-8 syntax

2001-06-08 Thread Kenneth Whistler
Jianping wrote: > From your analysis, it make me more believe that we need a UTF-8S not only for the > binary order but also for this ambiguity applying to both UTF-8S and UTF-16. As > proposed UTF-8S encoding is logically equivalent to the UTF-16, they share the same > property which is differen

Re: UTF-8 syntax

2001-06-08 Thread Peter_Constable
>This will fix the following problem for example: >For a searching engine to search the character U-0001 in UTF-8 string, and >it >could not find. But when UTF-8 is converted into UTF-16, it can found it there >because and are converted into U-0001000 in UTF-16. Eh? Whatever on earth are

RE: UTF-8 syntax

2001-06-08 Thread Ayers, Mike
> From: Jianping Yang [mailto:[EMAIL PROTECTED]] > This will fix the following problem for example: > For a searching engine to search the character U-0001 in > UTF-8 string, and it > could not find. But when UTF-8 is converted into UTF-16, it > can found it there > because and are con

RE: UTF-8 syntax

2001-06-08 Thread Carl W. Brown
ideas like Microsoft's version of Java have died. Lets hope that UTF-8s also dies. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED] Sent: Thursday, June 07, 2001 10:52 PM To: [EMAIL PROTECTED] Subject: Re: UTF-8 syntax On

Re: UTF-8 syntax

2001-06-08 Thread Jianping Yang
> > -Original Message----- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On > Behalf Of Jianping Yang > Sent: Thursday, June 07, 2001 6:51 PM > To: [EMAIL PROTECTED] > Cc: [EMAIL PROTECTED] > Subject: Re: UTF-8 syntax > > I don't get point from this arg

Re: UTF-8 syntax

2001-06-08 Thread Jianping Yang
Ken, From your analysis, it make me more believe that we need a UTF-8S not only for the binary order but also for this ambiguity applying to both UTF-8S and UTF-16. As proposed UTF-8S encoding is logically equivalent to the UTF-16, they share the same property which is different from UTF-8 and U

RE: UTF-8 syntax

2001-06-08 Thread Carl W. Brown
--- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Jianping Yang Sent: Thursday, June 07, 2001 6:51 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: UTF-8 syntax I don't get point from this argument as UTF-8S is exactly mapped to UTF-16 in UTF-16 code unit which means on

RE: UTF-8 syntax

2001-06-08 Thread Ayers, Mike
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] > The definitions have problems that need to be fixed, though, > and they're > less clear for UTF-16 than they are for UTF-8. I'm becoming > inclined to say > that any argumentation for or against UTF-8s on the basis of > whether it > runs i

Re: UTF-8 syntax

2001-06-08 Thread Misha Wolf
On 08/06/2001 04:22:15 Kenneth Whistler wrote: [...] | I think the reason you are not following the argument that Doug and Peter | have been presenting is that you are thinking in terms of a UTF-8s to | UTF-16 converter, instead of thinking of the UTF's as they are defined | in relation to scal

RE: UTF-8 syntax

2001-06-08 Thread Mark Davis
Peter, I haven't been able to follow all of the postings today.* I disagree with the conclusion you draw - The UTF-8 Corrigendum (http://www.unicode.org/unicode/reports/tr27/#conformance) is pretty explicit on that (D36), and is the result of long debate within the UTC. After all, if there were

Re: UTF-8 syntax

2001-06-08 Thread Peter_Constable
On 06/07/2001 09:37:45 PM Peter Constable wrote: >>So if you are saying there is ambiguous in >>UTF-8S, it should also apply to UTF-16, which does not make sense to me. > >You know what? After all my harping, you're absolutely right on that point. I'm starting to wonder if I wasn't thinking thi

Re: UTF-8 syntax

2001-06-07 Thread Kenneth Whistler
Jianping, > I don't get point from this argument as UTF-8S is exactly mapped to UTF-16 in > UTF-16 code unit which means one UTF-16 code unit will be mapped to either one, > two, or three bytes in UTF-8S. So if you are saying there is ambiguous in > UTF-8S, it should also apply to UTF-16, which d

Re: UTF-8 syntax

2001-06-07 Thread Jianping Yang
I don't get point from this argument as UTF-8S is exactly mapped to UTF-16 in UTF-16 code unit which means one UTF-16 code unit will be mapped to either one, two, or three bytes in UTF-8S. So if you are saying there is ambiguous in UTF-8S, it should also apply to UTF-16, which does not make sense

Re: UTF-8 syntax

2001-06-07 Thread Peter_Constable
On 06/07/2001 08:50:37 PM Jianping Yang wrote: >I don't get point from this argument as UTF-8S is exactly mapped to UTF-16 in >UTF-16 code unit which means one UTF-16 code unit will be mapped to either one, >two, or three bytes in UTF-8S. So if you are saying there is ambiguous in >UTF-8S, it sh

Re: UTF-8 syntax

2001-06-07 Thread Peter_Constable
On 06/07/2001 10:38:15 AM DougEwell2 wrote: >The ambiguity comes from the fact that, if I am using UTF-8s and I want to >represent the sequence of (invalid) scalar values , I must use the >UTF-8s sequence , and if I want to represent the (valid) >scalar value <1>, I must *also* use the UTF-8
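
The ambiguity being argued here can be sketched in a few lines. The function below is a hypothetical `utf8s_encode`, written under the assumption (taken from the thread's description) that UTF-8S encodes each UTF-16 code unit separately in one, two, or three bytes; it is not an implementation from the proposal itself. The inputs U+10000 and the pair D800 DC00 are supplied for illustration.

```python
def utf8s_encode(s: str) -> bytes:
    """Encode each UTF-16 code unit of s as its own 1-3 byte UTF-8-style sequence.

    Assumption: this mirrors the UTF-8S idea as described in the thread.
    """
    # 'surrogatepass' lets lone surrogate code points through the codecs.
    units = s.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = (units[i] << 8) | units[i + 1]
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# The valid scalar value U+10000 ...
from_scalar = utf8s_encode("\U00010000")

# ... and the two surrogate code points D800, DC00 taken individually.
from_surrogates = utf8s_encode("\ud800\udc00")

# Standard UTF-8 keeps a distinct four-byte form for U+10000.
standard_utf8 = "\U00010000".encode("utf-8")
```

Both inputs yield the same six bytes under this sketch, so a decoder cannot tell which of the two was meant, whereas standard UTF-8 uses a separate four-byte form for the supplementary character.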

Re: UTF-8 syntax

2001-06-07 Thread DougEwell2
In a message dated 2001-06-07 1:03:04 Pacific Daylight Time, [EMAIL PROTECTED] writes: > >But definition D29 says that a UTF must round-trip these invalid code > points, > >so we have no choice but to interpret them as . That is why > >UTF-8s is ambiguous. The sequence could be mapped as

RE: UTF-8 syntax

2001-06-07 Thread Carl W. Brown
Peter Constable wrote: > As I mentioned in an earlier message, the definitions in > Unicode are less explicit when it comes to interpretation > than they are with regard to encoding. Perhaps we see interpretation rules unclear NOW, because

RE: UTF-8 syntax

2001-06-07 Thread Peter_Constable
On 06/07/2001 05:17:37 AM Marco Cimarosti wrote: >So, is NOT a six-byte sequence: it is two adjacent >THREE-byte sequences: and , and the meaning of these >sequences is already clear enough by the rules (Table 3.1B): the first one >means U+D800 and the second one means U+DC00. Yes, and those
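
A minimal sketch of the reading described above, in which each three-byte sequence is decoded on its own. The bracketed byte sequences are missing from this preview, so the values ED A0 80 and ED B0 80 used below are an assumption: they are what the generic three-byte bit layout gives for U+D800 and U+DC00. The helper name `decode_three_byte` is hypothetical.

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Apply the generic three-byte layout 1110xxxx 10xxxxxx 10xxxxxx.

    No well-formedness checks beyond the lead/trail bit patterns are made,
    so surrogate values are returned rather than rejected.
    """
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte sequence")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


first = decode_three_byte(0xED, 0xA0, 0x80)   # high surrogate code point
second = decode_three_byte(0xED, 0xB0, 0x80)  # low surrogate code point

# A strict decoder treats the same bytes as ill-formed instead.
try:
    bytes([0xED, 0xA0, 0x80]).decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

The bit arithmetic gives D800 and DC00 respectively, while Python's built-in UTF-8 decoder refuses the sequence, which matches the strict behavior discussed elsewhere in the thread.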

Re: UTF-8 syntax

2001-06-07 Thread Peter_Constable
On 06/07/2001 02:32:45 AM Peter Constable wrote: >We >are left to infer that "mapped back" means the exact inverse of the mapping >defined (in the case of UTF-8) in D36. But note: making that inference >assumes that the mapping in D36 is invertible. That requires that the >mapping in D36 is inje

RE: UTF-8 syntax

2001-06-07 Thread Marco Cimarosti
Peter Constable wrote: > As I mentioned in an earlier message, the definitions in > Unicode are less explicit when it comes to interpretation > than they are with regard to encoding. Perhaps we see interpretation rules unclear NOW, because this discussion has been mixing up UTF-8 and UTF-8s. Th

Re: UTF-8 syntax

2001-06-07 Thread Peter_Constable
[ copying to unicoRe as I think there are concerns relevant regarding poor handling of the definitions in TUS, and more importantly some problems with the definitions ] On 06/07/2001 12:34:49 AM DougEwell2 wrote: >But definition D29 says that a UTF must round-trip these invalid code points, >

Re: UTF-8 syntax

2001-06-06 Thread DougEwell2
In a message dated 2001-06-06 9:35:45 Pacific Daylight Time, [EMAIL PROTECTED] writes: > we see that Unicode does not *exclude* D800 and DC00 from the > codespace for the CCS, and therefore it would seem that that UTF-8 sequence > would have to be interpreted (in the encoding form level of in

RE: UTF-8 syntax

2001-06-06 Thread Misha Wolf
On 06/06/2001 17:20:50 Peter Constable wrote: > >Peter Constable replied: > >> That has to do with XML conformance, not Unicode. You were > >> looking in the wrong spec. > > > >I did not grasp that Mark was talking about XML > > I made a wrong assumption about what Mark was meaning. He used "str

RE: UTF-8 syntax

2001-06-06 Thread Peter_Constable
> U = (C(subscript: H) ? D800(subscript: 16)) * 400(subscript: 16) + (C > (subscript: L) ? DC00(subscript: 16)) + 1(subscript: 16) That didn't survive very well did it. I think you probably got the gist. The "?" are supposed to be minus signs. I didn't explain that CL and CH are low- and h
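
For readers who want the formula in runnable form, here is a sketch of the standard UTF-16 surrogate arithmetic. The final constant is cut off in the preview above; the code uses 0x10000, which is the value in the standard formula. The function names are hypothetical.

```python
def surrogates_to_scalar(high: int, low: int) -> int:
    """U = (CH - 0xD800) * 0x400 + (CL - 0xDC00) + 0x10000."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("high surrogate out of range")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("low surrogate out of range")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


def scalar_to_surrogates(u: int) -> tuple:
    """Inverse mapping for supplementary scalar values."""
    if not (0x10000 <= u <= 0x10FFFF):
        raise ValueError("not a supplementary scalar value")
    v = u - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


lowest = surrogates_to_scalar(0xD800, 0xDC00)
highest = surrogates_to_scalar(0xDBFF, 0xDFFF)
pair = scalar_to_surrogates(0x1F600)
```

The lowest pair maps to U+10000 and the highest to U+10FFFF, so the two functions cover exactly the supplementary range.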

RE: UTF-8 syntax

2001-06-06 Thread Peter_Constable
>Peter Constable replied: >> That has to do with XML conformance, not Unicode. You were >> looking in the wrong spec. > >I did not grasp that Mark was talking about XML I made a wrong assumption about what Mark was meaning. He used "strict" in a way that I don't really see supported in the defin

RE: UTF-8 syntax

2001-06-06 Thread Marco Cimarosti
I (Marco Cimarosti) asked: > 2) According to the Unicode Standard (with no higher-level > protocols in > action), what code point(s) correspond(s) to the sequence of > UTF-32BE octets > <00 00 00 00 D8 00 00 00 00 00 DC 00>: > A) ? > B) ? > Ooops! Too many octets. That would rather

RE: UTF-8 syntax

2001-06-06 Thread Marco Cimarosti
I (Marco Cimarosti) asked: > >1) The difference between "lenient" vs. "strict" parsers. Mark Davis replied: > 1. By strict, I meant "excludes irregular sequences" Peter Constable replied: > That has to do with XML conformance, not Unicode. You were > looking in the wrong spec. I did not grasp

Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread Mark Davis
y, June 05, 2001 10:31 Subject: Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)) > > >I am a little bit confused. I re-read conformance rules and the UTF-8 > >Corrigendum, and I could find these two things: > > > >1) The difference between "leni

Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread John Cowan
[EMAIL PROTECTED] scripsit: > XML requires (recommends?) data to be > normalised in normal form C. That imposes private (well, open actually, but > private in the sense of limited to that protocol) constraints against > otherwise legal Unicode character sequences. Currently it neither requires n

Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread Peter_Constable
>I am a little bit confused. I re-read conformance rules and the UTF-8 >Corrigendum, and I could find these two things: > >1) The difference between "lenient" vs. "strict" parsers. That has to do with XML conformance, not Unicode. You were looking in the wrong spec. >2) The rule that an UTF-8

UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

2001-06-05 Thread Marco Cimarosti
Mark Davis wrote: > It is either one code point (lenient parser) or an error > (strict parser). It is never two. I am a little bit confused. I re-read conformance rules and the UTF-8 Corrigendum, and I could find these two things: 1) The difference between "lenient" vs. "strict" parsers. 2) Th