Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Markus Scherer
I am not sure yet how far I want to get into this discussion... but this seems worth mentioning: Asmus Freytag wrote: The ideal case is one where the converter stops in a restartable configuration, allowing the client to implement (or ask for) a variety of error-recovery options. A nice
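A minimal Python sketch of the "restartable" converter idea quoted above, not taken from the thread. The names `decode_step` and `convert` are hypothetical. It assumes the converter reports the byte span of the first ill-formed sequence and lets the caller choose how to recover before resuming.

```python
def decode_step(data: bytes, offset: int = 0):
    """Decode UTF-8 from `offset` up to the first ill-formed sequence.

    Returns (text, bad_start, bad_end). bad_start and bad_end are None
    when the remainder of the input decoded cleanly. (Hypothetical helper.)
    """
    chunk = data[offset:]
    try:
        return chunk.decode("utf-8"), None, None
    except UnicodeDecodeError as exc:
        # exc.start / exc.end are indices into `chunk`, not into `data`.
        good = chunk[:exc.start].decode("utf-8")
        return good, offset + exc.start, offset + exc.end


def convert(data: bytes, on_error) -> str:
    """Drive decode_step, delegating every error to the client's policy.

    `on_error` receives the ill-formed bytes and returns the text to emit
    in their place (for example "" to strip, or U+FFFD to substitute).
    """
    out = []
    offset = 0
    while True:
        text, bad_start, bad_end = decode_step(data, offset)
        out.append(text)
        if bad_start is None:
            return "".join(out)
        out.append(on_error(data[bad_start:bad_end]))
        offset = bad_end  # restart just past the ill-formed bytes


if __name__ == "__main__":
    sample = b"ab\xffcd"
    print(repr(convert(sample, lambda bad: "\ufffd")))  # substitute
    print(repr(convert(sample, lambda bad: "")))        # strip
```

Because the stopping point is returned rather than hidden, the same converter can serve clients that abort, strip, or substitute.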

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Mark Davis
At 07:21 AM 3/2/03 -0800, Mark Davis wrote: C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Asmus Freytag
At 07:21 AM 3/2/03 -0800, Mark Davis wrote

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Asmus Freytag
At 11:52 AM 3/3/03 -0800, Mark Davis wrote: Perhaps I wasn't clear; I agree with you on that. 1) It is conformant to skip or substitute text, with just a code at the end indicating that something of that sort was done. It's a subtle point, but can be put into your formulation: What I was after

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Mark Davis
But, formally speaking, is it conformant for an API to not stop, and merely raise an error

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Mark Davis
At 11:52 AM 3/3/03 -0800, Mark Davis wrote: Perhaps I wasn't clear; I agree with you on that. 1) It is conformant to skip

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Asmus Freytag
At 01:07 PM 3/3/03 -0800, Mark Davis wrote: If your converter purports to produce any one of the Unicode encoding forms, then it cannot conformantly produce malformed Unicode as a result. If, of course, it does not purport to do that, it can do anything it wants to. Then, as long as the

RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Kent Karlsson
Michael (michka) Kaplan: ... then the conversion will simply strip the errant characters. Note that either solution meets the needs of refusal to interpret the errant sequences. Simply stripping the errant byte sequences means that they are each interpreted as the empty string of characters.
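To make the strip-versus-substitute distinction concrete, here is a small Python illustration using the standard `errors` argument of `bytes.decode`. It is only a sketch and is not the API discussed in the thread. The wrapper function names are hypothetical.

```python
def strip_ill_formed(data: bytes) -> str:
    """Drop ill-formed sequences silently; no trace of corruption remains."""
    return data.decode("utf-8", errors="ignore")


def substitute_ill_formed(data: bytes) -> str:
    """Replace ill-formed sequences with U+FFFD so the damage stays visible."""
    return data.decode("utf-8", errors="replace")


def reject_ill_formed(data: bytes) -> str:
    """Refuse the input outright: raises UnicodeDecodeError on ill-formed data."""
    return data.decode("utf-8", errors="strict")


if __name__ == "__main__":
    sample = b"caf\xc3\xa9 \xff ok"  # 0xFF can never appear in well-formed UTF-8
    print(repr(strip_ill_formed(sample)))
    print(repr(substitute_ill_formed(sample)))
    try:
        reject_ill_formed(sample)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

All three policies avoid interpreting the ill-formed bytes as characters. They differ in whether the output still shows that the input was corrupted.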

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Mark Davis
Michael (michka) Kaplan: ... then the conversion will simply strip the errant characters. Note that either solution meets the needs of refusal to interpret the errant sequences. Simply stripping the errant

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Michael (michka) Kaplan
From: Mark Davis [EMAIL PROTECTED] I agree with Kent that it is somewhat less robust to simply remove ill-formed sequences, since it removes any indication that the data was corrupted. Nice that the API gives one the option to choose, huh? ;-) The notion of continuing (even if one is

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Asmus Freytag
At 07:21 AM 3/2/03 -0800, Mark Davis wrote: C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition, and shall not interpret such sequences as

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-28 Thread Yung-Fong Tang
Kenneth Whistler wrote: Think of it this way. Does anyone expect the ASCII standard to tell, in detail, what a process should or should not do if it receives data which purports to be ASCII, but which contains an 0x80 byte in it? All the ASCII standard can really do is tell you that 0x80 is not

UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Kenneth Whistler
Frank Tang responded to Kent Karlsson's response: The problem I need to deal with is not GENERATE those UTF-8, but how to handle these DATA when my code receive it. For example, when I receive a 10K UTF-8 file which have 1000 lines of text, if there are one UTF-8 sequence in the line 990

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Tex Texin
Ken, Hmm, is that true? Is it ok then, if I detect an unpaired surrogate, mutter oops I have an error and then drop that surrogate and continue processing the file, resulting in a valid utf-8 file? I thought for some reason this was prohibited, but if the standard does not prescribe error

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Kenneth Whistler
Tex Texin asked: Hmm, is that true? Yes, it is true. All the standard *mandates* is what I quoted previously in this thread: C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Tex Texin
Kenneth Whistler wrote: Yes, it is true. All the standard *mandates* is what I quoted previously in this thread: C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an