I am not sure yet how far I want to get into this discussion... but this seems worth mentioning:
Asmus Freytag wrote:
The ideal case is one where the converter stops in a restartable
configuration, allowing the client to implement (or ask for) a variety
of error-recovery options.
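
For illustration only, here is a minimal Python sketch of what a "restartable" stop could look like. The function name `decode_restartable` and its return shape are hypothetical, not anything proposed in the thread; the sketch relies only on the `start` and `end` attributes that Python's `UnicodeDecodeError` carries.

```python
def decode_restartable(data: bytes, start: int = 0):
    """Decode UTF-8 from offset `start`, stopping at the first ill-formed sequence.

    Returns (text, next_offset, error).
    `error` is None if everything from `start` onward decoded cleanly;
    otherwise it is the (begin, end) byte range of the offending sequence,
    as absolute offsets into `data`, so the caller can pick a recovery policy.
    """
    try:
        return data[start:].decode("utf-8"), len(data), None
    except UnicodeDecodeError as exc:
        # exc.start / exc.end are relative to the slice that was decoded.
        bad_begin = start + exc.start
        bad_end = start + exc.end
        good = data[start:bad_begin].decode("utf-8")
        return good, bad_begin, (bad_begin, bad_end)


# The client, not the converter, decides what to do with the bad bytes.
data = b"abc\xffdef"
head, pos, err = decode_restartable(data)
# Here the client chooses to skip the offending bytes and resume after them.
tail, pos2, err2 = decode_restartable(data, err[1])
```

The converter itself never skips or substitutes anything; it only reports where it stopped, which leaves every recovery option open to the caller.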
A nice
(was: Re: Unicode 4.0 BETA available for
review)
At 07:21 AM 3/2/03 -0800, Mark Davis wrote:
C12a When a process interprets a code unit sequence which
purports to be in a Unicode character encoding form, it
shall treat ill-formed code unit sequences as an error
condition
]; Kent Karlsson
[EMAIL PROTECTED]; 'Michael (michka) Kaplan' [EMAIL PROTECTED]
Cc: 'Yung-Fong Tang' [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Sunday, March 02, 2003 21:10
Subject: Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
review)
At 07:21 AM 3/2/03 -0800, Mark Davis wrote
At 11:52 AM 3/3/03 -0800, Mark Davis wrote:
Perhaps I wasn't clear; I agree with you on that.
1) It is conformant to skip or substitute text, with just a code at the end
indicating that something of that sort was done.
It's a subtle point, but can be put into your formulation:
What I was after
(michka) Kaplan' [EMAIL PROTECTED]
Cc: 'Yung-Fong Tang' [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Monday, March 03, 2003 11:21
Subject: Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
review)
But, formally speaking, is it conformant for an API to not stop, and merely
raise an error
: 'Yung-Fong Tang' [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Monday, March 03, 2003 12:17
Subject: Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
review)
At 11:52 AM 3/3/03 -0800, Mark Davis wrote:
Perhaps I wasn't clear; I agree with you on that.
1) It is conformant to skip
At 01:07 PM 3/3/03 -0800, Mark Davis wrote:
If your converter purports to produce any one of the Unicode encoding forms,
then it cannot conformantly produce malformed Unicode as a result.
If, of course, it does not purport to do that, it can do anything it wants
to.
Then, as long as the
Michael (michka) Kaplan:
...
then the conversion will simply strip the errant characters. Note that
either solution meets the needs of refusal to interpret the errant
sequences.
Simply stripping the errant byte sequences means that they are
each interpreted as the empty string of characters.
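
To make the options being compared concrete, here is a small Python sketch of the three policies: raise an error, substitute U+FFFD, or strip. The wrapper `decode_with_policy` and its policy names are hypothetical; the `strict`, `replace` and `ignore` error handlers are the ones built into Python's codecs.

```python
def decode_with_policy(data: bytes, policy: str) -> str:
    """Decode UTF-8 using one of three illustrative error policies."""
    handlers = {
        "error": "strict",        # stop and raise UnicodeDecodeError
        "substitute": "replace",  # emit U+FFFD for each ill-formed sequence
        "strip": "ignore",        # drop the ill-formed bytes silently
    }
    return data.decode("utf-8", errors=handlers[policy])


sample = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8
stripped = decode_with_policy(sample, "strip")
substituted = decode_with_policy(sample, "substitute")
```

With `"strip"` the result is `"abcd"` and nothing shows that data was lost; with `"substitute"` it is `"ab\ufffdcd"`, which keeps a visible marker of the corruption.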
From: Mark Davis [EMAIL PROTECTED]
I agree with Kent that it is somewhat less robust to simply remove
ill-formed sequences, since it removes any indication that the data
was corrupted.
Nice that the API gives one the option to choose, huh? ;-)
The notion of continuing (even if one is
At 07:21 AM 3/2/03 -0800, Mark Davis wrote:
C12a When a process interprets a code unit sequence which
purports to be in a Unicode character encoding form, it
shall treat ill-formed code unit sequences as an error
condition, and shall not interpret such sequences as
Kenneth Whistler wrote:
Think of it this way. Does anyone expect the ASCII standard to tell,
in detail, what a process should or should not do if it receives
data which purports to be ASCII, but which contains an 0x80 byte
in it? All the ASCII standard can really do is tell you that
0x80 is not
Frank Tang responded to Kent Karlsson's response:
The problem I need to deal with is not how to GENERATE that UTF-8, but how to
handle the DATA when my code receives it. For example, when I receive a
10K UTF-8 file which has 1000 lines of text, if there is one UTF-8
sequence in line 990
Ken,
Hmm, is that true? Is it ok then, if I detect an unpaired surrogate, mutter
"oops, I have an error" and then drop that surrogate and continue processing
the file, resulting in a valid UTF-8 file?
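
As a rough illustration of the "detect, mutter, drop, continue" approach described in this question, here is a Python sketch. The helper name `encode_dropping_lone_surrogates` is hypothetical. It assumes the text is already held as a Python `str`, where any surrogate code point is necessarily unpaired, and it uses the built-in `ignore` error handler to drop them.

```python
def encode_dropping_lone_surrogates(text: str):
    """Encode to UTF-8, dropping surrogate code points and counting them.

    Returns (utf8_bytes, dropped_count) so the caller can still report
    that the input was damaged.
    """
    # Surrogate code points occupy U+D800..U+DFFF.
    dropped = sum(1 for ch in text if 0xD800 <= ord(ch) <= 0xDFFF)
    # A strict UTF-8 encoder refuses surrogates; "ignore" skips them.
    return text.encode("utf-8", errors="ignore"), dropped


out, dropped = encode_dropping_lone_surrogates("a\ud800b")
```

The output here is the well-formed byte string `b"ab"`, and the count of 1 is the only remaining trace that an error occurred.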
I thought for some reason this was prohibited, but if the standard does not
prescribe error
Tex Texin asked:
Hmm, is that true?
Yes, it is true. All the standard *mandates* is what I quoted
previously in this thread:
C12a When a process interprets a code unit sequence which purports
to be in a Unicode character encoding form, it shall treat
ill-formed code unit sequences
Kenneth Whistler wrote:
Yes, it is true. All the standard *mandates* is what I quoted
previously in this thread:
C12a When a process interprets a code unit sequence which purports
to be in a Unicode character encoding form, it shall treat
ill-formed code unit sequences as an