Dominikus Scherkl wrote:
Converting from and to utf-8 is an all-day topic, very important
for all applications handling with unicode. So it is a special
Converting text to/from UTF-8 is indeed common and important.

Converting text that claims to be UTF-8 - but isn't - is different: It may be a spoofing attempt, or bytes may have been lost, or the text may not be UTF-8 at all, etc. How to handle non-UTF-8 text in a from-UTF-8 converter seems to be a judgement call, and application-specific.
(How does the converter know _why_ there is an illegal sequence?)
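
As a rough illustration of why this is a policy decision rather than something the converter can settle on its own, here is a minimal Python sketch. The wrapper name decode_utf8 and the sample input are made up for illustration; the only real API used is the standard errors argument of bytes.decode.

```python
def decode_utf8(data: bytes, policy: str = "strict") -> str:
    """Decode bytes that claim to be UTF-8 under a caller-chosen policy.

    policy is handed to bytes.decode as the errors argument:
      "strict"  raises UnicodeDecodeError on any illegal sequence
      "replace" substitutes U+FFFD for the offending bytes
      "ignore"  silently drops the offending bytes
    """
    return data.decode("utf-8", errors=policy)


# Hypothetical input: valid UTF-8 whose last two bytes were lost,
# leaving a lone lead byte at the end.
truncated = "Grüße".encode("utf-8")[:-2]

lenient = decode_utf8(truncated, "replace")   # keeps a visible marker
dropped = decode_utf8(truncated, "ignore")    # hides the damage

try:
    decode_utf8(truncated, "strict")
    rejected = False
except UnicodeDecodeError:
    rejected = True                           # caller must decide what to do
```

The converter sees the same illegal bytes in all three cases; only the application knows whether rejecting, marking, or dropping them is acceptable.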

Additional I think we should have a standardized way to display
old utf-8 text without losing information (overlong utf-8 was
allowed for years) ...
ISO 10646 and the RFC never allowed generating overlong UTF-8. Unicode at least used to say "should not" for generation (but allowed decoding). Chances are nearly 100% that overlong UTF-8 was a spoofing attempt, or the result of something other than a UTF-8 encoder.
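
For readers unsure what "overlong" means in practice, the following is a small Python sketch, not a complete validator. The helper is_overlong and the table MIN_CODE_POINT are hypothetical names introduced here. It assumes the input is exactly one 2- to 4-byte sequence and only checks whether the encoded value would have fit in a shorter form.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000}

# Per length: (mask, expected lead-byte pattern, payload bits of lead byte).
LEAD_BYTE = {2: (0xE0, 0xC0, 0x1F), 3: (0xF0, 0xE0, 0x0F), 4: (0xF8, 0xF0, 0x07)}


def is_overlong(seq: bytes) -> bool:
    """Return True if seq is a single lead+continuation sequence whose
    code point could have been encoded in fewer bytes."""
    n = len(seq)
    if n not in LEAD_BYTE:
        return False  # 1-byte forms cannot be overlong; others are out of scope
    mask, pattern, payload = LEAD_BYTE[n]
    if seq[0] & mask != pattern:
        raise ValueError("lead byte does not match sequence length")
    cp = seq[0] & payload
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp < MIN_CODE_POINT[n]


# "/" (U+002F) smuggled in as two and three bytes: the classic spoofing case.
two_byte_slash = is_overlong(b"\xc0\xaf")
three_byte_slash = is_overlong(b"\xe0\x80\xaf")
# Shortest-form encodings produced by a real encoder are not overlong.
normal_e_acute = is_overlong("é".encode("utf-8"))
normal_euro = is_overlong("€".encode("utf-8"))

# Python's own strict decoder refuses the overlong form outright.
try:
    b"\xc0\xaf".decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
```

A decoder that accepted b"\xc0\xaf" as "/" would let that character slip past any check performed on the raw bytes, which is why such input is far more likely to be an attack than legacy data.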

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



