I had misremembered the CharModel on this point when I wrote: "One would have to have the additional requirement in the Character Model, that any XML parser that converts an XML document from a legacy character set into Unicode is not conformant unless it is (actually) normalizing." There is already that stipulation.
However, the following conditional is pointless:

> unless i) a normalizing transcoder cannot exist for that encoding

- It is always possible for a normalizing transcoder to exist, since it is always possible to combine a normalizer into a transcoder.
- And it is always possible to transcode from any other set into Unicode, using PUA code points in the unusual cases.

So taking John's original statement:

> > Documents not in UTF-* are normalized by definition, unless it is
> > *impossible* to convert them to normalized Unicode (typically
> > because they contain characters not yet present in Unicode).

According to the CharModel, it should be simplified to: "Documents not in UTF-* are normalized by definition."

The point I am concerned about, however, is that all of this seems to "define away" an issue, which is that there are transcoders out in the world that are not normalizing; parsers that use them will not produce the right results unless they normalize the text themselves.

Mark
—————
Γνῶθι σαυτόν — Θαλῆς [Know thyself — Thales]
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "François Yergeau" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: "w3c-i18n-ig" <[EMAIL PROTECTED]>
Sent: Thursday, February 21, 2002 07:51
Subject: Re: Unicode Search Engines

> Mark Davis wrote:
> > Simply saying that a document is "normalized by definition" if it is
> > *possible* to convert it to Unicode would ignore reality, since
> > converters may not *actually* convert it to normalized Unicode.
>
> And consequently that is not what the Character Model says. It says
> that legacy data is normalized if it is possible to convert it to
> *normalized* Unicode: "unless i) a normalizing transcoder cannot exist
> for that encoding".
> > One would have to have the additional requirement in the Character
> > Model, that any XML parser that converts an XML document from a
> > legacy character set into Unicode is not conformant unless it is
> > (actually) normalizing.
>
> This is what the Character Model actually says: "[I] Implementations
> which transcode text data from a legacy encoding to a Unicode encoding
> form MUST use a normalizing transcoder."
>
> Marco Cimarosti wrote:
> >> E.g., ISCII 0xCF + 0xE9 (LETTER RA + SIGN NUKTA) corresponds to
> >> Unicode U+0930 + U+093C (DEVANAGARI LETTER RA + DEVANAGARI SIGN
> >> NUKTA), which is not NFC: it should be U+0931 (DEVANAGARI LETTER
> >> RRA).
> >>
> >> What should the recipient do when it receives such an ISCII
> >> sequence? Refuse it because it is not normalized (ISCII itself also
> >> contains 0xD0, LETTER RRA), or "fix" it while converting it to
> >> Unicode?
>
> Fix it.
>
> --
> François Yergeau
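The "combine a normalizer into a transcoder" argument, and the ISCII case above, can be sketched in a few lines of Python. This is a minimal illustration, not a real ISCII converter: Python ships no ISCII codec, so the two-entry `ISCII_TO_UNICODE` table and the `normalizing_transcode` function are hypothetical, covering only the two bytes from Marco's example. The NFC step is the real point: chaining any transcoder with a normalizer yields a normalizing transcoder.

```python
import unicodedata

# Hypothetical two-entry mapping for the ISCII bytes discussed above;
# a real transcoder would cover the full ISCII repertoire.
ISCII_TO_UNICODE = {
    0xCF: "\u0930",  # ISCII LETTER RA  -> U+0930 DEVANAGARI LETTER RA
    0xE9: "\u093C",  # ISCII SIGN NUKTA -> U+093C DEVANAGARI SIGN NUKTA
}

def normalizing_transcode(data: bytes) -> str:
    """Transcode ISCII bytes to Unicode, then normalize to NFC.

    Composing any plain transcoder with a normalizer, as done here,
    produces a normalizing transcoder.
    """
    raw = "".join(ISCII_TO_UNICODE[b] for b in data)
    return unicodedata.normalize("NFC", raw)

# RA + NUKTA is not NFC; under NFC it composes to U+0931
# DEVANAGARI LETTER RRA, which is what the recipient should produce.
print(normalizing_transcode(b"\xCF\xE9") == "\u0931")  # True
```

A plain transcoder would stop after building `raw` and emit the non-NFC sequence <U+0930, U+093C>; the final `normalize` call is exactly the "fix it while converting" that François recommends.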

