> Working in Java, which of the commonly supported character encoding might
> have non-normalizing transcoders. And, with the transcoders shipped with
> Java which are non-normalizing?

Any code page with combining marks will generally not have a normalizing
transcoder. That includes the Unicode encoding forms themselves, ISO Latin
Arabic (ISO-8859-6), Windows Thai (cp874), etc.

To check converters in detail, the easiest thing to do is to

- take each converter;
- convert all Unicode characters to it and back (sifting out the rejects);
- check the result with QuickCheck as described in TR#15*.
- If there are any NO or MAYBEs, then the converter is non-normalizing.

* There is an implementation of QuickCheck in ICU4J
(http://oss.software.ibm.com/icu/), in the current 2.2 snapshot (2.2 final
will be released this summer, but I think the snapshot should do the trick).

Mark
__________
http://www.macchiato.com
◄ “Eppur si muove” ►

----- Original Message -----
From: "Jeremy Carroll" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, June 24, 2002 01:52
Subject: Normalizing Transcoders

>
> The W3C Character Model Working Draft [1] defines the concept of normalizing
> transcoder. This is a transcoder from a non-UCS based encoding to Unicode
> whose output is in NFC.
>
> Working in Java, which of the commonly supported character encoding might
> have non-normalizing transcoders. And, with the transcoders shipped with
> Java which are non-normalizing?
>
> (e.g. I suspect for ASCII it is impossible to write a non-normalizing
> transcoder, I don't know iso-8859-1 backwards, but also get the impression
> that all actual transcoders will be normalizing).
>
> To bound the issue my ambitions do not stretch beyond those encoding
> supported by Xerces-J.
> These are listed as:
>
> UTF-8
> UTF-16 Big Endian, UTF-16 Little Endian
> IBM-1208
> ISO Latin-1 (ISO-8859-1)
> ISO Latin-2 (ISO-8859-2) [Bosnian, Croatian, Czech, Hungarian, Polish,
> Romanian, Serbian (in Latin transcription), Serbocroatian, Slovak,
> Slovenian, Upper and Lower Sorbian]
> ISO Latin-3 (ISO-8859-3) [Maltese, Esperanto]
> ISO Latin-4 (ISO-8859-4)
> ISO Latin Cyrillic (ISO-8859-5)
> ISO Latin Arabic (ISO-8859-6)
> ISO Latin Greek (ISO-8859-7)
> ISO Latin Hebrew (ISO-8859-8)
> ISO Latin-5 (ISO-8859-9) [Turkish]
> Extended Unix Code, packed for Japanese (euc-jp, eucjis)
> Japanese Shift JIS (shift-jis)
> Chinese (big5)
> Chinese for PRC (mixed 1/2 byte) (gb2312)
> Japanese ISO-2022-JP (iso-2022-jp)
> Cyrillic (koi8-r)
> Extended Unix Code, packed for Korean (euc-kr)
> Russian Unix, Cyrillic (koi8-r)
> Windows Thai (cp874)
> Latin 1 Windows (cp1252) (and all other cp125? encodings recognized by IANA)
> cp858
> EBCDIC encodings:
>   EBCDIC US (ebcdic-cp-us)
>   EBCDIC Canada (ebcdic-cp-ca)
>   EBCDIC Netherland (ebcdic-cp-nl)
>   EBCDIC Denmark (ebcdic-cp-dk)
>   EBCDIC Norway (ebcdic-cp-no)
>   EBCDIC Finland (ebcdic-cp-fi)
>   EBCDIC Sweden (ebcdic-cp-se)
>   EBCDIC Italy (ebcdic-cp-it)
>   EBCDIC Spain, Latin America (ebcdic-cp-es)
>   EBCDIC Great Britain (ebcdic-cp-gb)
>   EBCDIC France (ebcdic-cp-fr)
>   EBCDIC Hebrew (ebcdic-cp-he)
>   EBCDIC Switzerland (ebcdic-cp-ch)
>   EBCDIC Roece (ebcdic-cp-roece)
>   EBCDIC Yugoslavia (ebcdic-cp-yu)
>   EBCDIC Iceland (ebcdic-cp-is)
>   EBCDIC Urdu (ebcdic-cp-ar2)
>   Latin 0 EBCDIC
>   EBCDIC Arabic (ebcdic-cp-ar1)
>
> Jeremy
>
> [1] Charmod
> http://www.w3.org/TR/charmod
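
The converter check described in the reply (round-trip every Unicode
character through a converter, drop the rejects, then test what comes back)
could be sketched in Java roughly as follows. This is only an illustration
under several assumptions: it uses the JDK's own java.nio.charset converters
and java.text.Normalizer (JDK 6 and later) instead of the ICU4J QuickCheck
mentioned in the reply; it does a full NFC test per character, and treats
"the character is a combining mark" as a rough stand-in for a MAYBE result;
and the names TranscoderCheck, Result, check and looksNormalizing are
hypothetical. Because it only looks at single characters, it says nothing
about multi-character sequences.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.text.Normalizer;

final class TranscoderCheck {

    /** Counts gathered for one charset (hypothetical helper type). */
    static final class Result {
        int roundTripped;    // characters the charset could encode and decode again
        int notNfc;          // decoded text is not in NFC (a "NO"-like result)
        int combiningMarks;  // decoded text has a combining mark (rough "MAYBE" stand-in)

        boolean looksNormalizing() {
            return notNfc == 0 && combiningMarks == 0;
        }
    }

    static boolean containsCombiningMark(String s) {
        return s.codePoints().anyMatch(cp -> {
            int type = Character.getType(cp);
            return type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
        });
    }

    static Result check(Charset charset) {
        Result result = new Result();
        if (!charset.canEncode()) {
            return result; // decode-only charsets cannot be round-tripped
        }
        // REPORT makes unmappable characters throw instead of being replaced,
        // which is how the rejects get sifted out.
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // lone surrogates are not characters
            if (!Character.isDefined(cp)) continue;     // skip unassigned code points
            String original = new String(Character.toChars(cp));
            String decoded;
            try {
                ByteBuffer bytes = encoder.encode(CharBuffer.wrap(original));
                decoded = decoder.decode(bytes).toString();
            } catch (CharacterCodingException e) {
                continue; // reject: not representable in this charset
            }
            result.roundTripped++;
            if (!Normalizer.isNormalized(decoded, Normalizer.Form.NFC)) {
                result.notNfc++;
            } else if (containsCombiningMark(decoded)) {
                result.combiningMarks++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"US-ASCII", "ISO-8859-1", "ISO-8859-6", "UTF-8"};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JDK");
                continue;
            }
            Result r = check(Charset.forName(name));
            System.out.println(name + ": round-tripped=" + r.roundTripped
                    + " notNfc=" + r.notNfc
                    + " combiningMarks=" + r.combiningMarks
                    + " looksNormalizing=" + r.looksNormalizing());
        }
    }
}
```

A charset for which looksNormalizing() returns false is only a candidate for
being non-normalizing; confirming it would still need the TR#15 check on
real data, since this sketch does not examine how characters interact in
sequence.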

