Paul Hastings wrote:

> i suppose this is a really simple minded question but is
> there any way of telling if an incoming chunk of text
> (say from a browser form) is traditional or simplified
> chinese?

Please note that the classification you want is not always meaningful. E.g., what if the incoming text is in Spanish? Would you classify it as traditional or simplified Chinese?...

Anyway. You can obtain the base data for each Chinese character from the file http://www.unicode.org/Public/UNIDATA/Unihan.txt, by checking for the existence of the fields <kSimplifiedVariant> and <kTraditionalVariant>.

Any Unicode character falls into one of these four categories:

0) All characters not listed in Unihan.txt (i.e., non-Chinese characters) are *neither* "Traditional" nor "Simplified";

1) All characters having <kSimplifiedVariant> but *no* <kTraditionalVariant> are "Traditional";

2) All characters having <kTraditionalVariant> but *no* <kSimplifiedVariant> are "Simplified";

3) All other characters listed in Unihan.txt are *both* "Traditional" and "Simplified".

From these character-level categories, you can assign a category to the input stream:

    If at least one character has category 1 AND at least one character has category 2, then:
        stream is both "Traditional" and "Simplified" (category 3);
    Else, if at least one character has category 1, then:
        stream is "Traditional" (category 1);
    Else, if at least one character has category 2, then:
        stream is "Simplified" (category 2);
    Else, if at least one character has category 3, then:
        stream is both "Traditional" and "Simplified" (category 3 again);
    Else (all characters have category 0):
        stream is neither "Traditional" nor "Simplified" (category 0);
    End.

Anyway, I don't see how this information could be of any use for any purpose...

_ Marco
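
For anyone who wants to try the rules above, here is a rough Python sketch of them. It is an illustration under assumptions, not a tested tool: it assumes the Unihan data arrives as tab-separated lines of the form `U+XXXX<TAB>field<TAB>value`, and the names `Category`, `load_unihan`, `classify_char`, `classify_stream` and `SAMPLE_LINES` are all made up for this example. The sample lines are hand-written stand-ins for the real file.

```python
from enum import IntEnum


class Category(IntEnum):
    NEITHER = 0      # not listed in the Unihan data
    TRADITIONAL = 1  # has kSimplifiedVariant only
    SIMPLIFIED = 2   # has kTraditionalVariant only
    BOTH = 3         # listed, with both fields or with neither


def load_unihan(lines):
    """Collect three sets: listed characters, those with a
    kSimplifiedVariant field, and those with a kTraditionalVariant field.
    Assumes lines look like 'U+56FD<TAB>kTraditionalVariant<TAB>U+570B'."""
    listed, has_simp, has_trad = set(), set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a data line in the assumed layout
        ch = chr(int(parts[0][2:], 16))  # drop the 'U+' prefix
        listed.add(ch)
        if parts[1] == "kSimplifiedVariant":
            has_simp.add(ch)
        elif parts[1] == "kTraditionalVariant":
            has_trad.add(ch)
    return listed, has_simp, has_trad


def classify_char(ch, tables):
    """Character-level categories 0 to 3, as listed in the post."""
    listed, has_simp, has_trad = tables
    if ch not in listed:
        return Category.NEITHER
    if ch in has_simp and ch not in has_trad:
        return Category.TRADITIONAL
    if ch in has_trad and ch not in has_simp:
        return Category.SIMPLIFIED
    return Category.BOTH


def classify_stream(text, tables):
    """Stream-level category, following the If/Else chain in the post."""
    cats = {classify_char(ch, tables) for ch in text}
    if Category.TRADITIONAL in cats and Category.SIMPLIFIED in cats:
        return Category.BOTH
    if Category.TRADITIONAL in cats:
        return Category.TRADITIONAL
    if Category.SIMPLIFIED in cats:
        return Category.SIMPLIFIED
    if Category.BOTH in cats:
        return Category.BOTH
    return Category.NEITHER


# Hand-written sample in the assumed layout; a real run would read the
# Unihan data file instead.
SAMPLE_LINES = [
    "# illustrative sample, not the real file",
    "U+4E2D\tkDefinition\tcentral; center, middle",
    "U+56FD\tkTraditionalVariant\tU+570B",
    "U+570B\tkSimplifiedVariant\tU+56FD",
]
TABLES = load_unihan(SAMPLE_LINES)

print(classify_stream("中国", TABLES).name)
print(classify_stream("中國", TABLES).name)
print(classify_stream("hola", TABLES).name)
```

Keeping the per-character step separate from the stream step mirrors the two stages described above, and it makes it easy to swap in a different stream rule (for example, a majority vote) if the strict "at least one character" rule turns out to be too sensitive to stray characters.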

