Re: Traditional and Simplified Han in UTS 39

Asmus Freytag via Unicode Wed, 27 Dec 2017 21:28:42 -0800

The full excerpt from the UTS reads:

Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD<http://www.unicode.org/reports/tr39/#UCD>].
 1. The criterion can only be applied if the language of the string is
    known to be Chinese. So, for example, the string “写真だけの結婚式”
    is Japanese, and should not be marked as mixed script because of a
    mixture of S and T characters.
 2. Testing for whether a character is S or T needs to be based not on
    whether the character /has/ a S or T variant, but whether the
    character /is/ an S or T variant.
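
To make the criterion quoted above concrete, here is a minimal sketch of what such a check might look like. The two tables are tiny hand-picked illustrations, not Unihan data; a real implementation would presumably derive them from Unihan fields such as kSimplifiedVariant and kTraditionalVariant. The function name and the language_is_chinese parameter are assumptions made for this example.

```python
# Illustrative tables only (assumption): a real implementation would
# presumably derive these from the Unihan variant fields.
SIMPLIFIED_ONLY = frozenset("写国结")
TRADITIONAL_ONLY = frozenset("寫國結")


def is_mixed_s_and_t(text: str, language_is_chinese: bool) -> bool:
    """Apply the quoted criterion: flag text that mixes S and T characters."""
    # Point 1 of the excerpt: the test only applies to known-Chinese text.
    # The caller has to supply that knowledge; it is not derived here.
    if not language_is_chinese:
        return False
    # Point 2 of the excerpt: test whether a character *is* an S or T
    # variant, not whether it merely *has* one.
    has_s = any(ch in SIMPLIFIED_ONLY for ch in text)
    has_t = any(ch in TRADITIONAL_ONLY for ch in text)
    return has_s and has_t
```

With these tables, the Japanese string from the excerpt is flagged if the caller wrongly asserts that it is Chinese, because 写 is in the S table and 結 is in the T table. The outcome depends entirely on the language flag and on the tables.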


There are several issues with this.

First and foremost, the definition of S and T variants is not something that is universally agreed upon. The .cn, .hk or .tw registries are using a definition of S and T variants that does not agree with the Unihan data in many particulars. Therefore, using the Unihan data would result in false positives (and false negatives).

Second, there are many characters that are variants that are acceptable with either "S" or "T" labels. You only have to look at the published Label Generation Rulesets (or IDN tables) for these domains to see many examples. And, as mentioned above, you cannot reverse engineer these tables from Unihan data.

Third, the same domains mentioned have a policy of delegating up to three labels to the same applicant: a "traditional" label, a "simplified" label, and a mixed label matching the spelling of the label in the original application (for situations where a mixed label is appropriate). In other words, certain mixed labels are seen as appropriate.
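
As a rough illustration of the "up to three labels" policy, the following sketch derives the simplified and traditional forms of an applied-for label and keeps the original spelling alongside them. The variant pairs are a hypothetical one-to-one table made up for this example; actual registry tables are larger and are not one-to-one, and the function name is likewise an assumption.

```python
# Hypothetical (simplified, traditional) pairs; not taken from any
# registry's published table.
PAIRS = [("国", "國"), ("际", "際")]
TO_SIMPLIFIED = {t: s for s, t in PAIRS}
TO_TRADITIONAL = {s: t for s, t in PAIRS}


def allocatable_labels(applied: str) -> list:
    """Return the applied-for label plus its all-S and all-T forms."""
    simplified = "".join(TO_SIMPLIFIED.get(ch, ch) for ch in applied)
    traditional = "".join(TO_TRADITIONAL.get(ch, ch) for ch in applied)
    result = []
    # Keep the original spelling first, and drop duplicates so that a
    # pure S or pure T application yields only two labels.
    for label in (applied, simplified, traditional):
        if label not in result:
            result.append(label)
    return result
```

For a mixed application such as 国際 this yields three labels, one of which is itself mixed, which is exactly the case a blanket S/T mixing test would reject.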

Fourth, the Chinese ccTLDs all have a robust policy of preventing any other mixed label that is a variant of the three from being allocated to an unrelated party. If you "know" that the language has to be Chinese, because the domain is a ccTLD, then at the same time the check is superfluous. Other registries are not known to have similar policies, so for them additional spoof detection may be useful; however, it is precisely in those cases that it is impossible to know whether a label is intended to be in the Chinese language.

Fifth, generally the only thing that can be ascertained is that a label is *not* in Chinese: by virtue of having Kana or Hangul characters mixed in. However, the reverse is not true. You will find labels registered under .jp that do not contain Hiragana or Katakana.
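
The one-directional nature of this test can be sketched as follows. The function name is an assumption for this example, and matching on Unicode character names is just one simple way to spot Kana and Hangul (it does not catch halfwidth Katakana, for instance).

```python
import unicodedata

# Character-name prefixes that indicate Japanese Kana or Korean Hangul.
NON_CHINESE_PREFIXES = ("HIRAGANA", "KATAKANA", "HANGUL")


def is_definitely_not_chinese(label: str) -> bool:
    """Return True if the label contains any Kana or Hangul character.

    A False result does NOT mean the label is Chinese: a Han-only label
    may still be Japanese or Korean.
    """
    return any(
        unicodedata.name(ch, "").startswith(NON_CHINESE_PREFIXES)
        for ch in label
    )
```

A Han-only label such as 東京 returns False here, which only means "unknown", not "Chinese".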

Sixth, for zones that are shared by different CJK languages, the state of the art is to have a coordinated policy that prevents "random" variant labels from coexisting in the registry. An example of this kind of effort is being developed for the root zone. By definition, for the root zone, there is no implied information about the language context, unlike the case for the second level, where the presence of a ccTLD in the full domain name may give a clue.

Seventh, attempting to determine whether a label is potentially valid based on variant data (of any kind) is doomed, because actual usage is not limited to "pure" labels. The variant mechanism is something that works differently (in those registries that apply it): instead of looking at a single label, the registry can implement "mutual exclusion". Once one variant label from a given set has been delegated, all others are excluded (or in practice, all but three, which are limited to the same applicant). Without access to the registry data, you cannot predict which variants in a set are the "good ones", and with access to the data, spoof labels are rejected and cannot be registered.
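
A minimal sketch of mutual exclusion, under simplifying assumptions: the variant table, the Registry class, and the flat cap of three labels per set are all hypothetical, and real registries restrict which three labels may be delegated rather than accepting any three.

```python
# Hypothetical variant data: each character maps to one representative
# of its variant set. Not taken from any registry's table.
CANONICAL = {"國": "国", "際": "际"}
MAX_LABELS_PER_SET = 3


class Registry:
    """Toy registry that enforces mutual exclusion across variant labels."""

    def __init__(self):
        # variant-set key -> (applicant, set of labels delegated so far)
        self._sets = {}

    @staticmethod
    def _key(label):
        # All variant labels of one set collapse to the same key.
        return "".join(CANONICAL.get(ch, ch) for ch in label)

    def register(self, label, applicant):
        key = self._key(label)
        if key not in self._sets:
            self._sets[key] = (applicant, {label})
            return True
        owner, labels = self._sets[key]
        if owner != applicant:
            return False  # set already held by another party
        if label not in labels and len(labels) >= MAX_LABELS_PER_SET:
            return False  # same applicant, but the cap is reached
        labels.add(label)
        return True
```

Whether a given label is accepted depends on the registry's state (who already holds the set), not on any property of the label itself, which is why an outside observer cannot make the same decision from variant data alone.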

In conclusion, my recommendation would be to retract this particular passage.


A./

On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote:

In UTS 39, it says that, optionally,
"Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].
"The criterion can only be applied if the language of the string is known to be Chinese."
What does it mean for the language to "be known to be Chinese"? Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD?
The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese. So in this example we can algorithmically rule out that it's Chinese.
And what does Chinese really mean here?
