The full excerpt from the UTS reads:

Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD <http://www.unicode.org/reports/tr39/#UCD>].

 1. The criterion can only be applied if the language of the string is
    known to be Chinese. So, for example, the string “写真だけの結婚式”
    is Japanese, and should not be marked as mixed script because of a
    mixture of S and T characters.
 2. Testing for whether a character is S or T needs to be based not on
    whether the character /has/ an S or T variant, but whether the
    character /is/ an S or T variant.


There are several issues with this.

First and foremost, the definition of S and T variants is not universally agreed upon. The .cn, .hk and .tw registries use a definition of S and T variants that disagrees with the Unihan data in many particulars. Therefore, using the Unihan data would result in both false positives and false negatives.

Second, there are many variant characters that are acceptable in both "S" and "T" labels. You only have to look at the published Label Generation Rulesets (or IDN tables) for these domains to see many examples. And, as mentioned above, you cannot reverse engineer these tables from Unihan data.

Third, the domains just mentioned have a policy of delegating up to three labels to the same applicant: a "traditional" label, a "simplified" label, and a mixed label matching the spelling of the label in the original application (for situations where a mixed label is appropriate). In other words, certain mixed labels are seen as appropriate.

Fourth, the Chinese ccTLDs all have a robust policy of preventing any other mixed label that is a variant of the three from being allocated to an unrelated party. If you "know" that the language has to be Chinese because the domain is a ccTLD, then by the same token the check is superfluous. Other registries are not known to have similar policies, so for them additional spoof detection may be useful; however, it is precisely in those cases that it is impossible to know whether a label is intended to be in the Chinese language.

Fifth, generally the only thing that can be ascertained is that a label is *not* in Chinese, by virtue of having Kana or Hangul characters mixed in. However, the converse does not hold: the absence of such characters does not make a label Chinese. You will find labels registered under .jp that contain no Hiragana or Katakana.
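
To illustrate how one-sided that test is, here is a minimal Python sketch. The function name rules_out_chinese is made up for this example, and matching on character-name prefixes is only a rough stand-in for the Unicode Script property (it misses halfwidth Katakana, for instance), so treat it as an approximation under those assumptions rather than a recommended implementation.

```python
import unicodedata

# Character-name prefixes used here as a rough stand-in for the Script
# property (assumption: the standard library exposes no script lookup).
NON_CHINESE_PREFIXES = ("HIRAGANA", "KATAKANA", "HANGUL")


def rules_out_chinese(label: str) -> bool:
    """Return True if the label contains Kana or Hangul.

    True means the label is not purely Chinese. False means nothing
    can be concluded: the label may be Chinese, Japanese, or Korean.
    """
    return any(
        unicodedata.name(ch, "").startswith(NON_CHINESE_PREFIXES)
        for ch in label
    )
```

The UTS example “写真だけの結婚式” is ruled out because of its Hiragana, while a Kanji-only label such as “東京” passes through undetected even though it could well be registered under .jp.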

Sixth, for zones that are shared by different CJK languages, the state of the art is to have a coordinated policy that prevents "random" variant labels from coexisting in the registry. An example of this kind of effort is being developed for the root zone. By definition, for the root zone, there is no implied information about the language context, unlike the case for the second level, where the presence of a ccTLD in the full domain name may give a clue.

Seventh, attempting to determine whether a label is potentially valid based on variant data (of any kind) is doomed, because actual usage is not limited to "pure" labels. The variant mechanism is something that works differently (in those registries that apply it): instead of looking at a single label, the registry can implement "mutual exclusion". Once one variant label from a given set has been delegated, all others are excluded (or in practice, all but three, which are limited to the same applicant). Without access to the registry data, you cannot predict which variants in a set are the "good ones", and with access to the data, spoof labels are rejected and cannot be registered.
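
The mutual exclusion idea can be sketched roughly as follows, in Python. Everything here is hypothetical: the two-entry variant table, the ToyRegistry class, and the limit constant are invented for illustration and are not taken from any registry's actual Label Generation Rules or software.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy variant table: each character maps to one representative
# of its variant set. Real registry tables are far larger and differ in detail.
VARIANT_REPRESENTATIVE: Dict[str, str] = {"国": "國", "东": "東"}

# Assumed limit, following the "all but three" practice described above.
MAX_LABELS_PER_SET = 3


def variant_key(label: str) -> str:
    """Map a label to a key shared by all of its variant labels."""
    return "".join(VARIANT_REPRESENTATIVE.get(ch, ch) for ch in label)


class ToyRegistry:
    def __init__(self) -> None:
        # variant key -> (applicant, labels delegated from that set)
        self._sets: Dict[str, Tuple[str, Set[str]]] = {}

    def register(self, label: str, applicant: str) -> bool:
        """Delegate the label if mutual exclusion allows it."""
        key = variant_key(label)
        if key not in self._sets:
            # First label from this variant set: it claims the whole set.
            self._sets[key] = (applicant, {label})
            return True
        owner, labels = self._sets[key]
        if owner != applicant:
            return False  # the set belongs to someone else
        if label in labels or len(labels) >= MAX_LABELS_PER_SET:
            return False  # already delegated, or per-set limit reached
        labels.add(label)
        return True
```

Note that the decision never asks whether a label is "pure" S or T. It depends only on which variant set the label falls into and who already holds that set, which is information that exists only inside the registry.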

In conclusion, my recommendation would be to retract this particular passage.

A./

On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote:
UTS 39 says that, optionally,

"Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].

"The criterion can only be applied if the language of the string is known to be Chinese."

What does it mean for the language to "be known to be Chinese"? Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD?

The example given shows some Hiragana in the text.  That clearly indicates the language isn't Chinese.  So in this example we can algorithmically rule out that it's Chinese.

And what does Chinese really mean here?


