Re: Combining Class of Thai Nonspacing_Marks

Asmus Freytag (c) Wed, 05 Apr 2017 02:56:34 -0700

On 4/4/2017 8:00 PM, Gerriet M. Denkmann wrote:

On 4 Apr 2017, at 00:00, Asmus Freytag <[email protected]> wrote:


It is not possible to construct a set of secure network identifiers based 
simply on
a) ensuring the string is in NFC
b) otherwise allowing all of the Thai characters (insofar as they are 
PVALID in IDNA 2008 [RFC5892]).

Considerable attention to allowable contexts is required. There is a group in 
Thailand working on this, but their results have not yet been made public.

Maybe this: Proposal for the Thai Script Root Zone Label Generation Rulesets 
<https://www.icann.org/en/system/files/files/proposal-thai-lgr-15dec16-en.pdf>

Just as long as you understand that it's not final, even for the problem domain it intends to address.


But the rules for Root Zone Labels are (rightly) much more restricted than what 
I want:

One key difference is that the rules define a preferred ordering, and do not define a folding. Obviously, knowing a preferred ordering allows anyone to define a folding that results in that ordering.
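
To make that concrete, here is a minimal Python sketch of such a folding. It assumes, purely for illustration, that the preferred order is top vowel followed by tone mark; the two character sets and the name `fold_to_preferred_order` are hypothetical and are not taken from the Thai LGR proposal.

```python
import re

# Assumed, illustrative character sets (not copied from the Thai LGR proposal).
TOP_VOWELS = "\u0E31\u0E34\u0E35\u0E36\u0E37"  # MAI HAN-AKAT, SARA I, SARA II, SARA UE, SARA UEE
TONE_MARKS = "\u0E48\u0E49\u0E4A\u0E4B"        # MAI EK, MAI THO, MAI TRI, MAI CHATTAWA

# Matches a tone mark immediately followed by a top vowel.
_SWAPPED = re.compile(f"([{TONE_MARKS}])([{TOP_VOWELS}])")

def fold_to_preferred_order(s: str) -> str:
    """Rewrite every tone-mark + top-vowel pair as top-vowel + tone-mark."""
    return _SWAPPED.sub(r"\2\1", s)

preferred = "\u0E17\u0E34\u0E49\u0E07"  # THO THAHAN, SARA I, MAI THO, NGO NGU
swapped = "\u0E17\u0E49\u0E34\u0E07"    # same characters, tone mark before vowel

assert preferred != swapped
assert fold_to_preferred_order(swapped) == preferred
assert fold_to_preferred_order(preferred) == preferred  # already in preferred order
```

A real folding would have to cover every pair of marks that the ordering rules constrain, not just this one.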

Another generic difference between an LGR for network identifiers (Root Zone or otherwise) and filenames is that an LGR will tend to disallow pathological combinations, even if they are in an unambiguous order. "Pathological" combinations are those that result in unpredictable rendering - not just for a few isolated fonts, but across the board.

I would argue that for complex scripts, there may be a case for restricting filenames in a similar manner: expecting that any random combining sequence of unbounded length (up to the full filename) should be supported will surely lead to filenames that are impossible to tell apart; usually because they either do not get rendered in a sensible way, or things get clipped.


This may even be the case for combining sequences in general.

LGRs, and the Root Zone LGR in particular, go one step further: they tend to explicitly exclude characters that are obsolete, rare, historic, special use, and so on. This is done for two main reasons: to keep the resulting names recognizable to the majority of users and to avoid the kinds of problems introduced by these characters.

For example, for Arabic, the consensus seems to be that for domain names, one really doesn't want to support the combining marks. They are not needed there, unlike general text, and only lead to a bewildering host of non-normalizable dual representations, for which otherwise a folding would have to be defined.

Finally, LGRs have some features that go beyond having a clean and focused repertoire and a defined ordering: those are the cases where two strings look identical, but neither can be construed as "preferred". In an LGR these strings can be made "mutually exclusive" using the blocked variant mechanism (see RFC 7940). Some file systems have rudimentary forms of this, for example those that are case-preserving but not case-sensitive. Once a filename is used, its "variant" can no longer be added, but there's no a priori folding into a preferred form.

Other than performance, perhaps, there's no reason a file system's valid file name space couldn't be described via RFC 7940. (Even with the full features of RFC 7940, collision checking can be implemented as an O(1) process for each new file name to be added to a folder.) In addition to NFC, some additional foldings might be supplied to transform user input to valid file names (from case folding to some more complex folding like the one you are discussing). Like case-insensitive, non-preserving file systems, adding such foldings would return file names that can be different from the ones the user specified.
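
As a rough illustration of the collision check, here is a minimal Python sketch of a case-preserving, case-insensitive folder. The names `variant_key` and `Folder` are hypothetical, and NFC plus case folding is only a stand-in for a real RFC 7940 variant computation. Each add costs one hash lookup, so the work per new name does not grow with the number of names already in the folder.

```python
import unicodedata

def variant_key(name: str) -> str:
    # Stand-in for an LGR variant computation (assumption): NFC, then case folding.
    return unicodedata.normalize("NFC", name).casefold()

class Folder:
    """Case-preserving, case-insensitive folder with blocked variants."""

    def __init__(self):
        self._by_key = {}  # variant key -> name as stored

    def add(self, name: str) -> str:
        key = variant_key(name)
        if key in self._by_key:  # single hash lookup per new name
            raise FileExistsError(
                f"{name!r} collides with {self._by_key[key]!r}")
        stored = unicodedata.normalize("NFC", name)  # case is preserved
        self._by_key[key] = stored
        return stored

folder = Folder()
folder.add("Report.txt")
try:
    folder.add("REPORT.TXT")  # variant of an existing name: blocked
    collided = False
except FileExistsError:
    collided = True
assert collided
```

Note that the stored name keeps the user's spelling; only the lookup key is folded, which matches the "no a priori folding into a preferred form" behaviour described above.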

Again, whether or not you supply a folding is separate from defining a preferred ordering. For the latter, you might start with the work the Thai Generation Panel has been doing, so that valid network identifiers can immediately be valid file names.

A./


Any two strings which look (almost?) identical should be normalised into some 
canonical form.
Reason: not to have identical looking filenames in a filesystem.
With the current rules of normalisation there could be 8 different filenames 
all looking identical to “กินครึ่งทิ้งครึ่ง”.

E.g. :
- both NIKHAHIT + Sara Aa and Sara Am should be normalised into the same 
string (whatever this is)
- both top-vowel + tone-mark and tone-mark + top-vowel should be normalised 
into the same string (whatever this is).
etc.
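
Both cases can be checked with Python's standard `unicodedata` module. This is only a small sketch of what NFC currently does with these sequences; the constant names and the helper `nfc` are made up for the example.

```python
import unicodedata

NIKHAHIT, SARA_AA, SARA_AM = "\u0E4D", "\u0E32", "\u0E33"
THO_THAHAN, SARA_I, MAI_THO = "\u0E17", "\u0E34", "\u0E49"

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# SARA AM has only a compatibility decomposition, so NFC keeps the
# two spellings distinct; only NFKC/NFKD map SARA AM to NIKHAHIT + SARA AA.
assert nfc(NIKHAHIT + SARA_AA) != nfc(SARA_AM)
assert unicodedata.normalize("NFKC", SARA_AM) == NIKHAHIT + SARA_AA

# SARA I has combining class 0 and MAI THO has class 107. A class-0
# character blocks canonical reordering, so NFC leaves both orders as they are.
assert unicodedata.combining(SARA_I) == 0
assert unicodedata.combining(MAI_THO) == 107
assert nfc(THO_THAHAN + MAI_THO + SARA_I) != nfc(THO_THAHAN + SARA_I + MAI_THO)
```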

If, as Richard Wordingham wrote, "Unicode combining classes cannot be changed. 
All that can be done is
to enforce the order of characters in normalised text.", then the Unicode 
Normalisation algorithms should be updated.


Kind regards,

Gerriet.
