On 4/4/2017 8:00 PM, Gerriet M. Denkmann wrote:
On 4 Apr 2017, at 00:00, Asmus Freytag <[email protected]> wrote:

It is not possible to construct a set of secure network identifiers based on 
simply
a) ensuring the string is in NFC
b) otherwise allowing all of the Thai characters (insofar as they are 
PVALID in IDNA 2008 [RFC5892]).

Considerable attention to allowable contexts is required. There is a group in 
Thailand working on this, but their results have not yet been made public.
Maybe this: Proposal for the Thai Script Root Zone Label Generation Rulesets 
<https://www.icann.org/en/system/files/files/proposal-thai-lgr-15dec16-en.pdf>

Just as long as you understand that it's not final, even for the problem domain it intends to address.

But the rules for Root Zone Labels are (rightly) much more restricted than what 
I want:

One key difference is that the rules define a preferred ordering, and do not define a folding. Obviously, knowing a preferred ordering allows anyone to define a folding that results in that ordering.
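To illustrate how a folding can be derived from a preferred ordering, here is a minimal Python sketch. It assumes, purely for illustration, that the preferred order is "above/below vowel first, then tone mark"; the character ranges and the function name fold_to_preferred_order are my own choices and are not taken from the Thai LGR proposal.

```python
import re
import unicodedata

# Assumption (not from the LGR proposal): the preferred order is
# vowel sign first, tone mark second.
# Tone marks: U+0E48..U+0E4B. Vowel signs: U+0E31, U+0E34..U+0E39.
TONE_THEN_VOWEL = re.compile("([\u0E48-\u0E4B])([\u0E31\u0E34-\u0E39])")


def fold_to_preferred_order(name: str) -> str:
    """Apply NFC, then swap any tone-mark + vowel pair into vowel + tone-mark."""
    nfc = unicodedata.normalize("NFC", name)
    return TONE_THEN_VOWEL.sub(r"\2\1", nfc)


preferred = "\u0E17\u0E34\u0E49\u0E07"  # THO THAHAN, SARA I, MAI THO, NGO NGU
swapped = "\u0E17\u0E49\u0E34\u0E07"    # same letters, tone mark before the vowel

print(fold_to_preferred_order(swapped) == preferred)
```

A string already in the assumed preferred order passes through unchanged, so the function is idempotent, which is what one would want from a folding.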

Another generic difference between an LGR for network identifiers (Root Zone or otherwise) and filenames is that an LGR will tend to disallow pathological combinations, even if they are in an unambiguous order. "Pathological" combinations are those that result in unpredictable rendering - not just for a few isolated fonts, but across the board.

I would argue that for complex scripts, there may be a case for restricting filenames in a similar manner: expecting that any random combining sequence of unbounded length (up to the full filename) should be supported will surely lead to filenames that are impossible to tell apart; usually because they either do not get rendered in a sensible way, or things get clipped.

This may even be the case for combining sequences in general.
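One way such a restriction could be checked is sketched below in Python. The limit of two marks per base character is a hypothetical value chosen only for the example, and treating every character with a general category starting with "M" as a combining mark is a simplification, not a rule from any LGR.

```python
import unicodedata

MAX_MARKS = 2  # hypothetical limit, chosen only for illustration


def longest_mark_run(name: str) -> int:
    """Length of the longest run of consecutive marks (Mn, Mc, Me) after NFC."""
    longest = run = 0
    for ch in unicodedata.normalize("NFC", name):
        if unicodedata.category(ch).startswith("M"):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def has_reasonable_marks(name: str, limit: int = MAX_MARKS) -> bool:
    """True if no run of marks in the name exceeds the limit."""
    return longest_mark_run(name) <= limit


ordinary = "\u0E17\u0E34\u0E49\u0E07"   # one vowel sign plus one tone mark
stacked = "\u0E01" + "\u0E49" * 5       # five tone marks on one consonant

print(has_reasonable_marks(ordinary), has_reasonable_marks(stacked))
```

A real rule set would be script-specific and context-aware; a bare length limit only catches the crudest cases.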

LGRs, and the Root Zone LGR in particular, go one step further: they tend to explicitly exclude characters that are obsolete, rare, historic, special use, and so on. This is done for two main reasons: to keep the resulting names recognizable to the majority of users and to avoid the kinds of problems introduced by these characters.

For example, for Arabic, the consensus seems to be that for domain names, one really doesn't want to support the combining marks. They are not needed there, unlike general text, and only lead to a bewildering host of non-normalizable dual representations, for which otherwise a folding would have to be defined.

Finally, LGRs have some features that go beyond having a clean and focused repertoire and a defined ordering: they also handle the cases where two strings look identical, but neither can be construed as "preferred". In an LGR these strings can be made "mutually exclusive" using the blocked variant mechanism (see RFC 7940). Some file systems have rudimentary forms of this, for example those that are case-preserving but not case-sensitive. Once a filename is used, its "variant" can no longer be added, but there's no a priori folding into a preferred form.

Other than performance, perhaps, there's no reason a file system's valid file name space couldn't be described via RFC 7940. (Even with the full features of RFC 7940, collision checking can be implemented as an O(1) process for each new file name to be added to a folder.) In addition to NFC, some additional foldings might be supplied to transform user input to valid file names (from case folding to some more complex folding like the one you are discussing). As with case-insensitive, non-preserving file systems, adding such foldings would return file names that can be different from the ones the user specified.
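As a rough illustration of the constant-time collision check, here is a Python sketch of a case-preserving folder that blocks variants. It assumes the variant relation is an equivalence relation, so that each class can be represented by a single key; the Folder class and its collision_key function (NFC plus Unicode case folding) are hypothetical stand-ins, not the RFC 7940 machinery.

```python
import unicodedata


class Folder:
    """Stores names as given, but blocks any name whose key is already taken."""

    def __init__(self):
        self._by_key = {}  # collision key -> name as first stored

    @staticmethod
    def collision_key(name: str) -> str:
        # Hypothetical key: NFC, case folding, then NFC again.
        folded = unicodedata.normalize("NFC", name).casefold()
        return unicodedata.normalize("NFC", folded)

    def add(self, name: str) -> str:
        key = self.collision_key(name)
        if key in self._by_key:  # dict lookup, constant time on average
            raise FileExistsError(
                f"{name!r} is blocked by existing name {self._by_key[key]!r}"
            )
        self._by_key[key] = name  # the name itself is preserved, not folded
        return name

    def names(self):
        return sorted(self._by_key.values())


folder = Folder()
folder.add("ReadMe.txt")
try:
    folder.add("README.TXT")
except FileExistsError as err:
    print(err)
print(folder.names())
```

Note that the stored name is never rewritten: the key is used only to decide whether a new name is blocked, which mirrors the "mutually exclusive, but no preferred form" behaviour described above.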

Again, whether or not you supply a folding is separate from defining a preferred ordering. For the latter, you might start with the work the Thai Generation Panel has been doing, so that valid network identifiers can immediately be valid file names.

A./

Any two strings which look (almost?) identical should be normalised into some 
canonical form.
Reason: not to have identical looking filenames in a filesystem.
With the current rules of normalisation there could be 8 different filenames 
all looking identical to “กินครึ่งทิ้งครึ่ง”.
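One plausible reading of the count of 8, sketched in Python: the phrase has three syllables that carry both an above vowel and a tone mark, and each pair can be stored in either order, giving 2 x 2 x 2 spellings. The segmentation below is my own assumption about how the count arises.

```python
import itertools
import unicodedata

PREFIX = "\u0E01\u0E34\u0E19"  # first syllable, no tone mark

# (text before the marks, above vowel, tone mark, text after)
SYLLABLES = [
    ("\u0E04\u0E23", "\u0E36", "\u0E48", "\u0E07"),  # SARA UE + MAI EK
    ("\u0E17", "\u0E34", "\u0E49", "\u0E07"),        # SARA I + MAI THO
    ("\u0E04\u0E23", "\u0E36", "\u0E48", "\u0E07"),  # SARA UE + MAI EK
]


def variants():
    """Yield every spelling obtained by swapping vowel and tone mark per syllable."""
    for swaps in itertools.product((False, True), repeat=len(SYLLABLES)):
        parts = [PREFIX]
        for (head, vowel, tone, tail), swap in zip(SYLLABLES, swaps):
            marks = tone + vowel if swap else vowel + tone
            parts.append(head + marks + tail)
        yield "".join(parts)


# NFC does not reorder these pairs, because SARA I and SARA UE have
# canonical combining class 0.
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants()}
print(len(nfc_forms))
```

Under these assumptions the set should contain 8 distinct strings, i.e. NFC leaves every variant as a separate file name.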

E.g. :
- both NIKHAHIT + SARA AA and SARA AM should be normalised into the same 
string (whatever this is)
- both top-vowel + tone-mark and tone-mark + top-vowel should be normalised 
into the same string (whatever this is).
etc.
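The first case can be checked directly with Python's unicodedata module. The helper name same_under is my own; the behaviour follows from SARA AM (U+0E33) having only a compatibility decomposition into NIKHAHIT + SARA AA.

```python
import unicodedata


def same_under(form: str, a: str, b: str) -> bool:
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


with_sara_am = "\u0E17\u0E33"         # THO THAHAN + SARA AM
with_nikhahit = "\u0E17\u0E4D\u0E32"  # THO THAHAN + NIKHAHIT + SARA AA

print(same_under("NFC", with_sara_am, with_nikhahit))
print(same_under("NFKC", with_sara_am, with_nikhahit))
```

NFC keeps the two spellings apart, while NFKC merges them. NFKC also applies many unrelated compatibility mappings, so it is not necessarily a suitable folding for file names on that basis alone.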

If, as Richard Wordingham wrote, "Unicode combining classes cannot be changed. 
All that can be done is to enforce the order of characters in normalised 
text.", then the Unicode Normalisation algorithms should be updated.


Kind regards,

Gerriet.


