On Tue, 10 Jan 2017 17:25:06 -0800 Asmus Freytag <[email protected]> wrote:
> On 1/10/2017 2:54 PM, Richard Wordingham wrote:
>
> There are many different tacks that can be taken to make spoofing
> more difficult.
>
> Among them, for critical identifiers:
> 1) allow only a restricted repertoire
> 2) disallow certain sequences
> 3) use a registry and
>    3a) define sets of labels that overlap (variant sets)
>    3b) restrict actual labels to be in disjoint sets
>        (one label blocks all others in the same variant set)
>
> The ICANN work on creating label generation rules attempts to
> implement these strategies (currently for 28 scripts in the Root Zone
> of the DNS). The work on the first half dozen scripts is basically
> completed.
>
> > The Unicode standard does define what
> > short sequences of characters mean. The problem is that then,
> > outside the Apple world, it seems to be left to Microsoft to decide
> > what longer sequences they will allow.
>
> MS and Apple are not the only ones writing renderers.

HarfBuzz OpenType rendering tries to follow MS. That includes dotted
circles. However, it will challenge the MS lead when it is blatantly
wrong. In particular, it has a policy of rendering canonically
equivalent text the same, though that is a challenge when emulating
USE.

So far as I am aware, M17n is not in wide use. It's tolerant, but
one's text won't go far if it relies on M17n.

Text can travel with a graphite font, but that is limiting. Sooner or
later, one will want most text to work with different fonts.

I'm having trouble digging up hard facts about InDesign's rendering, so
I don't know how willing it is to be different to Microsoft's.

> > Perhaps ICANN will be the industry-wide definer. However, to stay
> > with Indic rendering, one may have cases where CVC and CCV
> > orthographic syllables have little to no visible difference. The
> > Khmer writing system once made much greater use of CVC syllables.
> > For reproducing older texts, one might be forced to encode phonetic
> > CVC as though it were CCV.

> The restriction on sequences appropriate as an anti-spoofing measure
> are not appropriate on general encoded text!

So ICANN will at best serve to indicate sequences that should be
renderable.

> The project I'm involved in tackles only transitive forms of
> equivalence (whether visual or semantic).
>
> Collisions based on these equivalences can be handled with label
> generation rulesets defined per RFC 7940, which allow registration
> policies that are automated.
>
> The further "halo" of "merely" similar labels needs to be handled
> with additional technology that can handle concepts like similarity
> distance.

'Merely' similar CCV and CVC tend to differ when the vowel is above the
consonant and the subscript consonant is spacing, e.g. because it rises
to the hanging baseline. The difference, which is in vowel placement,
is comparable to the variation within one person's handwriting.
However, the difference in mean position seems to be statistically
significant.

The inequivalence issue starts to arise with spacing vowels, which is
when one may find marks being applied to syllables rather than to
individual glyphs.

> From a Unicode perspective, there's a virtue in not over specifying
> sequences, because you don't want to be caught having to re-encode
> entire scripts should the conventions for the use of the elements
> making up the script change in an orthography reform!

This seems to run counter to Mark's idea of regexes defining scripts'
words.

> That does not mean that Unicode (at all times) endorses all
> permutations of free-form sequences as equally valid.

Just as well, as such freedom runs counter to the principle of avoiding
inequivalent encodings of the same thing.

Richard.
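
For illustration only, here is a minimal Python sketch of strategy 3b
quoted above (one label blocks all others in the same variant set),
relying on the equivalence being transitive so that the variant sets
are disjoint. The class name VariantRegistry, its methods, and the
sample Latin/Cyrillic/Greek pairs are assumptions made up for this
sketch; they are not taken from RFC 7940 or from any ICANN ruleset,
and a real label generation ruleset is considerably richer.

```python
class VariantRegistry:
    """Toy registry: a registered label blocks its whole variant set."""

    def __init__(self, variant_pairs):
        # Union-find parent table over characters. Pairs are merged
        # transitively, so the resulting variant sets are disjoint.
        self._parent = {}
        for a, b in variant_pairs:
            self._union(a, b)
        self._taken = {}  # variant-set key -> label that claimed it

    def _find(self, ch):
        # Return the representative character of ch's variant set.
        self._parent.setdefault(ch, ch)
        while self._parent[ch] != ch:
            self._parent[ch] = self._parent[self._parent[ch]]  # path halving
            ch = self._parent[ch]
        return ch

    def _union(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            # Use the lower code point as the representative.
            self._parent[max(ra, rb)] = min(ra, rb)

    def key(self, label):
        # Labels in the same variant set map to the same key.
        return tuple(self._find(ch) for ch in label)

    def register(self, label):
        # Return True if registered, False if blocked by a variant.
        k = self.key(label)
        if k in self._taken:
            return False
        self._taken[k] = label
        return True


# Hypothetical variant pairs: Latin o ~ Cyrillic o, Cyrillic o ~ Greek
# omicron (hence Latin o ~ Greek omicron), and Latin a ~ Cyrillic a.
PAIRS = [("o", "\u043e"), ("\u043e", "\u03bf"), ("a", "\u0430")]

registry = VariantRegistry(PAIRS)
first = registry.register("foo")              # plain Latin label
blocked = registry.register("f\u043e\u03bf")  # Cyrillic o + Greek omicron
other = registry.register("bar")              # unrelated label
```

Only the first and third registrations succeed here. Labels that are
'merely' similar, with no transitive relation between them, fall
outside what such a table can express, which is the "halo" mentioned
in the quoted text.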

