On Sun, 16 Aug 2015 12:08:34 -0700 Ken Whistler <[email protected]> wrote:
> Some editorial oversight or a typo in the text of the core
> specification cannot be taken as legalistically somehow trumping
> the data file, just because somebody finds it "written in the
> standard".
>
> Capiche?

No. What about oversights and typos in the UCD? Indeed, two variation
sequences were removed because it was found that their bases were
decomposable, which contradicts the core specification. In this case,
the UCD did not trump the rules for variation sequences. When there is
a contradiction, it needs to be investigated and resolved, with
awareness that different people may be relying on different parts of
the specification.

> (In most cases, the core specification is simply underspecified
> because the research, writing and editing for it is under-resourced.)

That is also true of much of the UCD. I suspect that much of it relies
on intelligent guesswork. Some properties may simply be ignored because
nothing readily testable uses them (e.g. line and word-break properties
relevant for scriptio continua writing systems), and others appear to
be arbitrary. (Is the allocation of digits to L, AN or EN actually
anything but an encoding decision?) Fortunately, most errors in the UCD
can be corrected when the settings don't work; casing pairs, names,
decompositions and canonical combining classes are the main problems.
I believe problems arising from codepoint assignments could be fixed by
creating singleton decompositions, e.g. to change mere numbers into
decimal digits.

As an example of an effectively ignored line-break property, I offer
that of the Thai repetition character U+0E46 THAI CHARACTER MAIYAMOK.
It is currently of general category Lm, and has the line-break property
SA 'South-East Asian line-breaking'. This means that the Unicode
line-breaking algorithm calls upon a non-standard algorithm to assign
each instance of the character a line-break property. Now I believe
that it should have line-break property EX.
I can find a grammatical description that says it should be separated
from the preceding word by a space, and I have found no example in
books of U+0E46 starting a line. Giving it line-break property EX would
prevent a line break between the space and the repetition mark.
However, there is little point in trying to have it assigned line-break
property EX, for the Unicode assignment is irrefutable. My argument has
to be addressed to the specifications of the algorithms doing Thai
line-breaking.

A historical example of errors in the UCD is U+200B ZERO WIDTH SPACE
(ZWSP). Its primary use is as a word separator in scripts that don't
have visible word separators, though I'm currently finding it useful in
Word 2010 to split up excessively long path names without visible
hyphens being added. When its general category was changed from Zs to
Cf, its Unicode word-break property became 'Format'; it no longer had
any effect on word-breaking. Its line-breaking behaviour was preserved,
so the control of text layout was unaffected. For SE Asian languages,
the change had no direct effect, for their word-breaking rules are
largely outside the scope of the Unicode text segmentation algorithms.
All went well until someone decided that TUS text describing it as a
word-breaker was an 'editorial oversight'. A corrigendum removed this
word-breaking behaviour, and SE Asian word processors started to
misbehave as software maintainers caught up with the corrigendum. For
details see an email from Javier Solá:
http://unicode.org/mail-arch/unicode-ml/y2009-m01/0604.html . The
referenced proposal gives the text of the erratum, dated May 2008.
Presumably corrigenda did not then have numbers, for there is no trace
of its former existence in
http://www.unicode.org/versions/corrigenda.html .

A similar process is now in progress for U+2060 WORD JOINER (WJ), which
is the opposite of ZWSP. It is intended that WJ will cease to indicate
the absence of word boundaries.
In scripts that have visible word boundaries, the absence of an effect
on word-breaking is of no consequence for sequences of letters, for the
mere juxtaposition of letters prevents a word break between them. By
contrast, SE Asian word-boundary detectors largely rely on recognising
words, and they can make mistakes, or be given an impossible task. The
English analogue is detecting the word boundary in 'humanevents' - is
the last word 'events' or 'vents'? A notable challenge is to persuade a
Thai spell-checker that a transliteration of 'Hemingway' is actually a
single word. Delimiting the boundaries does not work - one has to join
the fragments into which the automatic word-breaker splits it.

The language proposed for ISO 10646, in
http://www.unicode.org/L2/L2015/15211-word-joiner.pdf , does not
actually state that WJ does not prevent a word break, though stronger
text denying that it suppresses word breaks has been proposed for
Unicode.

By contrast, U+202F NARROW NO-BREAK SPACE (NNBSP) looks set to regain
its originally intended purpose, that of a narrow space that does not
break words. The script for which it was intended, Mongolian, will be
able to use the Unicode word-boundary detection algorithm once NNBSP is
allowed as part of a word. However, the fact remains that NNBSP should
never have been allowed to break words. The core text has long stated
that NNBSP does not break Mongolian words. There remains, however, a
possibility that European usage of NNBSP will prevent it from
recovering its intended functionality.

> Yes, a notice at the top:
>
> @+ For details about the implementation of variation sequences in
> Phags-pa, please refer to the Phags-pa section of the core
> specification.

a) This is likely to be ignored by someone who is just looking for the
*specification*. I think replacing 'implementation' by 'rendering'
would be better. I would be inclined to add, 'These sequences are more
complicated than they appear at first reading'.
Otherwise, someone will just add them to the character-to-glyph
conversion section of a font and think, "Job done".

b) This won't work where the effort has not been expended on the core
text.

As to StandardizedVariants.txt, Section 23.4 needs to refer to the
Phags-pa section in the core text. As that file points to Section 23.4
of TUS, this should then at least suggest that the descriptions in the
file do not override the core specification.

Richard.

