The answer to this question is neither easy nor obvious. Part of the problem is that what constitutes a 'word' is subject to tailoring. In certain languages, or in certain situations, implementations may need to make different behavioral choices than the ones we document.
That puts a natural limit on how 'accurate' our default rules can and should be for certain rare and special cases.
On the other hand, certain characters explicitly have word-connecting semantics. Part of the purpose of our wordbreaking rules is to provide a convenient description of that behavior (together with a list of the characters affected). In some instances, such special semantics are not subject to tailoring, as tailoring them would subvert the reason for that character's existence in the Unicode Standard. Note that in the description of linebreaking, which has issues similar to the ones discussed here, such characters are called out explicitly.
Finally, there are some complexities that result from the overall architecture of the Unicode Standard (and which in turn are driven by complexities in the writing systems it attempts to cover). The existence of combining marks and grapheme clusters is one of them. One of the purposes of providing the word break rules must be to give accurate guidance on handling these complexities. That in turn argues for (rather than against) detailed specifications of edge cases involving combining marks, absence of base characters, etc., so that not all implementers have to rediscover such issues independently.
After this preamble, what can we conclude about the issue?
The rule "treat combining marks" like their base character works well if the base character has definite behavior, i.e. is either a letter or a symbol. In such cases the combining sequence acts like a letter or symbol (or number, or punctuation mark) and that most naturally follows from the fact that most combining marks are used as distinguishing decorations on character, or, act like letters following other letters. (The use of combining marks on symbols or punctuation is in the domain of specialized notations, such as mathematics or musical notation, both of which can be expected to heavily tailor word breaking and other default algorithms)
However, when NBSP is used as a base-character stand-in (the use of SPACE for this purpose will be deprecated in 4.1, for reasons that should have become abundantly clear in this discussion alone), the resulting combining sequence is best treated as a letter, not as a space. In line breaking, using NBSP already has that effect, since NBSP and letters behave very similarly there, but in word breaking this is not the case.
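To make that concrete, here is a minimal sketch (in Python, purely illustrative and not taken from any specification) of classifying a combining character sequence by looking at its base character, with NBSP-based sequences treated as letters. The class names follow UAX #29, but the function and its coarse mapping are only an illustration, not the actual algorithm:

    import unicodedata

    NBSP = "\u00A0"

    def is_combining_mark(ch):
        # Rough stand-in for Grapheme_Extend: General_Category Mn, Mc, or Me.
        return unicodedata.category(ch) in ("Mn", "Mc", "Me")

    def classify_sequence(seq):
        # Classify a combining character sequence by its base character.
        base = seq[0]
        if base == NBSP and len(seq) > 1:
            return "ALetter"   # NBSP as base stand-in: treat the whole sequence as a letter
        cat = unicodedata.category(base)
        if cat.startswith("L"):
            return "ALetter"
        if cat.startswith("N"):
            return "Numeric"
        return "Other"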
It might make sense to explicitly add these rules:
1) treat all sequences of NBSP followed by combining marks as ALetter
2) treat all combining marks w/o base character (i.e. at the start of text, or after control codes) as ALetter
(rule 2 is implicitly captured by making Grapheme_Extend < ALetter; a rough sketch of how both rules might look in practice follows below).
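As a rough illustration only (hypothetical Python, not an implementation of the actual UAX #29 rules), such tailorings could be applied as a pre-pass that assigns a word break class to each code point, resolving combining marks along the lines of rules 1 and 2:

    import unicodedata

    NBSP = "\u00A0"

    def is_combining_mark(ch):
        return unicodedata.category(ch) in ("Mn", "Mc", "Me")

    def is_control(ch):
        return unicodedata.category(ch) in ("Cc", "Cf")

    def resolve_classes(text):
        classes = []
        for i, ch in enumerate(text):
            if is_combining_mark(ch):
                if i == 0 or is_control(text[i - 1]):
                    classes.append("ALetter")    # rule 2: mark with no base character
                else:
                    classes.append(classes[-1])  # mark takes its base's class
            elif ch == NBSP and i + 1 < len(text) and is_combining_mark(text[i + 1]):
                classes.append("ALetter")        # rule 1: NBSP as a base stand-in
            elif unicodedata.category(ch).startswith("L"):
                classes.append("ALetter")
            else:
                classes.append("Other")
        return classes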
However, in linebreaking we found that rules of the form "treat all x followed by combining marks as y" are difficult to implement, with fewer difficulties in the case where x = y.
Adding an INVISIBLE, zero-width base letter of class ALetter for wordbreak and AL for linebreak would allow those who manage to enter one into a text to specify precisely that any combining sequence with this base letter should be treated like a letter for these purposes. It would also allow the placement of diacritical marks *between* letters (as cited by J. Knappen), but only if layout engines implement a true zero width for the combining sequence. (The latter would clash with the use of this character for citing a standalone diacritic, where some width is usually desirable.)
The biggest problem with adding a new letter is that all rendering implementations would need to be updated to recognize it, and all input methods would have to be updated to give users access to it. Especially because of the latter, the use of SPACE (and possibly NBSP) for this purpose will continue.
There are no easy answers. A./

