2015-01-29 19:52 GMT+01:00 Karl Williamson <[email protected]>:
> Rule WB4 is
>
> "Ignore Format and Extend characters, except when they appear at the
> beginning of a region of text.".
>
> Not clearly stated, but it appears to me that the ZWJ must be considered
> here to be the beginning of a region of text, as we are looking at the
> boundary between it and the "A". No rule specifically mentions ALetter
> followed by an Extend, so by the default rule, WB14
>
> "Otherwise, break everywhere (including around ideographs)"

All the text is targeted at finding candidate positions for breaks. It is not stated very clearly that "ignore" is definitive and means that there cannot be any break before Format and Extend characters, except at the beginning of text. So all the remaining rules are ignored: there was a match and you stop there, with no break before:

    Any × (Format | Extend)

This is confirmed by the other rules that use the word "otherwise", including the last one (WB14) that you quote, which is explicitly not applicable.

But I agree with you that rules WB56 and WB57 would be better rewritten as:

    (WB56a): ALetter × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter)
    (WB56c+WB57 combined): Hebrew_Letter × ((MidLetter | MidNumLet) (ALetter | Hebrew_Letter) | Single_Quote)

Note also that for French, the single quote is followed by a word break (but NOT by a line break by default, and also NOT by a syllable break for hyphenation), except in a very few cases like "aujourd'hui", which is now treated as a single word. There is an elision there, but also a contraction of 4 words, as if it were written "au jour d' hui"; the term "hui" no longer occurs in isolation anywhere, only in that common word where all the components are glued. Most elision apostrophes normally occur at the end of a word (e.g. the two apostrophes in « l'année n'est pas terminée »).
The rare cases where you should not break after an apostrophe are those where the elision occurs in the middle of a word, in some vulgar expressions like « c't'après-m' », which contains two informal words « c't' » and « après-m' » abbreviating « cet après-midi » in popular language. In English you have the case where the elision occurs at the beginning of a word: « it's » is two words, « it » and « 's » (abbreviating « is »); or in the middle: « aren't » contains two glued words, « are » and « n't » (abbreviating « not »). In both cases, you can use the WB rules, but then treat some exceptions for the candidate breaks. This way a single matching rule is needed and you no longer need to look for other rules.

But we are not discussing line breaks here, only word breaks (for the purpose of performing dictionary lookups and grammar analysis): with the default rules we should be able to "unglue" the words by default, then use exception lists to see whether we must reattach them, as the resulting pieces are not all words.

So first attempt to look for a word terminated by an apostrophe, and then perform a language-dependent lookup for known exceptions (« aujourd' » « hui » cannot match because « hui » is not a separate word), for which we must try something else: look for a word starting with an apostrophe. In English, « it's » would first be treated by the previous rule as « it' » and « s », but « s » alone is treated as an exception; then with this rule it will correctly identify « 's » independently of the previous word, except if it follows an acronym as in « GMO's », because in that case the « 's » is not a separate verb or a genitive particle but a known plural mark.

Word breaks are more complicated to handle than line breaks, as they need dictionary lookups to confirm them. But this is the very purpose of a word breaking process: to be used in order to perform dictionary lookups.

With it, you can then safely tailor the line breaking algorithm in order to implement syllable breaking for hyphenation, which also needs these dictionary lookups to detect exceptions to the normal syllable breaks (exceptions that can be detected only with language-specific lookups for some pairs, i.e. digrams or trigrams).
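To make the two-pass lookup for apostrophes more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a tailoring defined by UAX #29: the function name split_apostrophe, the two tiny exception tables, and the "all-uppercase stem means acronym" test are all hypothetical, and a real implementation would use per-language dictionaries. It only handles tokens with a single apostrophe, so « c't'après-m' » is out of its scope.

```python
# Hypothetical, deliberately tiny exception lists (assumptions for illustration);
# a real implementation would load per-language dictionaries instead.
NOT_STANDALONE = {"fr": {"hui"}, "en": {"s", "t"}}
# Known words that start with (or contain) the apostrophe, longest first.
LEADING_APOSTROPHE_WORDS = {"fr": (), "en": ("n't", "'s")}


def split_apostrophe(token, lang):
    """Split a token containing one apostrophe into words, or keep it whole."""
    token = token.replace("\u2019", "'")  # treat the typographic apostrophe alike
    if "'" not in token:
        return [token]

    # Pass 1: look for a word terminated by the apostrophe (l' + année),
    # unless what follows is known not to be a separate word (hui, s, t).
    head, tail = token.split("'", 1)
    if tail and tail.lower() not in NOT_STANDALONE[lang]:
        return [head + "'", tail]

    # Pass 2: look for a known word starting at or before the apostrophe.
    lowered = token.lower()
    for clitic in LEADING_APOSTROPHE_WORDS[lang]:
        if lowered.endswith(clitic) and len(token) > len(clitic):
            stem = token[: -len(clitic)]
            # Assumed heuristic: an all-uppercase stem is an acronym, and its
            # 's is a plural mark, so the token stays glued (GMO's).
            if clitic == "'s" and len(stem) > 1 and stem.isupper():
                return [token]
            return [stem, token[-len(clitic):]]

    # Neither pass found a valid split: keep the token as a single word.
    return [token]


print(split_apostrophe("l'année", "fr"))      # ["l'", 'année']
print(split_apostrophe("aujourd'hui", "fr"))  # ["aujourd'hui"]
print(split_apostrophe("it's", "en"))         # ['it', "'s"]
print(split_apostrophe("aren't", "en"))       # ['are', "n't"]
print(split_apostrophe("GMO's", "en"))        # ["GMO's"]
```

The order of the two passes follows the description above: the default is to break after the apostrophe, and the exception lists only decide when that default must be undone or replaced by a break before the apostrophe.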
_______________________________________________ Unicode mailing list [email protected] http://unicode.org/mailman/listinfo/unicode

