I apologize in advance: I'm running low on time and haven't gone through all the messages on this thread carefully, so I may not fully appreciate everyone's position. I'm just making some quick points about 2 items that caught my eye.
1. There are certainly times when two rules in sequence may overlap, just for simplicity.

    X Y* × Z
    Y × Z* W

The first rule could trigger on X Y Z W, even though the second would also trigger on it. This may or may not be "sloppiness"; sometimes also excluding everything that could possibly have triggered an earlier rule simply makes the second rule too convoluted.

That being said, if there are simplifications in the rules that would make them clearer, I'd suggest submitting a proposal for that. The UTC is meeting next week, and could consider it either then or at subsequent meetings.

Note: the HTML files in http://unicode.org/Public/UNIDATA/auxiliary/ have a number of sample cases (which are also used in the test files). Hovering over boundaries in those sample cases shows which rule is triggered, such as in http://unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakTest.html#samples

We're always open to additional samples that illustrate how the rules work. As I thought about your message, it became clear to me that it would be useful to have a set of sample cases complete enough that each rule is triggered by at least one case, if you or anyone else is interested in helping to add those.

2. Also, the following 2 rules are not equivalent:

    a) Any × (Format | Extend)
    b) X (Extend | Format)* → X

(b) implies (a), but not the reverse. The difference is on the right side of characters. Rule (b) affects every subsequent rule, and can be viewed as a shorthand. After it, we can just say:

    A B × C D

And that has the effect of saying:

    A (Extend | Format)* B (Extend | Format)* × C (Extend | Format)* D

See also http://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules

However, it may not be clear that (b) implies (a); that might be what you are getting at. If so, then we could add an explicit statement to that effect.
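To make the shorthand reading of rule (b) in point 2 concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names expand_rule and collapse are hypothetical (they are not from UAX #29 or any library), each property name is assumed to be a single whitespace-separated token, and U+0020 is labelled Other purely for the example.

```python
# Illustrative sketch only; these helpers are hypothetical and are not
# taken from UAX #29 or from any library.

IGNORED = {"Extend", "Format"}
OPERATORS = {"×", "÷", "→"}


def expand_rule(rule):
    """Spell out the shorthand: insert (Extend | Format)* after every
    property name except the last one in the rule.

    Assumes each property name is a single whitespace-separated token.
    """
    tokens = rule.split()
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        is_last = i == len(tokens) - 1
        if token not in OPERATORS and not is_last:
            out.append("(Extend | Format)*")
    return " ".join(out)


def collapse(props):
    """Apply X (Extend | Format)* -> X to a list of property names.

    Returns (property, start_index) pairs. Later rules only look at the
    start of each unit, so no boundary is ever considered in front of an
    absorbed Extend or Format. An Extend or Format at the very start of
    the text has nothing to attach to and stays as its own unit.
    """
    units = []
    for index, prop in enumerate(props):
        if units and prop in IGNORED:
            continue  # absorbed into the preceding unit
        units.append((prop, index))
    return units


print(expand_rule("A B × C D"))

# U+0041 U+200D U+0020 from the quoted message, with ZWJ treated as
# Extend as stated there and SPACE labelled Other (an assumption).
print(collapse(["ALetter", "Extend", "Other"]))
```

In the second call, index 1 (the position between the "A" and the ZWJ) never appears as the start of a unit, which is the sense in which (b) implies (a) under this reading.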
Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Thu, Jan 29, 2015 at 7:52 PM, Karl Williamson <[email protected]> wrote:

> On 01/25/2015 05:14 AM, Philippe Verdy wrote:
>
>> This is not a contradiction.
>>
>
> At the very least it is too sloppy for a standard. Once there is a match
> in the list of rules, later rules shouldn't have to be looked at. I'll
> submit a formal feedback form.
>
> But there is another issue as well. I do not see how the specified rules
> when applied to the sequence of code points:
>
> U+0041 U+200D U+0020
>
> cause the ZWJ, an Extend, to not break with the "A", an ALetter.
>
> Rule WB4 is
>
> "Ignore Format and Extend characters, except when they appear at the
> beginning of a region of text.".
>
> Not clearly stated, but it appears to me that the ZWJ must be considered
> here to be the beginning of a region of text, as we are looking at the
> boundary between it and the "A". No rule specifically mentions ALetter
> followed by an Extend, so by the default rule, WB14
>
> "Otherwise, break everywhere (including around ideographs)"
>
> this should be a word break position. But that is absurd, as the Extend
> is supposed to extend what precedes it. If I add a rule
>
> "Don't break before Extend or Format"
> × (Extend | Format)
>
> my implementation passes all tests. I added this rule before WB4.
>
>
>
>> combine the two rules and they are equivalent to these two alternate
>> rules:
>> WB56 can be read as these two:
>>
>> (WB56a) ALetter × (MidLetter | MidNumLet | Single_Quote) (ALetter |
>> Hebrew_Letter)
>>
>> (WB56b) Hebrew_Letter × (MidLetter | MidNumLet | Single_Quote)
>> (ALetter | Hebrew_Letter)
>>
>>
>> Then add :
>>
>> (WB57) Hebrew_Letter × Single_Quote
>>
>> it just removes the condition of a letter following the quote in WB56b.
>> So that WB56b and WB57 can be read as equivalent to these two:
>>
>> (WB56c) Hebrew_Letter × (MidLetter | MidNumLet) (ALetter |
>> Hebrew_Letter)
>>
>> (WB57) Hebrew_Letter × Single_Quote
>>
>> But you cannot merge any of these two last rules in a single rule for
>> WB56.
>>
>>
>> 2015-01-25 7:26 GMT+01:00 Karl Williamson <[email protected]
>> <mailto:[email protected]>>:
>>
>> I vaguely recall asking something like this before, but if so, I
>> didn't save the answers, and a search of the archives didn't turn up
>> anything.
>>
>> Some of the rules in UAX #29 don't make sense to me.
>>
>> For example, rule WB7a
>> Hebrew_Letter × Single_Quote
>>
>> seems to say that a Hebrew_Letter followed by a Single Quote
>> shouldn't break. (And Rule WB4 says that actually there can be
>> Extend and Format characters between the two and those should be
>> ignored).
>>
>> But the earlier rule, WB6
>>
>> (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet |
>> Single_Quote) (ALetter | Hebrew_Letter)
>>
>> seems to me to say (among other things) that a Hebrew Letter
>> followed by a Single Quote shouldn't break if and only if the latter
>> is also followed by either an ALetter or another Hebrew Letter
>> (again modulo ignored Format and Extend letters)
>>
>> This seems contradictory. One rule says something unconditionally,
>> and the other rule adds conditions.
>> _________________________________________________
>> Unicode mailing list
>> [email protected] <mailto:[email protected]>
>> http://unicode.org/mailman/__listinfo/unicode
>> <http://unicode.org/mailman/listinfo/unicode>
>>
>>
>>
> _______________________________________________
> Unicode mailing list
> [email protected]
> http://unicode.org/mailman/listinfo/unicode
>

