On 23/11/16 10:01, Daniel Bünzli wrote:
On Tuesday 22 November 2016 at 13:07, Tom Hacohen wrote:
However, looking at the test case and the UAX[2], this does not look
correct. More specifically, because of rule 4:
ZWJ Extended GAZ -> ZWJ GAZ
And then according to rule 3c, there should be no break opportunity
between them.

I'd say this is not the right operational model. From [1]:

"The rules are processed from top to bottom. As soon as a rule matches and produces 
a boundary status (boundary or no boundary) for that offset, the process is 
terminated."

So in this case between COMBINING DIAERESIS and HEAVY BLACK HEART rule WB4 
kicks in. It does not produce a boundary status, it only changes your offset 
context to ZWJ GAZ, as you mention. Now you continue applying the rules 
sequentially with WB6 which does not match, with WB7 which does not match,... 
and you'll get to WB999 which matches and produces a boundary status.

After WB4 you do not restart the matching process from the beginning, as you 
do, leading you to say that WB3c should apply.

Hey Daniel,

Thank you for your reply, but I don't think the UAX, and specifically the line you quoted, implies that. That line says the process is terminated when a rule matches and produces a boundary status. In Table 1[1], the right arrow (which is used in rule 4) is listed as a boundary symbol, so I would argue that once rule 4 matches, one should stop the process and restart it from the beginning.

Furthermore, in the clarification to rule 4[2] it clearly states: "The main purpose of this rule is to always treat a grapheme cluster as a single character—that is, as if it were simply the first character of the cluster".
This again supports my understanding that:
X Extended Y
should behave exactly the same as
X Y
once the extending characters have been ignored, which is exactly what I'm arguing for.
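
To make the two readings concrete, here is a minimal Python sketch of both operational models. It is only an illustration under strong assumptions: it models just WB3c, WB4 and WB999, leaves out Format characters, the sot/CR/LF exceptions and WB5 to WB13, and all names in it (is_break, effective_left, restart_after_wb4) are made up for this example rather than taken from the UAX or from any library.

```python
# Simplified word-break properties (a hypothetical subset, for illustration only).
ZWJ, EXTEND, GAZ, OTHER = "ZWJ", "Extend", "Glue_After_Zwj", "Other"
IGNORED = {EXTEND, ZWJ}  # WB4 also covers Format, omitted in this sketch


def wb3c(left, right):
    """WB3c: ZWJ x Glue_After_Zwj (no break)."""
    return left == ZWJ and right == GAZ


def effective_left(props, i):
    """WB4 rewrite: skip the Extend/ZWJ run before offset i and return X."""
    j = i - 1
    while j > 0 and props[j] in IGNORED:
        j -= 1
    return props[j]


def is_break(props, i, restart_after_wb4):
    """Decide whether there is a break between props[i-1] and props[i]."""
    left, right = props[i - 1], props[i]
    if wb3c(left, right):
        return False  # WB3c on the original context
    if right in IGNORED:
        return False  # WB4: never break before Extend/ZWJ
    left = effective_left(props, i)  # WB4: context becomes "X right"
    if restart_after_wb4 and wb3c(left, right):
        return False  # "restart" reading: WB3c is tried again
    # WB5..WB13 would be tried here on the rewritten context (omitted).
    return True  # WB999: break everywhere else


# ZWJ, COMBINING DIAERESIS, HEAVY BLACK HEART
case = [ZWJ, EXTEND, GAZ]
print(is_break(case, 2, restart_after_wb4=False))  # True: sequential reading
print(is_break(case, 2, restart_after_wb4=True))   # False: restart reading
```

Under the sequential reading the sketch reports a break before HEAVY BLACK HEART, while under the restart reading it reports none, which is the difference being discussed.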

--
Tom

[1] http://www.unicode.org/reports/tr29/#Table_Boundary_Symbols
[2] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
