Bidi Parenthesis Algorithm and BidiCharacterTest.txt
Hi,

One of the test cases in BidiCharacterTest.txt seems to me to contradict the description of rules N0 through N2 of the UBA. Or maybe I'm missing something. Here are the details.

The test case in question, on line 114 of BidiCharacterTest.txt, is as follows:

  0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 005B 005D 005D 05D0 0029;1;1;2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1;16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

The first field, up to the first semicolon, is the sequence of characters in logical order, given by their Unicode code points. Translated into readable text, it looks like this:

   a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

where I inserted blanks between every two characters, for better readability, and added position numbers.

The next field of the test case data, whose value is 1, specifies that the paragraph direction is RTL, i.e. the embedding level is 1.

Let me now present the application of N0 through N2, as I understand them, to this text. (Since there are no explicit directional codes here, and no weak characters, we can skip all the rules before N0.)

The results of identifying bracket pairs, per BD16, sorted by the position of the opening bracket, are as follows:

  2 and 17
  3 and 9
  7 and 8
  11 and 15
  13 and 14

Applying N0, we see that:

  . The pair 2-17 encloses 'א', which matches the embedding direction, so N0b instructs to resolve this pair as matching the embedding direction, i.e. R.
  . The pair 3-9 encloses 'b', whose direction is opposite to the embedding direction, and has 'a' before the opening bracket, so N0c1 says we should resolve this pair as L, the direction opposite to the embedding one.
  . The pair 7-8 encloses no strong characters, so it is left as is.
  . The pair 11-15 encloses 'c' and is preceded by 'b', so N0c1 again says to resolve this pair as L.
  . The pair 13-14 encloses no strong characters, so it is left alone.

Therefore, the result after N0 is this:

   a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
   L  R  L  N  L  N  N  N  L  N  L  L  N  N  L  R  R

Applying N1, we then obtain the following result:

   a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
   L  R  L  L  L  L  L  L  L  L  L  L  L  L  L  R  R

There are no neutrals left, so N2 doesn't need to be applied. Now I2 gives the following resolved levels:

   a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
   2  1  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1

However, BidiCharacterTest.txt gives a different sequence of resolved levels:

   2  1  1  1  2  1  1  1  1  1  1  2  1  1  1  1  1

Could someone please point out what I am missing or doing incorrectly?

Thanks in advance.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode
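The field layout and the BD16 bracket pairing described in the message above can be sketched in Python. This is only a minimal illustration, not the reference implementation: the names `parse_test_line` and `bracket_pairs` are hypothetical, the bracket table covers just the three ASCII pairs that occur in this test case (not the full BidiBrackets.txt data), and BD16's stack-size limit and canonical-equivalence handling are left out.

```python
# Hypothetical helper names; simplified to what this one test case needs.
TEST_LINE = ("0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 "
             "005B 005D 005D 05D0 0029;1;1;"
             "2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1;"
             "16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0")

# Assumption: only these ASCII brackets matter here.
CLOSING_FOR = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(CLOSING_FOR.values())


def parse_test_line(line):
    """Split one test line into text, paragraph direction, expected levels."""
    fields = line.split(";")
    text = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    para_direction = int(fields[1])            # 1 means RTL in this test case
    levels = [int(x) for x in fields[3].split()]
    return text, para_direction, levels


def bracket_pairs(text):
    """Stack-based pairing in the spirit of BD16, using 0-based positions."""
    stack = []   # entries: (expected closing character, position of opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in CLOSING_FOR:
            stack.append((CLOSING_FOR[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for a matching opener.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    pairs.append((stack[depth][1], pos))
                    del stack[depth:]   # drop the match and everything above it
                    break
            # A closer with no matching opener is simply skipped.
    return sorted(pairs)


text, direction, expected_levels = parse_test_line(TEST_LINE)
# Shift to the 1-based positions used in the message.
print([(o + 1, c + 1) for o, c in bracket_pairs(text)])
```

The unmatched `{` at position 4 and `}` at position 10 drop out of the pair list, which is why only five pairs remain.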
Re: Bliss?
Markus Scherer markus dot icu at gmail dot com wrote:

> As Michael said, I don't have information. But I found this which might help:
> http://en.wikipedia.org/wiki/Blissymbols#Towards_the_international_standardization_of_the_script

Statements in the linked article such as the following (not written by Markus) always trouble me:

  The proposed encoding does not use the lexical encoding model used in the existing ISO-IR/169 registered character set, but instead applies the Unicode and ISO character-glyph model to the Bliss-character model already adopted by BCI, since this would significantly reduce the number of needed characters.

since my understanding has always been that the reasons behind the character-glyph model go much deeper than reducing the number of encoded characters.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org
Re: Bliss?
On 14 October 2014 17:06, Doug Ewell d...@ewellic.org wrote:

> Statements in the linked article such as the following (not written by Markus) always trouble me:

Gosh, I wonder who it could have been?

https://en.wikipedia.org/w/index.php?title=Blissymbols&diff=331226727&oldid=331223779

Andrew
Re: Bliss?
On 14 Oct 2014, at 17:59, Andrew West andrewcw...@gmail.com wrote:

> On 14 October 2014 17:06, Doug Ewell d...@ewellic.org wrote:
>
>> Statements in the linked article such as the following (not written by Markus) always trouble me:
>
> Gosh, I wonder who it could have been?
>
> https://en.wikipedia.org/w/index.php?title=Blissymbols&diff=331226727&oldid=331223779

Oof.

Folks, I’m a member of the BC-UK committee and have been working with BCI for years to ready Bliss for encoding. Work proceeds apace.

Michael Everson * http://www.evertype.com/
Re: Bidi Parenthesis Algorithm and BidiCharacterTest.txt
From: Andrew Glass (WINDOWS) andrew.gl...@microsoft.com
Date: Tue, 14 Oct 2014 18:07:24 +

> The difference is that N0 is applied per bracket pair and the result of the resolution of one bracket pair may impact the resolution of other bracket pairs in the same isolating run sequence. So in your example:
>
> · 2-17 is resolved to R as you say.
> · Since 2-17 is now R and not neutral, the resolution of 3-9 is R because the check for context finds the opening parenthesis at 2 (now R) before the a at 1. Therefore 3-9 is R under N0c2.

But there's nothing about this in the UAX#9 language! How did you arrive at this dependency, using just what the UBA says?

> The proposed update attempts to make this clearer in the intro to 3.3.5:
>
> http://www.unicode.org/reports/tr9/tr9-32.html#N0
>
>   Note that this rule is applied based on the current bidirectional character type of each paired bracket and not the original type, as this could have changed under X6.
>
> Perhaps this should be emended to include that N0 can also update the type for subsequent tests under N0, which is the case here.

There's a big difference between X6 and N0. X6 is about the explicit override, and is applied before N0. Your interpretation makes N0 a recursive rule, something that is not even hinted at by the UBA spec.

> Currently N0 states:
>
>   N0. Process bracket pairs in an isolating run sequence sequentially in the logical order of the text positions of the opening paired brackets using the logic given below.
>
> Example 1 illustrates a similar case in that the neutral ! resolves to R because of the bracket resolution to R rather than the context between two Ls. This of course takes place in N1 and not N0 as in the example you ask about.

Of course! And so Example 1 is very different from what we are discussing, because each stage of the algorithm is applied to the results of the previous stage. But there's no other place, AFAICS, where the same stage is applied recursively. So I really don't see how this interpretation could be gleaned from the UBA description.

Thanks for explaining, but it is really frustrating to find out about these untold subtleties at this late stage. (And yes, I've read the proposed changes in tr9-32.html, and not even they say anything about this.) How can we be sure that your interpretation is indeed correct, if it is not even hinted at anywhere?
RE: Bidi Parenthesis Algorithm and BidiCharacterTest.txt
Eli asked in response to Andrew:

>> · Since 2-17 is now R and not neutral, the resolution of 3-9 is R because the check for context finds the opening parenthesis at 2 (now R) before the a at 1. Therefore 3-9 is R under N0c2.
>
> But there's nothing about this in the UAX#9 language! How did you arrive at this dependency, using just what the UBA says?

See below.

>> Perhaps this should be emended to include that N0 can also update the type for subsequent tests under N0, which is the case here.
>
> There's a big difference between X6 and N0. X6 is about the explicit override, and is applied before N0. Your interpretation makes N0 a recursive rule, something that is not even hinted at by the UBA spec.

I disagree that this makes N0 a recursive rule. It is a rule with repeatedly applicable subparts. And like nearly all the rules in the UBA (except ones which explicitly state that they apply to *original* Bidi_Class values, which thus have to be stored across the life of the processing of the string in question), all rules apply to the *current* Bidi_Class values of the examined context. In this sense, the UBA, for most rules, operates as a set of change-and-forget steps.

Thus in the case of N0, if you are processing a sequential list of bracket pairs, you just process each pair, one at a time, and it sees as its input whatever the *current* state is -- which may be (and often is) changed by the last step. What you do *not* need to do for N0 is preserve the starting state when N0 was initiated, and independently check each bracket pair against *that* array of Bidi_Class values while you are busy setting them to new values.

> Of course! And so Example 1 is very different from what we are discussing, because each stage of the algorithm is applied to the results of the previous stage. But there's no other place, AFAICS, where the same stage is applied recursively. So I really don't see how this interpretation could be gleaned from the UBA description.

I agree that this could (and should) be made more explicit, as it is apparent that people can run into problems of interpretation here.

An examination of the functioning of the N0 rule in the bidi reference implementations could, however, also be used to help explain what is intended here. For example, in the particular test case in question, the bidiref C implementation can have its debug diagnostics cranked up, and you find:

  Trace: Entering br_UBA_ResolveEN [W7]
  Current State: 13
    Text:       0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 005B 005D 005D 05D0 0029
    Bidi_Class:    L   ON   ON   ON    L   ON   ON   ON   ON   ON   ON    L   ON   ON   ON    R   ON
    Levels:        1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    Runs:       RR
  …
  Trace: Exiting br_SortPairList
  Pair list: {1,16} {2,8} {6,7} {10,14} {12,13}
  Debug: Strong direction e between brackets
  Debug: Strong direction o between brackets
  Debug: No strong direction between brackets
  Debug: Strong direction o between brackets
  Debug: No strong direction between brackets
  Current State: 14
    Text:       0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 005B 005D 005D 05D0 0029
    Bidi_Class:    L    R    R   ON    L   ON   ON   ON    R   ON    R    L   ON   ON    R    R    R
    Levels:        1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
    Runs:       RR

Which is the clue needed to track down how the interpretation based on comparing Bidi_Class values retained from the initiation of rule N0 is incorrect.

--Ken

> Thanks for explaining, but it is really frustrating to find out about these untold subtleties at this late stage. (And yes, I've read the proposed changes in tr9-32.html, and not even they say anything about this.) How can we be sure that your interpretation is indeed correct, if it is not even hinted anywhere?
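The two readings of N0 discussed in this thread can be compared side by side with a small Python sketch. It is a simplified illustration under stated assumptions, not the bidiref code: all function names are hypothetical, the pair list and initial Bidi_Class values are taken from the trace above (0-based positions), the text is treated as a single isolating run sequence at level 1 with sos and eos both R, and EN/AN and NSM handling are omitted because this test case has none.

```python
# Illustrative sketch only; names are hypothetical, rules are simplified.
TYPES = "L ON ON ON L ON ON ON ON ON ON L ON ON ON R ON".split()
PAIRS = [(1, 16), (2, 8), (6, 7), (10, 14), (12, 13)]  # 0-based, from the trace
EMBEDDING = "R"   # paragraph level 1: embedding direction, sos and eos are R
OPPOSITE = "L"
STRONG = ("L", "R")


def resolve_n0(types, pairs, use_current_state):
    out = list(types)
    # The array the rule *reads*: the live one, or a frozen copy of the start.
    view = out if use_current_state else list(types)
    for opener, closer in pairs:
        inside = {t for t in view[opener + 1:closer] if t in STRONG}
        if EMBEDDING in inside:                                  # N0b
            new_type = EMBEDDING
        elif OPPOSITE in inside:                                 # N0c
            # Nearest strong type before the opener, falling back to sos.
            before = next((t for t in reversed(view[:opener]) if t in STRONG),
                          EMBEDDING)
            new_type = OPPOSITE if before == OPPOSITE else EMBEDDING
        else:                                                    # N0d
            continue
        out[opener] = out[closer] = new_type
    return out


def resolve_neutrals(types):
    """N1/N2 for the remaining neutrals in a single run."""
    out = list(types)
    i = 0
    while i < len(out):
        if out[i] in STRONG:
            i += 1
            continue
        j = i
        while j < len(out) and out[j] not in STRONG:
            j += 1
        before = out[i - 1] if i > 0 else EMBEDDING      # sos
        after = out[j] if j < len(out) else EMBEDDING    # eos
        out[i:j] = [before if before == after else EMBEDDING] * (j - i)
        i = j
    return out


def levels_at_odd_level(types, level=1):
    """I2 for an odd embedding level: L goes up one level, R stays."""
    return [level + 1 if t == "L" else level for t in types]


for use_current in (False, True):
    resolved = resolve_neutrals(resolve_n0(TYPES, PAIRS, use_current))
    print(use_current, levels_at_odd_level(resolved))
```

With `use_current_state=False` each pair is checked against the Bidi_Class values as they were when N0 started, which gives the levels from the original question (2 1 2 2 ... 2 1 1). With `use_current_state=True` the pair {2,8} sees the already-resolved R at position 1 before reaching the L at position 0, and the result matches the levels in BidiCharacterTest.txt.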
Re: Bidi Parenthesis Algorithm and BidiCharacterTest.txt
From: Whistler, Ken ken.whist...@sap.com
Date: Tue, 14 Oct 2014 22:14:02 +
Cc: Whistler, Ken ken.whist...@sap.com, unicode@unicode.org unicode@unicode.org

> I disagree that this makes N0 a recursive rule. It is a rule with repeatedly applicable subparts. And like nearly all the rules in the UBA (except ones which explicitly state that they apply to *original* Bidi_Class values, which thus have to be stored across the life of the processing of the string in question), all rules apply to the *current* Bidi_Class values of the examined context.

Can you point out where this is stated in the UBA?

According to my reading of the UBA, only W7 could qualify as something similar to the recursive interpretation of N0. All the other rules are either defined in a way that the recursion cannot happen (because the conditions for applying the rule disappear after it is applied once), or explicitly speak about a sequence of similar characters whose bidi types are modified in the same manner.

> Trace: Exiting br_SortPairList
> Pair list: {1,16} {2,8} {6,7} {10,14} {12,13}
> Debug: Strong direction e between brackets
> Debug: Strong direction o between brackets
> Debug: No strong direction between brackets
> Debug: Strong direction o between brackets
> Debug: No strong direction between brackets

This doesn't explain _why_ the decision was that the direction between brackets was one or the other. Which is at the core of the issue at hand. So this debugging output doesn't really help here.

In any case, when designing an implementation, one normally expects to read some formal requirements, not learn those requirements from another implementation.

Anyway, I'm glad we all agree that, once again, the new additions to the UBA, and the BPA-related ones in particular, are not described well enough to avoid misinterpretations and misunderstandings such as this one, and that the language should be improved and clarified, hopefully sooner rather than later. I've just lost 20 hours of work due to that.