Adding Behdad for his insight on the rendering stack. But as for user requirements and expectations, the first option, with the hyphen on the right side of "car" as "car-" is what a good publisher would want to print in his magazine or book. The second option is harder to decipher for an RTL reader.
(Note that breaking opposite-direction phrases across lines in bidi paragraphs is also avoided as much as possible in good typography, as the output is weird to some readers anyway.)

On Apr 1, 2014 1:21 PM, "Whistler, Ken" <[email protected]> wrote:

> I don’t think the answer is directly deduced from UAX #9, because
> it involves deciding where to insert a visible hyphen for display.
>
> However, I think the correct answer here is your number two guess,
> i.e. (in a RTL paragraph context):
>
> -car SI TORRAC
>
> A way to think about this, rather than starting from the BN nature
> of U+00AD, is to ask what would happen if there was an *explicit*
> hyphen-minus at the same position. Shortening your example
> line “CARROT IS car\u00AD” to just the equivalent of “ABC car-”,
> the outcome of the bidiref processing for a RTL paragraph context is:
>
> Trace: Entering br_UBA_ReverseLevels [L2]
> Current State: 19
>   Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
>   Bidi_Class:     R    R    R    R    L    L    L    R
>   Levels:         1    1    1    1    2    2    2    1
>   Runs:        <R-----------------------------------R>
>
> Order: [7 4 5 6 3 2 1 0]
>
> In other words, on display:
>
> -car CBA
> <---------
>
> with the hyphen-minus at the *end* of the reordered line, as
> expected.
>
> If you run the same example, but substituting U+00AD for U+002D, you get:
>
> Trace: Entering br_UBA_ReverseLevels [L2]
> Current State: 19
>   Text:        05D0 05D1 05D2 0020 0063 0061 0072 00AD
>   Bidi_Class:     R    R    R    R    L    L    L   BN
>   Levels:         1    1    1    1    2    2    2    x
>   Runs:        <R-----------------------------------R>
>
> Order: [4 5 6 3 2 1 0]
>
> And the display for that would be:
>
> car CBA
>
> But *then* your hyphenation algorithm would presumably kick in and decide
> that the U+00AD is at the end of the line and should display as a visible
> hyphen glyph.
> But “end of the line” here means the same as it would for the explicit
> hyphen-minus, so when you insert the visible hyphen glyph, you end up
> with the same result:
>
> -car CBA
>
> Another way of looking at this is that in order to line break your text
> in the first place, you need to be able to calculate the resolved
> display width to fit in the line. That would have to include the visual
> display of the inserted hyphen glyph. So once you have *decided* to
> break the line at the soft hyphen, in effect, you substitute a visual
> display symbol U+002D (or the actual hyphen U+2010, etc.) for U+00AD.
> *Then* run the UBA on the results to get the resolved order of all the
> elements on the line. The net effect should be the same.
>
> Maybe folks with full implementations of bidi rendering would have more
> to contribute on this, but that would be my own take on the problem.
>
> --Ken
>
>
> Suppose I have a paragraph (uppercase = RTL):
>
> CARROT IS car\u00ADrot IN ENGLISH
>
> and the paragraph gets broken at the soft hyphen.
>
> Is the correct ordering for the first line
>
> car- SI TORRAC
>
> or
>
> -car SI TORRAC
>
> ? I did not succeed in deducing the answer from UAX#9. Soft hyphen has
> bidi class BN, which means it gets removed in stage X9, and so, if I
> have understood correctly, doesn't have a defined embedding level.
>
> I'm guessing the correct ordering is the first one, but I don't trust my
> instincts here. (In particular, I wondered whether this was analogous to
> the case where rule L1 resets embedding levels so that trailing
> whitespace is at the visual end of the line.)
>
> More generally, suppose you have a markup language which has a construct
> for discretionary breaks, as in TeX, with pre-break, post-break and
> no-break text.
> Soft hyphen is a special case of this (where the pre-break text
> consists of a hyphen, and the post-break and no-break texts are empty);
> you can also regard space as a kind of discretionary break (post-break
> text empty, no-break text contains the space, pre-break text either
> contains the space or is empty, depending on how you want to think
> about it). Obviously the embedding level for the no-break text should be
> resolved as if the discretionary break was replaced by the no-break text
> (which is consistent with a bidi class of BN for soft hyphen). However,
> for the pre- and post-break text, it is not clear to me what the right
> way is to resolve embedding levels (or how their content should be
> restricted so that there is a sensible way to resolve the embedding
> levels). I would be grateful for any suggestions.
>
> James
>
> _______________________________________________
> Unicode mailing list
> [email protected]
> http://unicode.org/mailman/listinfo/unicode
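
For illustration, the substitute-then-reorder approach Ken describes in the quoted message can be sketched in Python. This is only a minimal sketch under stated assumptions, not a bidi implementation: the embedding levels are copied from the bidiref trace rather than computed, `reorder_visual` is a hypothetical helper that implements only rule L2 of UAX #9, and ASCII uppercase stands in for RTL letters as in the examples in this thread.

```python
SOFT_HYPHEN = "\u00AD"


def reorder_visual(levels):
    """Rule L2 of UAX #9: return logical indices in left-to-right visual order.

    From the highest level down to the lowest odd level, reverse every
    contiguous run of characters whose level is at least that level.
    """
    order = list(range(len(levels)))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return order  # purely left-to-right line, nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = order[i:j][::-1]  # reverse this run in place
                i = j
            else:
                i += 1
    return order


# Logical line after breaking at the soft hyphen (uppercase = RTL stand-ins).
logical = "ABC car" + SOFT_HYPHEN

# Substitute a visible hyphen-minus for the soft hyphen at the break.
shown = logical[:-1] + "-" if logical.endswith(SOFT_HYPHEN) else logical

# Assumption: levels taken from the bidiref trace for the explicit
# hyphen-minus case in an RTL paragraph, not computed here.
levels = [1, 1, 1, 1, 2, 2, 2, 1]

order = reorder_visual(levels)
visual = "".join(shown[i] for i in order)
print(order)   # [7, 4, 5, 6, 3, 2, 1, 0]
print(visual)  # -car CBA
```

The resulting order matches the `Order:` line in the first trace, and the hyphen lands at the visual end (left edge) of the RTL line.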

