Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

Asmus Freytag Mon, 21 Apr 2014 10:56:12 -0700

Philippe,

I fail to understand how your post contributes to the topic.

The issue was unclear wording of the specification, not deficiencies in the UBA or the PBA in general.

Let's keep this discussion limited to issues of wording for the *existing* specification. Feel free to start a new discussion about something else under a new subject.


A./

On 4/21/2014 9:18 AM, Philippe Verdy wrote:

There are some cases where these rules will not be clear enough. Look at the following, where overlaps do occur but directionality still matters:
"This is an [«] example [»] for demonstration only."
There are two possible parsings if you consider only a hierarchical layout in which overlaps are disallowed:
1. "This is an [...] for demonstration only.", embedding "«...»", itself embedding "] example [" (here the square brackets match externally)
2. "This is an [...] example [...] for demonstration only.", embedding two spans for "«" and "»" separately (they also pair externally)
Now suppose that the term "example" is translated into Arabic. It is not very clear how the UBA will work while preserving the correct pairing direction of the 3 pairs (one pair is "«...»", and there are two pairs of "[...]"). Still, all 3 pairs have a coherent direction that Bidi reordering or glyph mirroring should not mix.
I see only one solution for tagging such text so that it behaves correctly: either the two pairs of square brackets or the pair of guillemets should be encoded with isolated Bidi overrides. But then what happens to the ordering of the surrounding text?
There should be a stable way to encode this case so that the UBA still preserves the correct reading order, the expected semantics and orientation of the pairs, and the fact that the guillemets are effectively not embedding the brackets but the translated word "example".
There are several ways to use Bidi-override or Bidi-embedding controls; I don't know which one is better, but all of them should still work with the UBA. I just hope that the complex case of the brackets in the middle ("]...[") can be handled gracefully.
In my opinion this would require embedding and isolating each square bracket, so that they no longer match together (externally they are treated as symbols with transparent direction). But how do we ensure that the sequence "[«]" still occurs before the RTL (Arabic) word for "example", followed by the sequence "[»]", and that the rest of the sentence ("for demonstration only") still occurs in the correct order? We would also have to embed or isolate "example", or the whole sequence "[«] example [»]", so that the main sentence "This is an ... for demonstration only" still has a coherent reading direction.
Such cases are not so exceptional, because they occur when representing two distinct parallel readings of the same text, where the reading for one kind of pair simply treats the other pairs as ignored, "transparently".
It would be an interesting case to investigate for validating UBA algorithms in a conformance test case.
2014-04-21 16:32 GMT+02:00 Asmus Freytag <[email protected]>:
    On 4/21/2014 1:33 AM, Eli Zaretskii wrote:
    Date: Sun, 20 Apr 2014 23:03:20 -0700
    From: Asmus Freytag <[email protected]>
    CC: Eli Zaretskii <[email protected]>, [email protected],
      Kenneth Whistler <[email protected]>
             Note that the current embedding level is not changed by this rule.

         What does this last sentence mean by "the current embedding level"?
         The first bullet of X6 mandates that "the current character’s
         embedding level" _is_ changed by this rule, so what other "current
         embedding level" is alluded to here?
         I'm punting on that one - can someone else answer this?


    I assume "current embedding level" here meant "the embedding level of
    the last entry on the directional status stack". (This is a natural
    slip to make if you think in terms of an optimized implementation that
    stores each component of the top of the directional status stack in a
    variable, as suggested in 3.3.2.)

    James
    In general, I heartily dislike "specifications" that just narrate a
    particular implementation...
    I cannot agree more.

    In fact, my main gripe about the UBA additions in 6.3 is that some of
    their crucial parts are not formally defined, except by an algorithm
    that narrates a specific implementation.  The two worst examples of
    that are the "definitions" of the isolating run sequence and of the
    bracket pair.  I didn't ask about those because I managed to figure
    them out, but it took many readings of the corresponding parts of the
    document.  It is IMO a pity that the two main features added in 6.3
    are based on definitions that are so hard to penetrate, and which
    actually all but force you to use the specific implementation
    described by the document.

    My working definition that replaces BD13 is this:

       An isolating run sequence is the maximal sequence of level runs of
       the same embedding level that can be obtained by removing all the
       characters between an isolate initiator and its matching PDI (or
       paragraph end, if there is no matching PDI) within those level runs.
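
As a rough illustration of this working definition (not the normative BD13 text), here is a minimal Python sketch. It assumes a single paragraph whose embedding levels have already been resolved by rules X1-X8 and whose Bidi classes are given as plain strings. The function names `matching_pdis`, `level_runs` and `isolating_run_sequences` are made up for this example and do not come from any reference implementation.

```python
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def matching_pdis(types):
    """Map the index of each isolate initiator to its matching PDI, if any."""
    match = {}
    open_initiators = []
    for i, t in enumerate(types):
        if t in ISOLATE_INITIATORS:
            open_initiators.append(i)
        elif t == "PDI" and open_initiators:
            match[open_initiators.pop()] = i
    return match


def level_runs(levels):
    """Split the paragraph into maximal (start, end) runs of equal level."""
    runs = []
    start = 0
    for i in range(1, len(levels) + 1):
        if i == len(levels) or levels[i] != levels[start]:
            runs.append((start, i))  # end index is exclusive
            start = i
    return runs


def isolating_run_sequences(types, levels):
    """Chain level runs across each isolate initiator / matching PDI gap."""
    match = matching_pdis(types)
    matched_pdis = set(match.values())
    runs = level_runs(levels)
    run_starting_at = {start: (start, end) for start, end in runs}
    sequences = []
    for run in runs:
        if run[0] in matched_pdis:
            continue  # this run is reached through its isolate initiator
        seq = [run]
        # While the last run ends with a matched isolate initiator, jump
        # over the isolated text to the run that starts at its PDI.
        # (Assumes consistent levels, so that PDI does start a level run.)
        while seq[-1][1] - 1 in match:
            seq.append(run_starting_at[match[seq[-1][1] - 1]])
        sequences.append(seq)
    return sequences
```

For the classes `["L", "RLI", "R", "PDI", "L"]` with levels `[0, 0, 1, 0, 0]`, this returns `[[(0, 2), (3, 5)], [(2, 3)]]`: the two level-0 runs are joined into one sequence once the isolated text is skipped, and the isolated text forms a sequence of its own.
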

    As for bracket pair (BD16), I'm really amazed that a concept as easy
    and widely known/used as this would need such an obscure definition
    that must have an algorithm as its necessary part.  How about this
    instead:

       A bracket pair is a pair of characters, an opening paired bracket and
       a closing paired bracket, within the same isolating run sequence,
       such that the Bidi_Paired_Bracket property value of the former
       character or its canonical equivalent equals the latter character or
       its canonical equivalent, and all the opening and closing bracket
       characters in between these two are balanced.

    Then we could use the algorithm to explain what it means for brackets
    to be balanced (for those readers who somehow don't already know
    that).

    Again, thanks for clarifying these subtle issues.  I can now proceed
    to updating the Emacs bidirectional display with the changes in
    Unicode 6.3.
    FWIW here is the restatement of BD16 that I used for myself (and that I
    put into the source comments of the sample Java implementation):

        // The following is a restatement of BD 16 using non-algorithmic language.
        //
        // A bracket pair is a pair of characters consisting of an opening
        // paired bracket and a closing paired bracket such that the
        // Bidi_Paired_Bracket property value of the former equals the latter,
        // subject to the following constraints.
        // - both characters of a pair occur in the same isolating run sequence
        // - the closing character of a pair follows the opening character
        // - any bracket character can belong at most to one pair, the earliest possible one
        // - any bracket character not part of a pair is treated like an ordinary character
        // - pairs may nest properly, but their spans may not overlap otherwise

        // Bracket characters with canonical decompositions are supposed to be treated
        // as if they had been normalized, to allow normalized and non-normalized text
        // to give the same result.

    Your language is more concise, but you may compare for differences.
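
To make the constraints in the restatement concrete, here is a minimal Python sketch of stack-based pairing over one isolating run sequence. It is only an illustration under stated assumptions: `PAIRED_BRACKET` is a hypothetical, tiny stand-in for the real Bidi_Paired_Bracket data, the helper names are invented for this example, and implementation limits are ignored. It is not the sample Java implementation mentioned above.

```python
import unicodedata

# Hypothetical, deliberately tiny stand-in for the Bidi_Paired_Bracket data
# (the real property data covers many more characters).
PAIRED_BRACKET = {"(": ")", "[": "]", "{": "}", "\u3008": "\u3009"}
CLOSING = set(PAIRED_BRACKET.values())


def canonical(ch):
    # Treat a bracket as if it had been normalized, so that U+2329/U+232A
    # behave like their canonical equivalents U+3008/U+3009.
    return unicodedata.normalize("NFC", ch)


def bracket_pairs(seq):
    """Return (opening_index, closing_index) pairs, sorted by opening index.

    `seq` stands for the characters of ONE isolating run sequence.
    """
    stack = []  # entries: (expected closing character, index of the opener)
    pairs = []
    for i, ch in enumerate(seq):
        ch = canonical(ch)
        if ch in PAIRED_BRACKET:
            stack.append((PAIRED_BRACKET[ch], i))
        elif ch in CLOSING:
            # Search from the top of the stack for the nearest opener that
            # expects this closer.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    pairs.append((stack[depth][1], i))
                    # Discard the matched opener and everything above it:
                    # brackets opened inside a closed span stay unpaired.
                    del stack[depth:]
                    break
            # A closer with no matching opener is simply left unpaired.
    return sorted(pairs)
```

With this sketch, `bracket_pairs("a(b[c]d)e")` gives `[(1, 7), (3, 5)]` (proper nesting), while `bracket_pairs("[(])")` gives `[(0, 2)]`: the parenthesis opened at index 1 is discarded when `]` closes the outer pair, so the later `)` is treated like an ordinary character and the spans never overlap.
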

    A./

    _______________________________________________
    Unicode mailing list
    [email protected]
    http://unicode.org/mailman/listinfo/unicode

