Re: Unicode Bidi Algorithm – Java reference implementation

2016-09-19 Thread Philippe Verdy
I note that there's a confusion in the introduction of UAX#9:

"On web pages, the explicit directional formatting characters (of all types
– embedding, override, and isolate) should be replaced by using the dir
attribute and the elements BDI and BDO."

The suggested replacements do not match the order of the listed types.
- embedding (with LRE/PDF or RLE/PDF) just uses the dir="ltr/rtl" attribute
on any element (except BDI and BDO)
- override (with LRO/PDF or RLO/PDF) uses BDO with
the dir="ltr/rtl" attribute
- explicit isolate (with LRI/PDI or RLI/PDI) uses BDI with
the dir="ltr/rtl" attribute
- "automatic" isolate (with FSI/PDI) uses BDI without any dir attribute
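
The mapping listed above can be sketched as a small converter. This is only
an illustration, not a full UBA implementation: the function name is made up,
the use of SPAN for embeddings is an assumption (any element other than BDI
and BDO would do), the input is treated as a single paragraph, and the
overflow rules of UAX #9 are ignored.

```python
import html

# Opening controls mapped to (element, dir value or None), per the list above.
# Using <span> for embeddings is an assumption made for this sketch.
OPENERS = {
    "\u202A": ("span", "ltr"),  # LRE
    "\u202B": ("span", "rtl"),  # RLE
    "\u202D": ("bdo", "ltr"),   # LRO
    "\u202E": ("bdo", "rtl"),   # RLO
    "\u2066": ("bdi", "ltr"),   # LRI
    "\u2067": ("bdi", "rtl"),   # RLI
    "\u2068": ("bdi", None),    # FSI: BDI without any dir attribute
}
PDF = "\u202C"
PDI = "\u2069"


def bidi_controls_to_html(text):
    """Replace explicit bidi formatting characters with HTML markup."""
    out = []
    stack = []  # open elements as (tag, is_isolate)
    for ch in text:
        if ch in OPENERS:
            tag, direction = OPENERS[ch]
            attr = f' dir="{direction}"' if direction else ""
            out.append(f"<{tag}{attr}>")
            stack.append((tag, tag == "bdi"))
        elif ch == PDF:
            # PDF only closes an embedding or override, never an isolate.
            if stack and not stack[-1][1]:
                out.append(f"</{stack.pop()[0]}>")
        elif ch == PDI:
            # PDI closes the innermost isolate and anything left open inside
            # it; an unmatched PDI is dropped.
            if any(is_isolate for _, is_isolate in stack):
                while True:
                    tag, is_isolate = stack.pop()
                    out.append(f"</{tag}>")
                    if is_isolate:
                        break
        else:
            out.append(html.escape(ch))
    # Anything still open is closed at the end of the paragraph.
    while stack:
        out.append(f"</{stack.pop()[0]}>")
    return "".join(out)
```

For example, a string containing FSI, some text, then PDI comes out wrapped
in a plain BDI element, while an RLO/PDF pair becomes BDO with dir="rtl".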

Two implicit directional characters (LRM or RLM) are also convertible to
overrides as an empty BDO element with dir="ltr/rtl". Only ALM has no
equivalent.



But in most cases, HTML documents should simply not use embeddings or
overrides at all. Isolates with BDI are much preferred, and are in fact
simpler to manage than what section 6.4 suggests. That suggestion (using RLM
or LRM before the separating punctuation) does not work reliably, as it
implies that you can predict the implicit reading direction of the whole
list, whose ordering normally depends on the context or on the document
containing the list. It is much simpler to isolate each list element and
then pack the list using unmarked punctuation.

An example of this is found on international wikis that must display an
inter-language bar to navigate to other translated versions of the same
page. The same template is used on all pages, and the list of languages
cannot be predicted and may evolve over time: it contains LTR or RTL
language names at unpredictable positions anywhere in the list, formatted
with the same separator within a single inline span, in a paragraph that
starts with a translatable introductory heading, and you cannot predict
which language name will occur after a given separator. Using BDI (without
even needing any dir="ltr/rtl") or FSI/PDI to isolate each language name
works much better than unconditionally using a static RLM or LRM before the
separating punctuation. (Note that there is no such punctuation at the start
of the list, so the ordering of the first element is not set correctly
unless there is also an RLM or LRM before that first element, which may then
render incorrectly.)

The best and most flexible solution is to use "automatic" isolates for each
list item (with FSI/PDI in plain-text documents, or BDI elements without
any dir attribute in HTML documents). The same is also true when inserting
quotations (including when giving the title of another document, or the
name of an author) or for formatting translatable text containing
"placeholder variables" whose content will be generated separately. BDI
elements without any dir attribute can efficiently replace SPAN elements,
and can still have their own optional formatting styles (colors, font
families, font size, line height, font styles and weight, visual
effects...), or title attributes (to give hints to readers about what the
isolated value is used for), or identifiers (useful to generate stable
anchors that work across all translations of the document).
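
As a minimal sketch of that approach (the function names and the " | "
separator are only illustrative), each list item is isolated on its own and
the separators are left unmarked:

```python
import html

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def join_isolated_plain(items, separator=" | "):
    """Plain text: wrap every item in FSI ... PDI, then join."""
    return separator.join(f"{FSI}{item}{PDI}" for item in items)


def join_isolated_html(items, separator=" | "):
    """HTML: wrap every item in a BDI element without any dir attribute."""
    wrapped = (f"<bdi>{html.escape(item)}</bdi>" for item in items)
    return html.escape(separator).join(wrapped)


# Language names in mixed directions, in an order the template cannot predict.
names = ["English", "العربية", "Français", "עברית"]
print(join_isolated_html(names))
```

The same helpers would apply to quotations or to placeholder values
substituted into a translated message: the caller never needs to know the
direction of the inserted text.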

There are also CSS styles using the unicode-bidi property, but they should be
completely avoided in HTML (these styles are better inferred from BDI
elements).



2016-09-19 2:16 GMT+02:00 Ken Whistler:

>
> On 9/17/2016 10:26 AM, Deepak Jois wrote:
>
>> I now need to make the updates to support the changes in Unicode 8.0,
>> and I am finding it a bit hard to grok the changes in C at a glance.
>>
>>
> The UBA 7.0 --> UBA 8.0 changes were rather subtle. They did not change
> much about the gross behavior of the algorithm, but there were some fixes
> for edge cases in a couple rules. Also, the specification of behavior on
> stack overflow became exact, rather than implementation-defined.
>
> The C bidi reference code is a bit complicated, because it supports *all*
> UBA versions from 6.2 through 8.0, which means it has to special case rule
> processing by versions when the specification itself changes.
>
> If you diff the 7.0 version of brrule.c and the 8.0 version of brrule.c
> you'll find the heart of the differences there, along with explanations in
> comments for the changes. The new function br_SetBracketPairBC handles an
> edge case for combining marks following a bracket. The code using a new
> flag testONisNotRequired deals with an edge case for the current Bidi_Class
> of brackets being tested for pairing. Changes in br_PushBracketStack are
> involved in the need to keep the pre-8.0 behavior as it was for earlier
> versions of bidiref, but allowing for explicit behavior for stack overflow
> for 8.0.
>
> It may also help to compare the 7.0 and 8.0 versions of UAX #9 itself, so
> you can see the textual changes in the specification of the rules. Try
> diffing:
>
> http://www.unicode.org/reports/tr9/tr9-31.html (7.0)
> http://www.unicode.org/reports/tr9/tr9-33.html (8.0)

Re: Unicode Bidi Algorithm – Java reference implementation

2016-09-18 Thread Ken Whistler


On 9/17/2016 10:26 AM, Deepak Jois wrote:

> I now need to make the updates to support the changes in Unicode 8.0,
> and I am finding it a bit hard to grok the changes in C at a glance.



The UBA 7.0 --> UBA 8.0 changes were rather subtle. They did not change 
much about the gross behavior of the algorithm, but there were some 
fixes for edge cases in a couple rules. Also, the specification of 
behavior on stack overflow became exact, rather than implementation-defined.


The C bidi reference code is a bit complicated, because it supports 
*all* UBA versions from 6.2 through 8.0, which means it has to special 
case rule processing by versions when the specification itself changes.


If you diff the 7.0 version of brrule.c and the 8.0 version of brrule.c 
you'll find the heart of the differences there, along with explanations 
in comments for the changes. The new function br_SetBracketPairBC 
handles an edge case for combining marks following a bracket. The code 
using a new flag testONisNotRequired deals with an edge case for the 
current Bidi_Class of brackets being tested for pairing. Changes in 
br_PushBracketStack are involved in the need to keep the pre-8.0 
behavior as it was for earlier versions of bidiref, but allowing for 
explicit behavior for stack overflow for 8.0.


It may also help to compare the 7.0 and 8.0 versions of UAX #9 itself, 
so you can see the textual changes in the specification of the rules. 
Try diffing:


http://www.unicode.org/reports/tr9/tr9-31.html (7.0)
http://www.unicode.org/reports/tr9/tr9-33.html (8.0)

The significant changes there are in BD11, BD14, BD15, BD16, and in 
rules X5a, X5b, X6a, and N0. (The rest of the changes in the updated 
document are cosmetic.)
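
One possible way to run that comparison locally, as a rough sketch: save
both revisions as text first, then feed them to a line diff. The function
name and the labels are made up for this example, and diffing the raw HTML
will be noisy unless the markup is stripped beforehand.

```python
import difflib


def diff_spec_text(old_text, new_text, old_label="tr9-31", new_label="tr9-33"):
    """Return a unified diff of two revisions already loaded as text."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
        n=1,  # one line of context keeps the output short
    )
    return "\n".join(lines)
```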


--Ken



Re: Unicode Bidi Algorithm – Java reference implementation

2016-09-17 Thread Deepak Jois
On Sat, Sep 17, 2016 at 9:53 PM, Khaled Hosny wrote:
> I think there is a C implementation that is kept up to date,

Yes, I found that one after I posted. FWIW, here are the changes for
the latest version:

https://gist.github.com/deepakjois/5a3ae81a105abd3523ed0efe2e52f52e/revisions

> is also a Python implementation that should pass the tests

That implementation looks very different from the C and Java versions.
I can’t tell at a glance whether it has been updated for the
changes in Unicode 8.0. But it definitely will not pass the tests in
BidiCharacterTest.txt because it lacks support for paired brackets.

I just finished writing a reference implementation in Lua[1] which is
a line-by-line port of the Java reference implementation and passes
nearly all tests in BidiCharacterTest.txt.
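
For anyone wiring up a similar test run, here is a small sketch of reading
one data line of BidiCharacterTest.txt, based on the field layout described
in that file's header (code points; paragraph direction; resolved paragraph
level; resolved levels; visual order). The class and function names are
invented for this example, and the sample line in the usage note is
hand-written, not copied from the file.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BidiCase:
    text: str
    paragraph_direction: int     # 0 = LTR, 1 = RTL, 2 = auto (rules P2/P3)
    paragraph_level: int         # resolved paragraph embedding level
    levels: List[Optional[int]]  # None where the file has 'x' (removed by X9)
    visual_order: List[int]      # indices in visual order, left to right


def parse_bidi_character_test_line(line):
    """Parse one line; return None for blank lines and comments."""
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    text = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    levels = [None if lv == "x" else int(lv) for lv in fields[3].split()]
    order = [int(index) for index in fields[4].split()]
    return BidiCase(text, int(fields[1]), int(fields[2]), levels, order)
```

With a hand-written line such as
"0061 0028 0062 0029 0063;0;0;0 0 0 0 0;0 1 2 3 4", the parsed case has the
text "a(b)c" and five levels of 0. Positions where the level is None should
be skipped when comparing against an implementation's output.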

I now need to make the updates to support the changes in Unicode 8.0,
and I am finding it a bit hard to grok the changes in C at a glance.

Deepak

[1]: https://github.com/deepakjois/luabidi/blob/master/src/bidi.lua



Re: Unicode Bidi Algorithm – Java reference implementation

2016-09-17 Thread Khaled Hosny
On Sat, Sep 17, 2016 at 05:01:10PM +0530, Deepak Jois wrote:
> Hi
> 
> It seems that the Java reference implementation for the Unicode Bidi
> algorithm that I downloaded from the unicode.org site fails against
> some test cases in the BidiCharacterTest.txt file – the ones that are
> specifically meant to test for changes in Unicode 8.0.
> 
> Has the reference implementation been updated, and does anyone have a
> copy they can share? Is there a reference implementation in some other
> language that I could look at, which has been updated?

I think there is a C implementation that is kept up to date, and there
is also a Python implementation that should pass the tests:
https://github.com/behdad/pybyedie

Regards,
Khaled


Unicode Bidi Algorithm – Java reference implementation

2016-09-17 Thread Deepak Jois
Hi

It seems that the Java reference implementation for the Unicode Bidi
algorithm that I downloaded from the unicode.org site fails against
some test cases in the BidiCharacterTest.txt file – the ones that are
specifically meant to test for changes in Unicode 8.0.

Has the reference implementation been updated, and does anyone have a
copy they can share? Is there a reference implementation in some other
language that I could look at, which has been updated?

Thank you
Deepak