Re: delete ligature support for Arabic "la" from the less(1) command line

Ali Farzanrad Mon, 09 Sep 2019 17:59:07 -0700

Hi Ingo,

Thanks for your effort on Unicode support.  I hope my feedback as a
native Persian speaker will be helpful.


Ingo Schwarze <[email protected]> wrote:
> If i understand correctly, xterm(1) does indeed have that problem.
> I prepared a test file that contains, in this order,
> 
>  - some Latin characters
>  - the Arabic word "la" ("no"), i.e. first LAM, then ALEF
>  - some more Latin characters
>  - the Arabic word "al" ("the"), i.e. first ALEF, then LAM
>  - some final Latin characters
> 
> And indeed, xterm(1) does not respect the writing direction of the
> individual words.  When cat(1)'ing the file to stdout, both xterm(1)
> and konsole(1) show all the words from left to right, but *inside*
> each word, konsole(1) uses the correct writing direction: right to
> left for Arabic and left to right for Latin.  For example, in the
> Arabic word "al", konsole(1) correctly shows the ALEF right of the
> LAM, whereas xterm(1) wrongly shows the ALEF left of the LAM.
> 

There are many rules.  Each letter / character has a direction by
itself.  For example English letters are LTR (left-to-right), Arabic /
Persian letters are RTL, but some characters, say symbols, have no
direction.  For example, when you write:

    'A' '+' 'B'

It should be displayed as is (here '+' is treated as LTR), but when you
write:

    'A' ALEF '+' LAM 'B'

The '+' should be displayed on the left side of ALEF (here '+' is
treated as RTL):

    'A' LAM '+' ALEF 'B'

I think you need to detect all maximal non-LTR substrings (which don't
start or end with a symbol) inside LTR strings to render them correctly.
There are also RTL / LTR control characters in Unicode which manipulate
this behaviour.
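
To make the rule above concrete, here is a rough sketch in Python.  It
is only an illustration of the "maximal non-LTR substring" idea, not
the full Unicode Bidirectional Algorithm: it ignores digits, mirrored
characters and the direction control characters.  The function names
are made up for this example; only unicodedata.bidirectional() is a
real library call.

```python
import unicodedata

def direction(ch):
    """Return 'L', 'R', or 'N' (no strong direction) for one character."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "L"
    if bidi in ("R", "AL"):      # Hebrew-style and Arabic-style letters
        return "R"
    return "N"                   # symbols, spaces, digits, ...

def rtl_runs(text):
    """Find maximal runs that start and end with an RTL character and
    contain no LTR character.  Returns a list of (start, end) pairs."""
    runs = []
    start = last = None
    for i, ch in enumerate(text):
        d = direction(ch)
        if d == "R":
            if start is None:
                start = i
            last = i             # a run never ends on a symbol
        elif d == "L" and start is not None:
            runs.append((start, last + 1))
            start = last = None
    if start is not None:
        runs.append((start, last + 1))
    return runs

def visual_order(text):
    """Reverse every RTL run inside an otherwise LTR line."""
    chars = list(text)
    for s, e in rtl_runs(text):
        chars[s:e] = chars[s:e][::-1]
    return "".join(chars)

ALEF, LAM = "\u0627", "\u0644"
logical = "A" + ALEF + "+" + LAM + "B"
displayed = visual_order(logical)   # 'A' LAM '+' ALEF 'B'
```

With this sketch, 'A' '+' 'B' is left untouched because it contains no
RTL character at all.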

> I'm not entirely sure this has much to do with ligatures, though.
> What matters for building ligatures is only the logical ordering,
> the ordering in *time* so to speak, i.e. what comes before and what
> comes after.  LAM before ALEF has to become the ligature glyph "al",
> whereas ALEF before LAM remains two glyphs.  Technically, the
> question of ordering in space, whether glyphs are painted onto the
> screen right to left or left to right, only comes into play after
> characters have already been combined into glyphs.
> 
> Actually, now that you bring up the topic, i see another situation
> where less(1) causes an issue.  Let's use konsole(1) and not xterm(1)
> such that we get the correct writing direction, and let's put the
> word "al" onto the screen.  No ligature here, so that part of the
> topic is suspended for a moment.  Now let's slowly scroll right in
> one-column steps.  All is fine as long as the word "al" is completely
> visible on screen.  But when the final letter LAM of "al" is in the
> last (leftmost) column of the screen and you scroll right one more
> column, something weird happens, even in konsole(1).  You would
> expect the final letter LAM to scroll off screen first and the initial
> letter ALEF to remain on the screen for a little longer.  Instead,
> less(1) incorrectly thinks the *initial* letter of the word scrolls
> off screen first, and it tells xterm(1) to display the ALEF in the
> leftmost column of the screen while the LAM just went off-screen.
> That looks weird because there is no word in that text beginning
> with ALEF.
> 

It's a difficult problem.  You need to consider all maximal non-LTR
substrings, and all LTR / RTL modifiers.  Also consider a file with long
RTL lines: users prefer to see the beginning of lines (in all languages,
readers read from the start), so less(1) should display the right-most
part of each line, and when the user scrolls the text to the right,
less(1) should display the left side of each line.
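
As a rough sketch of what I mean, in Python: the function below is
hypothetical and assumes the line is already in visual order and that
every character is one column wide, which is not true in general.  It
says nothing about how less(1) is actually implemented.

```python
def visible_columns(visual_line, width, offset, rtl):
    """Return the part of a line that fits on screen.

    offset is how far the user has scrolled away from the start of the
    line.  For an RTL line the start is at the right edge, so the
    window is anchored on the right and moves left as offset grows.
    """
    if rtl:
        end = max(0, len(visual_line) - offset)
        start = max(0, end - width)
        return visual_line[start:end]
    return visual_line[offset:offset + width]

# Latin letters stand in for ten glyphs in visual (screen) order.
line = "abcdefghij"
first_view = visible_columns(line, 4, 0, rtl=True)    # right-most part
scrolled = visible_columns(line, 4, 2, rtl=True)      # window moved left
```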

I think that if xterm had a complete RTL mode with swapped right and
left keys, it might solve many problems.  In your example, in an RTL
xterm there would be no right scroll (because of the swapped keys), and
when you scroll less(1) to the left, less(1) would correctly scroll the
initial letter off screen.  Of course it would not work on complex mixed
RTL / LTR texts, but it solves the problem in the most common situations.

> This means that being able to properly view Arabic or Farsi text
> with the default OpenBSD terminal emulator and parser would require
> 
>  1. bidi support in xterm(1)
>     to render Farsi words with the correct writing direction
>  2. ligature support in xterm(1)
>     to correctly connect letters
>  3. bidi support in less(1)
>     to correctly scroll parts of words on and off screen, horizontally

Given the previous example (a file with long RTL lines), I don't
agree with adding bidi support to less(1).

>  4. ligature support in less(1)
>     for correct columnation
> 
> As far as i understand, you are saying that the extremely fragmentary
> support for item 4 which we happen to have right now is not really
> useful without items 1-3, and even when using konsole(1), which does
> have items 1 and 2, implementing item 3 before item 4 would make
> sense because item 3 is more important.
> 
> So my understanding is that you are not objecting to the patch because
> the fragmentary support for item 4 is practically useless in isolation.
> 
> 
> The following is not related to this patch, but i think it makes
> sense to mention it here: regarding the future, i think items 1 and
> 3 are much easier to support than items 2 and 4 because bidi support,
> if i understand correctly, only needs one bit of information per
> character because it only needs to know whether the character is
> part of a right to left or left to right script, so the complexity
> on the libc level, where we want complexity least of all places,
> is comparable to other boolean character properties like those
> listed in the iswalnum(3) manual page.  Realistically, though,
> bidi support would still be a large project, and i don't think it
> makes sense to tackle it any time soon.
> 

As mentioned before, each character has one of three directions: LTR,
RTL, or none.  So it needs at least 2 bits of information.  You also
need to handle LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK, and other
direction control characters.
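
A small Python sketch of such a 2-bit property, just to illustrate the
point.  The Dir enum and classify() are names I made up, and collapsing
the Unicode bidi classes into four values is my own simplification.
The class names in the set are the ones the Unicode character database
uses for the embedding, override and isolate controls.

```python
import unicodedata
from enum import IntEnum

class Dir(IntEnum):
    """Four values, so the property fits in 2 bits."""
    NEUTRAL = 0   # symbols, spaces, digits, ...
    LTR = 1
    RTL = 2
    CONTROL = 3   # embedding / override / isolate controls

_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF",
                    "LRI", "RLI", "FSI", "PDI"}

def classify(ch):
    """Collapse the Unicode bidi class of ch into a 2-bit value."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in _CONTROL_CLASSES:
        return Dir.CONTROL
    if bidi == "L":
        return Dir.LTR
    if bidi in ("R", "AL"):
        return Dir.RTL
    return Dir.NEUTRAL

LRM = "\u200e"   # LEFT-TO-RIGHT MARK
RLM = "\u200f"   # RIGHT-TO-LEFT MARK
RLO = "\u202e"   # RIGHT-TO-LEFT OVERRIDE

# The two marks are invisible but behave like strong letters.
marks = (classify(LRM), classify(RLM))
```

So LRM and RLM need no extra value of their own, but the other control
characters do, which already uses up the fourth value of the 2 bits.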

> Ligature support feels much worse than bidi support because the
> mapping required is not merely character -> boolean but (character +
> character) -> character, which is more complicated than even the
> (character + character) -> -1/0/+1 mapping required for collation
> support - and we decided that we don't want collation support in
> libc because it would cause excessive complexity.  Admittedly,
> collations are strongly locale-dependent, while i'm not sure ligatures
> are locale-dependent, so with some luck, they might be simpler in
> that respect.  But a pair-to-character mapping, even without locale
> dependency, still sounds so scary that i doubt we want it in libc
> even in the long term.
> 
> Thanks, you helped make the big picture a bit clearer for me.
> 
> Yours,
>   Ingo
> 
> 


Best Regards
