Re: delete ligature support for Arabic "la" from the less(1) command line

Ingo Schwarze Sun, 01 Sep 2019 01:52:03 -0700

Hello Mohammadreza,

Mohammadreza Abdollahzadeh wrote on Sun, Sep 01, 2019 at 09:40:16AM +0430:


> Persian is my native language and I think that the major problem that
> all RTL (Right-To-Left) languages like Persian and Arabic currently suffer
> from is the lack of BiDi (Bidirectionality) support in console and terminal
> environments like xterm(1). KDE konsole(1) supports bidi and that's why it
> shows ligatures correctly.
> I think any attempt to fix such problems must first start with adding bidi
> support to xterm and other terminal environments.

Thank you for your feedback!

If i understand correctly, xterm(1) does indeed have that problem.
I prepared a test file that contains, in this order,

 - some Latin characters
 - the Arabic word "la" ("no"), i.e. first LAM, then ALEF
 - some more Latin characters
 - the Arabic word "al" ("the"), i.e. first ALEF, then LAM
 - some final Latin characters

And indeed, xterm(1) does not respect the writing direction of the
individual words.  When cat(1)'ing the file to stdout, both xterm(1)
and konsole(1) show all the words from left to right, but *inside*
each word, konsole(1) uses the correct writing direction: right to
left for Arabic and left to right for Latin.  For example, in the
Arabic word "al", konsole(1) correctly shows the ALEF right of the
LAM, whereas xterm(1) wrongly shows the ALEF left of the LAM.

I'm not entirely sure this has much to do with ligatures, though.
What matters for building ligatures is only the logical ordering,
the ordering in *time* so to speak, i.e. what comes before and what
comes after.  LAM before ALEF has to become the ligature glyph "la",
whereas ALEF before LAM remains two glyphs.  Technically, the
question of ordering in space, whether glyphs are painted onto the
screen right to left or left to right, only comes into play after
characters have already been combined into glyphs.
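
To illustrate that only the logical order matters, here is a minimal
Python sketch.  The function name shape() is hypothetical, and using
the isolated presentation form U+FEFB is an assumption made for
brevity; a real shaping engine would pick contextual forms and handle
many more cases.

```python
# Sketch only: combine LAM followed by ALEF into a single ligature glyph.
# The decision looks at logical (storage) order and nothing else;
# the writing direction on screen plays no role here.
LAM = "\u0644"       # ARABIC LETTER LAM
ALEF = "\u0627"      # ARABIC LETTER ALEF
LAM_ALEF = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM


def shape(text):
    """Return text with each LAM+ALEF pair replaced by one glyph."""
    return text.replace(LAM + ALEF, LAM_ALEF)


la = LAM + ALEF   # the word "la": LAM first, then ALEF
al = ALEF + LAM   # the word "al": ALEF first, then LAM

print(len(shape(la)))  # one glyph
print(len(shape(al)))  # still two glyphs
```

The word "la" collapses to one glyph, while "al" passes through
unchanged, before any question of painting direction arises.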

Actually, now that you bring up the topic, i see another situation
where less(1) causes an issue.  Let's use konsole(1) and not xterm(1)
such that we get the correct writing direction, and let's put the
word "al" onto the screen.  No ligature here, so that part of the
topic is suspended for a moment.  Now let's slowly scroll right in
one-column steps.  All is fine as long as the word "al" is completely
visible on screen.  But when the final letter LAM of "al" is in the
last (leftmost) column of the screen and you scroll right one more
column, something weird happens, even in konsole(1).  You would
expect the final letter LAM to scroll off screen first and the initial
letter ALEF to remain on the screen for a little longer.  Instead,
less(1) incorrectly thinks the *initial* letter of the word scrolls
off screen first, and it tells konsole(1) to display the ALEF in the
leftmost column of the screen while the LAM just went off-screen.
That looks weird because there is no word in that text beginning
with ALEF.

This means that being able to properly view Arabic or Farsi text
with the default OpenBSD terminal emulator and pager would require

 1. bidi support in xterm(1)
    to render Farsi words with the correct writing direction
 2. ligature support in xterm(1)
    to correctly connect letters
 3. bidi support in less(1)
    to correctly scroll parts of words on and off screen, horizontally
 4. ligature support in less(1)
    for correct columnation
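
To make item 4 more concrete, the following Python sketch counts
display columns for a line of text.  It is not the code in less(1);
the function name columns() is hypothetical, and it assumes that the
terminal paints LAM followed by ALEF into a single cell and every
other character into exactly one cell (no wide or combining
characters).

```python
# Sketch only: count terminal columns under the assumptions stated above.
LAM = "\u0644"   # ARABIC LETTER LAM
ALEF = "\u0627"  # ARABIC LETTER ALEF


def columns(text, ligatures=True):
    """Count cells, treating LAM+ALEF as one cell if ligatures is set."""
    count = 0
    i = 0
    while i < len(text):
        if ligatures and text.startswith(LAM + ALEF, i):
            i += 2   # two characters share one cell
        else:
            i += 1   # one character, one cell
        count += 1
    return count


line = "ab" + LAM + ALEF + "cd"
print(columns(line))                   # ligature-aware count
print(columns(line, ligatures=False))  # naive count, one column too many
```

If the pager counts the naive way while the terminal draws the
ligature, everything after the word "la" ends up one column off.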

As far as i understand, you are saying that the extremely fragmentary
support for item 4 which we happen to have right now is not really
useful without items 1-3, and even when using konsole(1), which does
have items 1 and 2, implementing item 3 before item 4 would make
sense because item 3 is more important.

So my understanding is that you do not object to the patch, because
the fragmentary support for item 4 is practically useless in isolation.


The following is not related to this patch, but i think it makes
sense to mention it here.  Regarding the future, i think items 1 and
3 are much easier to support than items 2 and 4.  If i understand
correctly, bidi support only needs one bit of information per
character: whether the character is part of a right to left or a
left to right script.  So the complexity on the libc level, where
we want complexity least of all places, is comparable to other
boolean character properties like those listed in the iswalnum(3)
manual page.  Realistically, though, bidi support would still be a
large project, and i don't think it makes sense to tackle it any
time soon.
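
As a rough illustration of such a per-character boolean property,
here is a Python sketch built on the standard unicodedata module.
The helper name is_rtl() is hypothetical.  Note that the Unicode
database distinguishes more bidirectional classes than two; the
sketch simply collapses them into one bit, which is an assumption
and not a full bidi implementation.

```python
# Sketch only: a boolean "right to left" property per character,
# similar in shape to the tests listed in iswalnum(3).
import unicodedata

ALEF = "\u0627"  # ARABIC LETTER ALEF
LAM = "\u0644"   # ARABIC LETTER LAM


def is_rtl(ch):
    """True if ch has a strong right-to-left bidirectional class."""
    # "R" covers scripts such as Hebrew, "AL" covers Arabic letters.
    return unicodedata.bidirectional(ch) in ("R", "AL")


print([is_rtl(c) for c in "a" + ALEF + LAM])  # [False, True, True]
```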

Ligature support feels much worse than bidi support because the
mapping required is not merely character -> boolean but (character +
character) -> character, which is more complicated than even the
(character + character) -> -1/0/+1 mapping required for collation
support - and we decided that we don't want collation support in
libc because it would cause excessive complexity.  Admittedly,
collations are strongly locale-dependent, while i'm not sure ligatures
are locale-dependent, so with some luck, they might be simpler in
that respect.  But a pair-to-character mapping, even without locale
dependency, still sounds so scary that i doubt we want it in libc
even in the long term.

Thanks, you helped make the big picture a bit clearer for me.

Yours,
  Ingo
