delete ligature support for Arabic "la" from the less(1) command line

Ingo Schwarze Sat, 31 Aug 2019 15:33:41 -0700

Hi,

i have to admit that i am neither able to speak nor to write nor
to understand the Arabic language nor the Arabic script, but here
is my current, probably incomplete understanding of what our less(1)
program is trying to do with Arabic ligatures.


If somebody is reading this who is able to read and write Arabic
or an Indian language heavily using ligatures, feedback is highly
welcome.

Arabic is a cursive script, which means that when writing Arabic,
characters do not map 1:1 to glyphs.  Instead, there are rules about
how adjacent characters attach to each other, forming ligatures.

As an extremely simple example, consider the Arabic adverb "la",
which means the same as the English adverb "no".  It consists of
the two letters U+0644 LAM and U+0627 ALEF, the LAM appearing before
(i.e. to the right of) the ALEF.  However, you do not write both
letters separately.  Instead, the ALEF leans forward (to the left)
and attaches to the LAM, forming the glyph U+FEFB, ARABIC LIGATURE
LAM WITH ALEF ISOLATED FORM.  When displayed in a fixed width font,
that ligature only occupies a single display column just like any
other Arabic or Latin glyph.  The LAM WITH ALEF glyph is not a
double-width glyph like Japanese or Chinese characters typically
are.

So, when this happens, you have four bytes of UTF-8 forming two
Unicode characters, and *together*, these two characters occupy
only one single display column.
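
The byte count above can be checked with a small sketch. The helper
names utf8_encode() and encode_la() are hypothetical and are not part
of less(1); the encoder is a plain textbook UTF-8 encoder that does
not reject surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 into buf, which must have
 * room for at least 4 bytes.  Returns the number of bytes written.
 * Hypothetical helper for illustration; not taken from less(1).
 */
size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
	assert(cp <= 0x10FFFF);
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	buf[0] = (unsigned char)(0xF0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Encode the word "la" in logical order: U+0644 LAM, then U+0627 ALEF.
 * buf must have room for at least 8 bytes.  Returns the total length.
 */
size_t
encode_la(unsigned char *buf)
{
	size_t n;

	n = utf8_encode(0x0644, buf);
	n += utf8_encode(0x0627, buf + n);
	return n;
}
```

Calling encode_la() yields the four bytes D9 84 D8 A7: two bytes per
character, two characters, but only one display column once a
terminal renders them as a ligature.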

Note that in the default configuration, our xterm(1) is not able
to display Arabic characters at all.  But even when you run
  xterm -fa arabic
or
  xterm -fa fixed
both of which use FreeType instead of the default X toolkit font
support, such that xterm(1) does become able to display single
Arabic characters, it still displays the word "la" incorrectly,
failing to generate the required ligature and instead displaying
the two characters LAM and ALEF separately.

So i installed konsole-18.12.0p1 for testing (which pulls in a
ridiculous number of dependencies, dozens of them, but oh well,
i guess support for advanced Unicode features isn't trivial).
The konsole(1) program does display the word "la" correctly, as a
ligature.

Now, running less(1) inside konsole(1), i found that columnation
is already subtly broken.  As long as the "la" ligature is visible
on screen, all is fine.  Now scroll to the right until the "la"
appears in the first screen column.  Then scroll one more column
to the right by pressing "1 RIGHTARROW".  Now you see *half* the
ligature, i.e. an isolated ALEF, in the first column of the screen,
even though the Arabic word does not contain an isolated ALEF.
Besides, we just attempted to scroll the "la" off screen, so the
ALEF now appears in the column one to the right of where the "la"
should actually be, and all the rest of the line is shifted one
column to the right, too, so columnation is now off by one.
Scrolling back left, columnation recovers to correct display.

I strongly suspect i broke that during my previous UTF-8 cleanup
work on less(1).

However, LAM WITH ALEF is literally the only ligature that less(1)
supports, together with three variations (with MADDA above, with
HAMZA above, and with HAMZA below).  But there are hundreds of
ligatures in Arabic, see

  https://www.unicode.org/charts/PDF/UFB50.pdf
  https://www.unicode.org/charts/PDF/UFE70.pdf

I have no idea how many of those work in konsole(1) - but i'm sure
none of those, except the four LAM WITH ALEF discussed here, work
with less(1), so i think support for LAM WITH ALEF provided no value
in the first place.  The way it is implemented, with an ad-hoc table
inside less(1) of character combinations that form ligatures, is
just wrong and not sustainable by any stretch of the imagination,
i think.

On top of that, how characters combine in Arabic is strongly context
dependent; even the syllable "la" forms a different ligature depending
on whether it is isolated or at the end of a longer word, and none
of the context dependencies are implemented in less(1) anyway.
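
To make the context dependency concrete, here is a minimal sketch of
choosing between the isolated and final presentation forms of the
four LAM WITH ALEF ligatures in the Arabic Presentation Forms-B block
(U+FEF5 to U+FEFC).  The function name lam_alef_ligature() and the
lam_joins_prev flag are hypothetical.  Deciding whether the LAM really
joins the preceding letter needs per-letter joining data that is not
modelled here, and nothing like this exists in less(1).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the presentation-form code point for LAM followed by the
 * given ALEF variant, or 0 if alef is not one of the four variants.
 * lam_joins_prev is nonzero if the LAM connects to the letter before
 * it, in which case the final form is needed instead of the isolated
 * form.  Hypothetical illustration only.
 */
uint32_t
lam_alef_ligature(uint32_t alef, int lam_joins_prev)
{
	uint32_t lig;

	switch (alef) {
	case 0x0622:		/* ALEF WITH MADDA ABOVE */
		lig = 0xFEF5;
		break;
	case 0x0623:		/* ALEF WITH HAMZA ABOVE */
		lig = 0xFEF7;
		break;
	case 0x0625:		/* ALEF WITH HAMZA BELOW */
		lig = 0xFEF9;
		break;
	case 0x0627:		/* ALEF */
		lig = 0xFEFB;
		break;
	default:
		return 0;
	}
	/* In this block, each final form directly follows its isolated form. */
	if (lam_joins_prev)
		lig++;
	assert(lig >= 0xFEF5 && lig <= 0xFEFC);
	return lig;
}
```

For plain ALEF this returns U+FEFB for an isolated "la" and U+FEFC
when the LAM joins a preceding letter, so even this one syllable
cannot be handled by a pair table without looking at its context.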

And finally, people say the situation in many Indian languages is
even more dire than in Arabic, so what our less(1) tries to do is
almost certainly completely useless for those languages, even if
we expanded the ad-hoc table.

So, i propose to delete support for combining characters into
ligatures from our less(1): at this point, it is only used for
typing at the less prompt anyway (and not for the file displayed),
only for Arabic, and only for the single ligature "la".  If we ever
want better ligature support in the future, i think we would have
to make a fresh start anyway - and i think there are many other
things to do before that.

Note that this only removes support for combining characters into
ligatures that can also stand on their own; support for purely
combining accents like U+0300 COMBINING GRAVE ACCENT and U+3099
COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK remains intact.
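
The difference between the two kinds of combining can be sketched
with a simplified model of the prompt width calculation.  This is not
the code from cmdbuf.c: prompt_width() and its helpers are
hypothetical, the composing check knows only the two marks named
above, and double-width characters are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four pairs from the removed comb_table: LAM plus an ALEF variant. */
static int
is_lam_alef_pair(uint32_t prev, uint32_t ch)
{
	return prev == 0x0644 &&
	    (ch == 0x0622 || ch == 0x0623 || ch == 0x0625 || ch == 0x0627);
}

/* Stand-in for a real table lookup: only U+0300 and U+3099 are known. */
static int
is_composing(uint32_t ch)
{
	return ch == 0x0300 || ch == 0x3099;
}

/*
 * Count the display columns of n code points in s.  If pair_ligatures
 * is nonzero, an ALEF variant directly after LAM takes no column (the
 * behaviour being removed); otherwise it counts as an ordinary
 * character.  Composing accents take no column in either mode.
 */
size_t
prompt_width(const uint32_t *s, size_t n, int pair_ligatures)
{
	size_t i, width = 0;

	assert(s != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (is_composing(s[i]))
			continue;
		if (pair_ligatures && i > 0 &&
		    is_lam_alef_pair(s[i - 1], s[i]))
			continue;
		width++;
	}
	return width;
}
```

With the input U+0644 U+0627, this model gives 1 column when
pair_ligatures is set and 2 columns when it is not, while
"e" followed by U+0300 gives 1 column in both modes.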

OK?
  Ingo


Index: charset.c
===================================================================
RCS file: /cvs/src/usr.bin/less/charset.c,v
retrieving revision 1.25
diff -u -p -r1.25 charset.c
--- charset.c   31 Aug 2019 13:44:29 -0000      1.25
+++ charset.c   31 Aug 2019 21:30:25 -0000
@@ -474,13 +474,6 @@ static struct wchar_range comp_table[] =
 };
 
 /*
- * Special pairs, not ranges.
- */
-static struct wchar_range comb_table[] = {
-       {0x0644, 0x0622}, {0x0644, 0x0623}, {0x0644, 0x0625}, {0x0644, 0x0627},
-};
-
-/*
  * Characters with general category values
  *     Cc: Other, Control
  *     Cf: Other, Format
@@ -825,22 +818,4 @@ is_wide_char(LWCHAR ch)
 {
        return (is_in_table(ch, wide_table,
            (sizeof (wide_table) / sizeof (*wide_table))));
-}
-
-/*
- * Is a character a UTF-8 combining character?
- * A combining char acts like an ordinary char, but if it follows
- * a specific char (not any char), the two combine into one glyph.
- */
-int
-is_combining_char(LWCHAR ch1, LWCHAR ch2)
-{
-       /* The table is small; use linear search. */
-       int i;
-       for (i = 0; i < sizeof (comb_table) / sizeof (*comb_table); i++) {
-               if (ch1 == comb_table[i].first &&
-                   ch2 == comb_table[i].last)
-                       return (1);
-       }
-       return (0);
 }
Index: cmdbuf.c
===================================================================
RCS file: /cvs/src/usr.bin/less/cmdbuf.c,v
retrieving revision 1.19
diff -u -p -r1.19 cmdbuf.c
--- cmdbuf.c    28 Jun 2019 13:35:01 -0000      1.19
+++ cmdbuf.c    31 Aug 2019 21:30:26 -0000
@@ -179,19 +179,10 @@ cmd_step_common(char *p, LWCHAR ch, int 
                                if (bswidth != NULL)
                                        *bswidth = prlen;
                        } else {
-                               LWCHAR prev_ch = step_char(&p, -1, cmdbuf);
-                               if (is_combining_char(prev_ch, ch)) {
-                                       if (pwidth != NULL)
-                                               *pwidth = 0;
-                                       if (bswidth != NULL)
-                                               *bswidth = 0;
-                               } else {
-                                       if (pwidth != NULL)
-                                               *pwidth = is_wide_char(ch)
-                                                   ? 2 : 1;
-                                       if (bswidth != NULL)
-                                               *bswidth = 1;
-                               }
+                               if (pwidth != NULL)
+                                       *pwidth = is_wide_char(ch) ? 2 : 1;
+                               if (bswidth != NULL)
+                                       *bswidth = 1;
                        }
                }
        }
Index: funcs.h
===================================================================
RCS file: /cvs/src/usr.bin/less/funcs.h,v
retrieving revision 1.24
diff -u -p -r1.24 funcs.h
--- funcs.h     31 Aug 2019 13:44:29 -0000      1.24
+++ funcs.h     31 Aug 2019 21:30:26 -0000
@@ -65,7 +65,6 @@ LWCHAR step_char(char **, int, char *);
 int is_composing_char(LWCHAR);
 int is_ubin_char(LWCHAR);
 int is_wide_char(LWCHAR);
-int is_combining_char(LWCHAR, LWCHAR);
 void cmd_reset(void);
 void clear_cmd(void);
 void cmd_putstr(char *);
Index: less.1
===================================================================
RCS file: /cvs/src/usr.bin/less/less.1,v
retrieving revision 1.56
diff -u -p -r1.56 less.1
--- less.1      20 Aug 2019 11:34:18 -0000      1.56
+++ less.1      31 Aug 2019 21:30:26 -0000
@@ -1804,7 +1804,7 @@ Language for determining the character s
 The character encoding
 .Xr locale 1 .
 It decides which byte sequences form characters, what their display
-width is, and which characters are composing or combining characters.
+width is, and which characters are composing characters.
 .It Ev LESS
 Options which are passed to
 .Nm
