Hi,

i have to admit that i am neither able to speak nor to write nor to understand the Arabic language nor the Arabic script, but here is my current, probably incomplete understanding of what our less(1) program is trying to do with Arabic ligatures.
If somebody is reading this who is able to read and write Arabic or an Indian language that uses ligatures heavily, feedback is highly welcome.

Arabic is a cursive script, which means that when writing Arabic, characters do not map 1:1 to glyphs. Instead, there are rules about how adjacent characters attach to each other, forming ligatures.

As an extremely simple example, consider the Arabic adverb "la", which means the same as the English adverb "no". It consists of the two letters U+0644 LAM and U+0627 ALEF, the LAM appearing before (i.e. to the right of) the ALEF. However, you do not write both letters separately. Instead, the ALEF leans forward (to the left) and attaches to the LAM, forming the glyph U+FEFB, ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM. When displayed in a fixed width font, that ligature only occupies a single display column just like any other Arabic or Latin glyph. The LAM WITH ALEF glyph is not a double-width glyph like Japanese or Chinese characters typically are.

So, when this happens, you have four bytes of UTF-8 forming two Unicode characters, and *together*, these two characters occupy only one single display column.

Note that in the default configuration, our xterm(1) is not able to display Arabic characters at all. But even when you run "xterm -fa arabic" or "xterm -fa fixed", which use FreeType support instead of the default X toolkit font support, such that xterm(1) does become able to display single Arabic characters, it still displays the word "la" incorrectly, failing to generate the required ligature and instead displaying the two characters LAM and ALEF separately.

So i installed konsole-18.12.0p1 for testing (which pulls in ridiculous amounts of dependencies, dozens of them, but oh well, i guess support for advanced Unicode features isn't trivial). The konsole(1) program does display the word "la" correctly, as a ligature.

Now, running less(1) inside konsole(1), i found that columnation is already subtly broken.
As long as the "la" ligature is visible on screen, all is fine. Now scroll to the right until the "la" appears in the first screen column. Then scroll one more column to the right by pressing "1 RIGHTARROW". Now you see *half* the ligature, i.e. an isolated ALEF, in the first column of the screen, even though the Arabic word does not contain an isolated ALEF. Besides, we just attempted to scroll the "la" off screen, so the ALEF now appears in the column one to the right of where the "la" should actually be, and all the rest of the line is shifted one column to the right, too, so columnation is now off by one. Scrolling back left, columnation recovers to correct display. I strongly suspect i broke that during my previous UTF-8 cleanup work on less(1).

However, LAM WITH ALEF is literally the only ligature that less(1) supports, together with three variations (with MADDA above, with HAMZA above, and with HAMZA below). But there are hundreds of ligatures in Arabic, see

  https://www.unicode.org/charts/PDF/UFB50.pdf
  https://www.unicode.org/charts/PDF/UFE70.pdf

I have no idea how many of those work in konsole(1) - but i'm sure none of those, except the four LAM WITH ALEF discussed here, work with less(1), so i think support for LAM WITH ALEF provided no value in the first place.

The way it is implemented, with an ad-hoc table inside less(1) of character combinations that form ligatures, is just wrong and not sustainable by any stretch of the imagination, i think. On top of that, how characters combine in Arabic is strongly context dependent; even the syllable "la" forms a different ligature depending on whether it is isolated or at the end of a longer word, and none of the context dependencies are implemented in less(1) anyway.

And finally, people say the situation in many Indian languages is even more dire than in Arabic, so what our less(1) tries to do is almost certainly completely useless for those languages, even if we expanded the ad-hoc table.
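
To make the "four bytes, two characters, one column" arithmetic and the pair-table approach concrete, here is a minimal standalone sketch. It is not the less(1) code: the names lam_alef, decode, is_pair and count_columns are hypothetical, the decoder only handles sequences of up to three bytes, and every character is assumed to be one column wide (no double-width or zero-width handling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair table: the four LAM + ALEF combinations. */
struct pair { uint32_t first, last; };
static const struct pair lam_alef[] = {
	{0x0644, 0x0622}, {0x0644, 0x0623}, {0x0644, 0x0625}, {0x0644, 0x0627},
};

/*
 * Decode one UTF-8 sequence of one to three bytes (enough for the BMP).
 * Returns the number of bytes consumed, or 0 at the end of the string
 * or on a sequence this sketch does not handle.
 */
static size_t
decode(const unsigned char *s, uint32_t *cp)
{
	if (s[0] == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
		*cp = (uint32_t)(s[0] & 0x1F) << 6 | (uint32_t)(s[1] & 0x3F);
		return 2;
	}
	if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
	    (s[2] & 0xC0) == 0x80) {
		*cp = (uint32_t)(s[0] & 0x0F) << 12 |
		    (uint32_t)(s[1] & 0x3F) << 6 | (uint32_t)(s[2] & 0x3F);
		return 3;
	}
	return 0;
}

/* Does the pair (a, b) appear in the table?  Linear search. */
static int
is_pair(uint32_t a, uint32_t b)
{
	size_t i;

	for (i = 0; i < sizeof(lam_alef) / sizeof(lam_alef[0]); i++)
		if (a == lam_alef[i].first && b == lam_alef[i].last)
			return 1;
	return 0;
}

/*
 * Count display columns, one per character.  With use_pairs set,
 * the second half of a table pair contributes no column.
 * With use_pairs clear, the result is simply the character count.
 */
size_t
count_columns(const char *s, int use_pairs)
{
	const unsigned char *p = (const unsigned char *)s;
	uint32_t prev = 0, cp;
	size_t n, cols = 0;

	assert(s != NULL);
	while ((n = decode(p, &cp)) != 0) {
		if (!(use_pairs && is_pair(prev, cp)))
			cols++;
		prev = cp;
		p += n;
	}
	return cols;
}
```

For the four bytes d9 84 d8 a7 (LAM followed by ALEF), count_columns(s, 0) gives 2 and count_columns(s, 1) gives 1. Note that the table only looks at the immediately preceding character; it knows nothing about the position in the word or about what the terminal actually renders.
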
So, i propose to delete support for combining characters into ligatures from our less(1): at this point, it is only used for typing at the less prompt anyway (and not for the file displayed), only for Arabic, and only for the single ligature "la".

If we ever want better ligature support in the future, i think we would have to make a fresh start anyway - and i think there are many other things to do before that.

Note that this only removes support for combining characters into ligatures that can also stand on their own; support for purely combining accents like U+0300 COMBINING GRAVE ACCENT and U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK remains intact.

OK?
  Ingo


Index: charset.c
===================================================================
RCS file: /cvs/src/usr.bin/less/charset.c,v
retrieving revision 1.25
diff -u -p -r1.25 charset.c
--- charset.c	31 Aug 2019 13:44:29 -0000	1.25
+++ charset.c	31 Aug 2019 21:30:25 -0000
@@ -474,13 +474,6 @@ static struct wchar_range comp_table[] =
 };
 
 /*
- * Special pairs, not ranges.
- */
-static struct wchar_range comb_table[] = {
-	{0x0644, 0x0622}, {0x0644, 0x0623}, {0x0644, 0x0625}, {0x0644, 0x0627},
-};
-
-/*
  * Characters with general category values
  * Cc: Other, Control
  * Cf: Other, Format
@@ -825,22 +818,4 @@ is_wide_char(LWCHAR ch)
 {
 	return (is_in_table(ch, wide_table,
 	    (sizeof (wide_table) / sizeof (*wide_table))));
-}
-
-/*
- * Is a character a UTF-8 combining character?
- * A combining char acts like an ordinary char, but if it follows
- * a specific char (not any char), the two combine into one glyph.
- */
-int
-is_combining_char(LWCHAR ch1, LWCHAR ch2)
-{
-	/* The table is small; use linear search. */
-	int i;
-	for (i = 0; i < sizeof (comb_table) / sizeof (*comb_table); i++) {
-		if (ch1 == comb_table[i].first &&
-		    ch2 == comb_table[i].last)
-			return (1);
-	}
-	return (0);
 }
Index: cmdbuf.c
===================================================================
RCS file: /cvs/src/usr.bin/less/cmdbuf.c,v
retrieving revision 1.19
diff -u -p -r1.19 cmdbuf.c
--- cmdbuf.c	28 Jun 2019 13:35:01 -0000	1.19
+++ cmdbuf.c	31 Aug 2019 21:30:26 -0000
@@ -179,19 +179,10 @@ cmd_step_common(char *p, LWCHAR ch, int
 				if (bswidth != NULL)
 					*bswidth = prlen;
 			} else {
-				LWCHAR prev_ch = step_char(&p, -1, cmdbuf);
-				if (is_combining_char(prev_ch, ch)) {
-					if (pwidth != NULL)
-						*pwidth = 0;
-					if (bswidth != NULL)
-						*bswidth = 0;
-				} else {
-					if (pwidth != NULL)
-						*pwidth = is_wide_char(ch)
-						    ? 2 : 1;
-					if (bswidth != NULL)
-						*bswidth = 1;
-				}
+				if (pwidth != NULL)
+					*pwidth = is_wide_char(ch) ? 2 : 1;
+				if (bswidth != NULL)
+					*bswidth = 1;
 			}
 		}
 	}
Index: funcs.h
===================================================================
RCS file: /cvs/src/usr.bin/less/funcs.h,v
retrieving revision 1.24
diff -u -p -r1.24 funcs.h
--- funcs.h	31 Aug 2019 13:44:29 -0000	1.24
+++ funcs.h	31 Aug 2019 21:30:26 -0000
@@ -65,7 +65,6 @@ LWCHAR step_char(char **, int, char *);
 int is_composing_char(LWCHAR);
 int is_ubin_char(LWCHAR);
 int is_wide_char(LWCHAR);
-int is_combining_char(LWCHAR, LWCHAR);
 void cmd_reset(void);
 void clear_cmd(void);
 void cmd_putstr(char *);
Index: less.1
===================================================================
RCS file: /cvs/src/usr.bin/less/less.1,v
retrieving revision 1.56
diff -u -p -r1.56 less.1
--- less.1	20 Aug 2019 11:34:18 -0000	1.56
+++ less.1	31 Aug 2019 21:30:26 -0000
@@ -1804,7 +1804,7 @@ Language for determining the character s
 The character encoding
 .Xr locale 1 .
 It decides which byte sequences form characters, what their display
-width is, and which characters are composing or combining characters.
+width is, and which characters are composing characters.
 .It Ev LESS
 Options which are passed to
 .Nm
