Re: Odp: Pd: Missing legacy Arabic encoding

Philippe Verdy via Unicode Wed, 06 May 2026 09:00:32 -0700

You actually don't need any new compatibility characters for Arabic
contextual forms, or for other contextual forms in other joining scripts
(like Adlam, or even Mongolian, which is an LTR script).


You just have to prepend or append a ZWJ or ZWNJ formatting control to the
unified letter if you want to override its default contextual presentation
form.
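
As a rough illustration of the point above, here is a minimal C sketch that wraps a single Arabic letter in ZWJ/ZWNJ to request a given contextual form and emits the result as UTF-8. The names `joining_form`, `utf8_encode_bmp` and `force_form` are hypothetical helpers made up for this example, and whether the requested form is actually displayed still depends on the rendering engine and font.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ZWNJ = 0x200C, ZWJ = 0x200D };

/* Hypothetical names for the four shapes of a dual-joining letter. */
typedef enum { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL } joining_form;

/* Encode one BMP code point (no surrogates) as UTF-8; returns 1..3 bytes. */
size_t utf8_encode_bmp(uint32_t cp, unsigned char *out)
{
    assert(cp <= 0xFFFF && (cp < 0xD800 || cp > 0xDFFF));
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Write <control><letter><control> in logical order; out needs 9 bytes.
 * ZWJ on a side asks the letter to join on that side, ZWNJ asks it not to.
 * A joiner before the letter gives a final or medial shape; a joiner
 * after it gives an initial or medial shape. */
size_t force_form(uint32_t letter, joining_form form, unsigned char out[9])
{
    uint32_t before = (form == FORM_MEDIAL || form == FORM_FINAL) ? ZWJ : ZWNJ;
    uint32_t after = (form == FORM_INITIAL || form == FORM_MEDIAL) ? ZWJ : ZWNJ;
    size_t n = utf8_encode_bmp(before, out);
    n += utf8_encode_bmp(letter, out + n);
    n += utf8_encode_bmp(after, out + n);
    return n;
}
```

For example, `force_form(0x0647, FORM_MEDIAL, buf)` writes ZWJ, U+0647 ARABIC LETTER HEH, ZWJ into `buf`, which asks for the medial heh without any presentation-form code point.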


On Tue, 5 May 2026 at 00:48, Asmus Freytag via Unicode <
[email protected]> wrote:

> The issue at hand is the distinction between a theoretical gap and
> real-life problem.
>
> You have demonstrated that there are specifications that, if chained in
> the right way, can lead to ambiguities or gaps in interchange.
>
> What we don't have is an actual use case with real-life consequences for a
> set of existing users, not hypothetical ones.
>
> When it comes to encoding decisions based on existing documents, there is
> a strong presumption that once sufficiently many documents exist that
> contain a character, that this character will be needed in digitizing these
> documents, whether immediately, or eventually (e.g. in the case of future
> scholarly studies). Also, the texts themselves exist, barring accidents, in
> permanence. Therefore, it is justified to consider irrevocably allocating a
> character that will map to this source in perpetuity, even though each
> encoded character carries a small cost for implementers.
>
> However, when it comes to legacy characters, there's an additional cost
> that is imposed, and that is based on the fact that characters that are
> encoded solely for compatibility will usually violate one or more of the
> other encoding principles, something that incrementally complicates the
> standard. Even for people who never intend to use that character.
>
> Therefore, the SEW is on solid ground when it demands not only a
> hypothetical scenario, but evidence of actual impact on actual users. Not
> only whether some application could invoke an API, but whether such
> applications exist and are used today to access documents encoded using the
> legacy characters in a way that is compromised irreparably by not having an
> encoding for them.
>
> A./
>
>
> On 5/4/2026 10:24 AM, [email protected] via Unicode wrote:
>
> In UTC 187 Minutes, "Asmus Freytag noted that the fact that lists of
> things existed in the past does not make these things plain text. Ned
> Holbrook pointed out that the purported issue occurs in a closed system,
> not in public interchange." However, the arguments in the proposal do
> not merely hinge on the encodings being lists of characters, but
> specifically point out methods of interchanging text, including an example
> of copying terminal output and pasting it into Notepad, where the copying
> invokes the mapping of the current terminal codepage to UCS-2 (the form
> that CHAR_INFO holds) and the pasting writes it into plain text. Win32 is
> also not a closed system, as Win32 can capture the tiles of the output of
> Windows 3.1 Arabic DOS/Win16 programs and Windows 95/98/ME Arabic
> DOS/Win16/Win32 programs, but Win32 can also interact with public text
> interchange systems by reading and writing to files and network. I'm not
> saying that Unicode absolutely must include those characters, but those
> kinds of misleading claims are causing users to misunderstand what the
> proposal is about, and I don't want Unicode to be relying on uninformed
> decisions to evaluate proposals.
>
>
> On 18 April 2026 at 13:36, [email protected] via Unicode
> <[email protected]> wrote:
>
> The SEW subsequently explained that the actual reason is
> insufficient evidence of a user community that would need to use the
> resulting mapping. Despite Win32 being a highly popular platform with
> plenty of backwards compatibility and native UCS-2 terminal support, the
> specific use cases of installing codepages into Windows NT and using
> terminal tiles from Windows 3.1/95/98/ME are not sufficiently documented,
> making it difficult for any user communities to form around it. So it seems
> like the idea of standardizing legacy Arabic terminal BMP mappings is a
> dead end for now.
>
>
> On 17 April 2026 at 22:59, [email protected] via Unicode
> <[email protected]> wrote:
>
> The Recommendations in L2/26-100 claim that Microsoft's documentation of
> legacy Arabic encodings is available at
> https://learn.microsoft.com/en-us/typography/legacy/legacy_arabic_fonts.
> However, that article only demonstrates two encodings of TrueType fonts,
> which are used in Windows 3.1 but are completely different from the eight
> terminal encodings. Unlike the TrueType encodings which represent internal
> shaping mappings and are not used for text interchange, the terminal
> encodings have been demonstrated to be directly used in text interchange
> through int 10h and ReadConsoleOutputA/WriteConsoleOutputA as already
> demonstrated in L2/26-077. The Recommendations also claim that the proposal
> does not demonstrate any need for interchange or encoding, but the proposal
> actually demonstrated such a need due to the logical extension of the Win32
> terminal API to the functions ReadConsoleOutputW/WriteConsoleOutputW, which
> are in Windows NT and may be used on the output of previously run programs
> (including those that used the legacy Arabic terminal encodings), which
> given the CHAR_INFO structure, therefore implies a need for all the tiles
> to map to BMP for interchange. I'm not objecting to the SEW's conclusion of
> "Users are expected to use PUA.", which can indeed be used to provide a
> mapping even if not standardized, but the reasoning given was flawed.
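
To make the CHAR_INFO constraint concrete, here is a small portable C sketch. `cell_info` is a hypothetical stand-in that mirrors the documented Win32 layout (a 16-bit character unioned with an 8-bit one, plus 16 bits of attributes), and `store_tile` is an illustrative helper, not a Win32 function. It only shows that a single cell can carry one 16-bit code unit, so a tile's mapping has to be a BMP code point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the Win32 CHAR_INFO layout. */
typedef struct {
    union {
        uint16_t unicode_char; /* corresponds to Char.UnicodeChar (WCHAR) */
        char ascii_char;       /* corresponds to Char.AsciiChar (CHAR) */
    } ch;
    uint16_t attributes;       /* corresponds to Attributes (WORD) */
} cell_info;

/* Store a code point in one cell. Returns false when the value cannot be
 * held by a single 16-bit unit: anything above U+FFFF would need a
 * surrogate pair, and a lone surrogate is not a character. */
bool store_tile(cell_info *cell, uint32_t code_point, uint16_t attributes)
{
    if (code_point > 0xFFFF)
        return false;
    if (code_point >= 0xD800 && code_point <= 0xDFFF)
        return false;
    cell->ch.unicode_char = (uint16_t)code_point;
    cell->attributes = attributes;
    return true;
}
```

A private-use value such as U+E818 fits in a cell, while a supplementary-plane code point such as U+1EE00 does not.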
>
>
> On 9 January 2026 at 17:25, [email protected] wrote:
>
> The following Win32 C code will output 256 characters in the system console
> codepage into the character grid, capture those character tiles in UCS-2 if
> possible, and then output the current console codepage number.
>
>
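> #include <windows.h>
> #include <stdio.h>
>
> int main(void){
>     HANDLE hConsole=GetStdHandle(STD_OUTPUT_HANDLE);
>     CHAR_INFO screen[256];
>     COORD size={16,16};            /* buffer is a 16x16 grid of cells */
>     COORD pos={0,0};               /* start at the buffer's top-left cell */
>     SMALL_RECT rect={0,0,15,15};   /* console region: left, top, right, bottom */
>
>     /* One cell per byte value 0x00..0xFF, black on bright white. */
>     for(int i=0;i<256;i++){
>         screen[i].Attributes=0xF0;
>         screen[i].Char.AsciiChar=(CHAR)i;
>     }
>     WriteConsoleOutputA(hConsole,screen,size,pos,&rect);
>
>     /* Read the same cells back through the wide API to get UCS-2 values. */
>     CHAR_INFO screenu[256];
>     if(ReadConsoleOutputW(hConsole,screenu,size,pos,&rect)){
>         for(int i=0;i<256;i++) printf("%04X ",(unsigned)screenu[i].Char.UnicodeChar);
>     }
>     else{
>         printf("error %08lX\n",(unsigned long)GetLastError());
>     }
>     printf("codepage %u",GetConsoleOutputCP());
>     return 0;
> }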
>
> In most cases, whenever a legacy Win32 codepage is used, the application
> can run on Windows NT to capture the UCS-2 mapping of those character cells
> to the BMP (although for CJK codepages a more complex setup would be
> necessary due to thousands of fullwidth characters with 2-byte sequences).
>
>
> However, in Arabic versions of Windows 9x (95/98/ME) the resulting
> character set has many presentation forms that are not in Unicode. This is
> the result when running on Windows ME: https://i.imgur.com/QFm3SkI.png in
> 10×20 font, https://i.imgur.com/KUbLQ0A.png in 10×18 font (same result
> also appears in Windows 95/98). 5×12, 7×12, 8×12, 10×18, 10×20, and 12×16
> bitmap fonts have been attested with that character set (VGAOEM.FON,
> 8514OEM.FON, DOSAPP.FON). The 10×20 font has a slightly different mapping
> than the other sizes: 0x93 is ö instead of ô, and 0x97 is missing (causing
> the following characters on the same line to be drawn at the wrong
> position). It also claims to be using codepage 720, but many characters
> differ from their CP720 mappings, including the bundled CP_720.NLS mappings
> (for example, ـ (U+0640 ARABIC TATWEEL) is 0x95 in CP720, but in the
> console 0x95 is ش instead, and the tatweel is at 0xFF). On Windows
> 9x, ReadConsoleOutputW is not supported so the UCS-2 mappings of the
> console character tiles cannot be captured (error 0x00000078
> ERROR_CALL_NOT_IMPLEMENTED).
>
>
> When that program runs on Arabic versions of Windows NT, the visual output
> is of the CP437 character set if one of the bundled bitmap fonts is used (
> https://i.imgur.com/RxjtxMH.png), or the CP720 set if Lucida Console is
> used, with the Arabic letters either having glitchy font substitution (NT
> 4.0, NT 5.0/2000) or the .notdef glyph (NT 5.1/XP and up). In fact, it
> seems that the only Arabic bitmap fonts that occur in Windows NT are CP1256
> fonts, which are not used in terminals. So this appears to be one of those
> permanent Windows compatibility regressions that occurred when Windows 9x
> ended, where the terminals can no longer render legacy Arabic text. Even if
> the user managed to use registry hacks to set the font to Courier New or
> Simplified Arabic Fixed, it would still use the CP720 mapping which is not
> compatible with the Windows 9x set.
>
>
> It appears that in the Windows 9x Arabic terminal character set, 244
> characters ( ﺀﺁﺂﺃﺄﺅﺇﺈﺊﺋﺍﺎﺏﺑﺓ►◄↕ﺕ¶§ﺗﺙ↑↓→←ﺛﹰ▲▼
> !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ﺝﺟﺡéâﺣàﺥçêëèïîﺧﺩﺫﺭﺯôﺳûùﺷﺻ£ﺿﻁﻅﻉﻊﻋﻌﻍﻎﻏﻐﻑﻓﻕﻗﻙﻛ«»ﹱ▒ﹲ│┤ﹴﹶﹷﹸ٠١٢٣ﹹﹺ┐└┴┬├─┼ﹻﹾ٤٥٦٧٨٩،ﹿﱞﱟﱠﳲﱡﳳﱢ┘┌؛؟¤ﻝﻟﻡﻣﻥﻧµﻩﻫﻬﻭﻯﻰﻱﻲﻳﳴﹼﹽﺱﺵﺹﺽﹳ°·■ـ)
> are already in Unicode, but 12 characters are not in Unicode:
>
> • 6 of them are pieces of lam-alef ligatures (0xDD, 0xDE, 0xF9, 0xFB,
> 0xFC, 0xFD)
>
> • 2 of them are shadda with fathatan ligatures without or with tatweel
> (0xD0, 0xD1)
>
> — in some legacy Microsoft fonts, shadda with fathatan is mapped to
> private use U+E818
>
> • 4 of them are disunifications of seen/sheen/sad/dad occurring either with
> or without tail
>
> — ﹳ (U+FE73 ARABIC TAIL FRAGMENT) was originally encoded in Unicode 3.2
> for CP864 compatibility; in that codepage, the forms of seen/sheen/sad/dad
> attach to the tail fragment
>
> — forms with included tail: 0x92, 0x95, 0x98, 0x8A
>
> — forms without tail (attaching to tail fragment like in CP864): 0xF3,
> 0xF4, 0xF5, 0xF6
>
>
> If someone tried to make a Win32 console implementation and tried to
> implement both Windows 9x Arabic terminal character set compatibility and
> wide string API (ReadConsoleOutputW) compatibility simultaneously, then
> they would run into the issue that there is currently no standardized
> mapping to handle that scenario. What should Windows 9x Arabic console
> compatible implementations do in that case?
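
One possible answer, following the "use PUA" conclusion quoted earlier in this thread, is a private fallback table. The C sketch below is only an illustration: `PUA_BASE` and the order of assignment are arbitrary assumptions, not a standardized or vendor mapping, and `pua_fallback` is a hypothetical helper. It covers only the eight byte values that the list above names unambiguously (the six lam-alef pieces and the two shadda-with-fathatan ligatures). The four seen/sheen/sad/dad cases are left out because the message does not pin down which byte values they are.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: start of the BMP Private Use Area, chosen arbitrarily. */
#define PUA_BASE 0xE000u

/* Byte values reported above as having no Unicode equivalent. */
static const uint8_t k_unencoded_bytes[] = {
    0xDD, 0xDE, 0xF9, 0xFB, 0xFC, 0xFD, /* pieces of lam-alef ligatures */
    0xD0, 0xD1                          /* shadda with fathatan, without/with tatweel */
};

/* Return a private-use code point for an unencoded tile, or 0 when the
 * byte is not in the list and should go through the ordinary mapping. */
uint16_t pua_fallback(uint8_t byte)
{
    size_t count = sizeof k_unencoded_bytes / sizeof k_unencoded_bytes[0];
    for (size_t i = 0; i < count; i++) {
        if (k_unencoded_bytes[i] == byte)
            return (uint16_t)(PUA_BASE + i);
    }
    return 0;
}
```

Both ends of an interchange would have to agree on such a table privately, which is the usual cost of a PUA mapping.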
