You actually don't need any new compatibility characters for Arabic contextual forms, or for contextual forms in other joining scripts (like Adlam, or even Mongolian, which is an LTR script).
You just have to prepend or append a ZWJ or ZWNJ formatting control to the unified letter if you want to override its default contextual presentation form.

On Tue, 5 May 2026 at 00:48, Asmus Freytag via Unicode <[email protected]> wrote:

> The issue at hand is the distinction between a theoretical gap and a real-life problem.
>
> You have demonstrated that there are specifications that, if chained in the right way, can lead to ambiguities or gaps in interchange.
>
> What we don't have is an actual use case with real-life consequences for a set of existing users, not hypothetical ones.
>
> When it comes to encoding decisions based on existing documents, there is a strong presumption that once sufficiently many documents exist that contain a character, this character will be needed in digitizing these documents, whether immediately or eventually (e.g. in the case of future scholarly studies). Also, the texts themselves exist, barring accidents, in permanence. Therefore, it is justified to consider irrevocably allocating a character that will map to this source in perpetuity, even though each encoded character carries a small cost for implementers.
>
> However, when it comes to legacy characters, there's an additional cost that is imposed, and that is based on the fact that characters that are encoded solely for compatibility will usually violate one or more of the other encoding principles, something that incrementally complicates the standard, even for people who never intend to use that character.
>
> Therefore, the SEW is on solid ground when it demands not only a hypothetical scenario, but evidence of actual impact on actual users: not only whether some application could invoke an API, but whether such applications exist and are used today to access documents encoded using the legacy characters in a way that is compromised irreparably by not having an encoding for them.
>
> A./
>
> On 5/4/2026 10:24 AM, [email protected] via Unicode wrote:
>
> In the UTC 187 Minutes, "Asmus Freytag noted that the fact that lists of things existed in the past does not make these things plain text. Ned Holbrook pointed out that the purported issue occurs in a closed system, not in public interchange." However, the arguments in the proposal do not merely hinge on the encodings being lists of characters, but specifically point out methods to interchange text, including an example of copying terminal output and pasting it into Notepad, where the copying invokes the mapping of the current terminal codepage to UCS-2 (as is compatible with CHAR_INFO) and the pasting writes it into plain text. Win32 is also not a closed system: it can capture the tiles of the output of Windows 3.1 Arabic DOS/Win16 programs and Windows 95/98/ME Arabic DOS/Win16/Win32 programs, and it can also interact with public text interchange systems by reading and writing to files and the network. I'm not saying that Unicode absolutely must include those characters, but those kinds of misleading claims are causing users to misunderstand what the proposal is about, and I don't want Unicode to be relying on uninformed decisions to evaluate proposals.
>
> On 18 April 2026 at 13:36, [email protected] via Unicode <[email protected]> wrote:
>
> The SEW subsequently explained that the actual reason is insufficient evidence of a user community that would need to use the resulting mapping. Despite Win32 being a highly popular platform with plenty of backwards compatibility and native UCS-2 terminal support, the specific use cases of installing codepages into Windows NT and using terminal tiles from Windows 3.1/95/98/ME are not sufficiently documented, making it difficult for any user communities to form around them.
> So it seems like the idea of standardizing legacy Arabic terminal BMP mappings is a dead end for now.
>
> On 17 April 2026 at 22:59, [email protected] via Unicode <[email protected]> wrote:
>
> The Recommendations in L2/26-100 claim that Microsoft's documentation of legacy Arabic encodings is available at https://learn.microsoft.com/en-us/typography/legacy/legacy_arabic_fonts. However, that article only demonstrates two encodings of TrueType fonts, which are used in Windows 3.1 but are completely different from the eight terminal encodings. Unlike the TrueType encodings, which represent internal shaping mappings and are not used for text interchange, the terminal encodings are directly used in text interchange through int 10h and ReadConsoleOutputA/WriteConsoleOutputA, as already demonstrated in L2/26-077. The Recommendations also claim that the proposal does not demonstrate any need for interchange or encoding, but the proposal actually demonstrated such a need: the Win32 terminal API logically extends to the functions ReadConsoleOutputW/WriteConsoleOutputW, which are in Windows NT and may be used on the output of previously run programs (including those that used the legacy Arabic terminal encodings). Given the CHAR_INFO structure, this implies a need for all the tiles to map to the BMP for interchange. I'm not objecting to the SEW's conclusion of "Users are expected to use PUA.", which can indeed be used to provide a mapping even if not standardized, but the reasoning given was flawed.
>
> On 9 January 2026 at 17:25, [email protected] <[email protected]> wrote:
>
> The following Win32 C code will output 256 characters in the system console codepage into the character grid, capture those character tiles in UCS-2 if possible, and then output the current console codepage number.
>
> #include <windows.h>
> #include <stdio.h>
>
> int main(){
>     HANDLE hConsole=GetStdHandle(STD_OUTPUT_HANDLE);
>     CHAR_INFO screen[256];
>     COORD size={16,16};            /* 16x16 grid of character cells */
>     COORD pos={0,0};
>     SMALL_RECT rect={0,0,15,15};
>     /* Fill the grid with byte values 0x00..0xFF in the console codepage. */
>     for(int i=0;i<256;i++){
>         screen[i].Attributes=0xF0;
>         screen[i].Char.AsciiChar=(CHAR)i;
>     }
>     WriteConsoleOutputA(hConsole,screen,size,pos,&rect);
>     /* Read the same cells back through the wide (UCS-2) API. */
>     CHAR_INFO screenu[256];
>     if(ReadConsoleOutputW(hConsole,screenu,size,pos,&rect)){
>         for(int i=0;i<256;i++) printf("%04X ",screenu[i].Char.UnicodeChar);
>     }
>     else{
>         /* GetLastError returns a DWORD (unsigned long), hence %lX. */
>         printf("error %08lX\n",GetLastError());
>     }
>     printf("codepage %u",GetConsoleOutputCP());
>     return 0;
> }
>
> In most cases, whenever a legacy Win32 codepage is used, the application can run on Windows NT to capture the UCS-2 mapping of those character cells to the BMP (although for CJK codepages a more complex setup would be necessary due to thousands of fullwidth characters with 2-byte sequences).
>
> However, in Arabic versions of Windows 9x (95/98/ME) the resulting character set has many presentation forms that are not in Unicode. This is the result when running on Windows ME: https://i.imgur.com/QFm3SkI.png in 10×20 font, https://i.imgur.com/KUbLQ0A.png in 10×18 font (the same result also appears in Windows 95/98). 5×12, 7×12, 8×12, 10×18, 10×20, and 12×16 bitmap fonts have been attested with that character set (VGAOEM.FON, 8514OEM.FON, DOSAPP.FON). The 10×20 font has a slightly different mapping than the other sizes: 0x93 is ö instead of ô, and 0x97 is missing (causing the following characters on the same line to be drawn at the wrong position). The console also claims to be using codepage 720, but many characters differ from their CP720 mappings, including the bundled CP_720.NLS mappings (for example, ـ (U+0640 ARABIC TATWEEL) is 0x95 in CP720, but in the console 0x95 is ش instead, and the tatweel is at 0xFF).
> On Windows 9x, ReadConsoleOutputW is not supported, so the UCS-2 mappings of the console character tiles cannot be captured (error 0x00000078 ERROR_CALL_NOT_IMPLEMENTED).
>
> When that program runs on Arabic versions of Windows NT, the visual output is of the CP437 character set if one of the bundled bitmap fonts is used (https://i.imgur.com/RxjtxMH.png), or the CP720 set if Lucida Console is used, with the Arabic letters either having glitchy font substitution (NT 4.0, NT 5.0/2000) or the .notdef glyph (NT 5.1/XP and up). In fact, it seems that the only Arabic bitmap fonts that occur in Windows NT are CP1256 fonts, which are not used in terminals. So this appears to be one of those permanent Windows compatibility regressions that occurred when Windows 9x ended, where the terminals can no longer render legacy Arabic text. Even if the user managed to use registry hacks to set the font to Courier New or Simplified Arabic Fixed, it would still use the CP720 mapping, which is not compatible with the Windows 9x set.
>
> It appears that in the Windows 9x Arabic terminal character set, 244 characters ( ﺀﺁﺂﺃﺄﺅﺇﺈﺊﺋﺍﺎﺏﺑﺓ►◄↕ﺕ¶§ﺗﺙ↑↓→←ﺛﹰ▲▼ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ﺝﺟﺡéâﺣàﺥçêëèïîﺧﺩﺫﺭﺯôﺳûùﺷﺻ£ﺿﻁﻅﻉﻊﻋﻌﻍﻎﻏﻐﻑﻓﻕﻗﻙﻛ«»ﹱ▒ﹲ│┤ﹴﹶﹷﹸ٠١٢٣ﹹﹺ┐└┴┬├─┼ﹻﹾ٤٥٦٧٨٩،ﹿﱞﱟﱠﳲﱡﳳﱢ┘┌؛؟¤ﻝﻟﻡﻣﻥﻧµﻩﻫﻬﻭﻯﻰﻱﻲﻳﳴﹼﹽﺱﺵﺹﺽﹳ°·■ـ) are already in Unicode, but 12 characters are not in Unicode:
>
> • 6 of them are pieces of lam-alef ligatures (0xDD, 0xDE, 0xF9, 0xFB, 0xFC, 0xFD)
>
> • 2 of them are shadda with fathatan ligatures without or with tatweel (0xD0, 0xD1)
>   - in some legacy Microsoft fonts, shadda with fathatan is mapped to private use U+E818
>
> • 4 of them are disunifications of seen/sheen/sad/dad occurring either with or without tail
>   - ﹳ (U+FE73 ARABIC TAIL FRAGMENT) was originally encoded in Unicode 3.2 for CP864 compatibility; in that codepage, the forms of seen/sheen/sad/dad attach to the tail fragment
>   - forms with included tail: 0x92, 0x95, 0x98, 0x8A
>   - forms without tail (attaching to the tail fragment like in CP864): 0xF3, 0xF4, 0xF5, 0xF6
>
> If someone tried to make a Win32 console implementation that supports both Windows 9x Arabic terminal character set compatibility and wide string API (ReadConsoleOutputW) compatibility simultaneously, they would run into the issue that there is currently no standardized mapping to handle that scenario. What should Windows 9x Arabic console compatible implementations do in that case?
