Re: Odp: Pd: Missing legacy Arabic encoding

[email protected] via Unicode Wed, 06 May 2026 09:44:42 -0700
Have you read the L2/26-077 proposal? Using ZWJ or ZWNJ would not work for
compatibility purposes at all, as the proposal already explains. This is
because each ZWJ or ZWNJ would take the space of one character tile in the
CHAR_INFO structure.

Suppose that you're trying to map 0xD0 from FP164 to the sequence U+FE7C
U+200D U+064B (ﹼ‍ً). The legacy application fills the 80×25 screen with 0xD0
tiles. You then capture the tiles from a Win32 program by calling
ReadConsoleOutputA with an 80×25 buffer of 2000 tiles. This succeeds and
captures 0xD0 into every tile. Next you try to capture the tiles by calling
ReadConsoleOutputW with an 80×25 buffer. Each sequence U+FE7C U+200D U+064B
would need three CHAR_INFO structures, meaning 6000 structures for the whole
screen, but the 80×25 buffer only has room for 2000 of them (one per character
tile). Since CHAR_INFO stores a 16-bit character code, by the same logic the
compatibility characters would have to be in the BMP for this to work.

In Windows 95 Vietnamese and Windows 95 Thai, there are cases where one
character tile takes multiple CHAR_INFO structures, which makes the visual
width smaller than the logical width. When that happens, the remaining space
at the end of the line is left blank, which allows the CP1258/CP874 combining
characters in those systems to map 1:1 to their Unicode equivalents. Windows
3.1/95/98/ME Arabic don't work that way and don't use combining characters or
ZWJ sequences, so visual width is always equal to logical width, each
character tile maps 1:1 to a CHAR_INFO structure, and characters may fill the
entire line. That would be impossible if some of those characters were mapped
to composition sequences or non-BMP characters.

Since there is currently insufficient evidence of a user community that would
need those mappings, there are no plans to add those characters to Unicode.
The only way for ReadConsoleOutputW to work properly in this case is therefore
to use agreed-upon private
use mappings for those compatibility characters.

On 6 May 2026 18:06, Philippe Verdy via Unicode <[email protected]> wrote:

You actually don't need any new compatibility characters for Arabic contextual
forms, or for other contextual forms in other joining scripts (like Adlam, or
even Mongolian, which is an LTR script). You just have to prepend or append a
ZWJ or ZWNJ formatting control to the unified letter if you want to override
its default contextual presentation form.

On Tue, 5 May 2026, 00:48, Asmus Freytag via Unicode <[email protected]> wrote:

The issue at
hand is the distinction between a theoretical gap and real-life problem.

You have demonstrated that there are specifications that, if chained in the
right way, can lead to ambiguities or gaps in interchange.

What we don't have is an actual use case with real-life consequences for a set
of existing users, not hypothetical ones.

When it comes to encoding decisions based on existing documents, there is a
strong presumption that once sufficiently many documents exist that contain a
character, that this character will be needed in digitizing these documents,
whether immediately, or eventually (e.g. in the case of future scholarly
studies). Also, the texts themselves exist, barring accidents, in permanence.
Therefore, it is justified to consider irrevocably allocating a character that
will map to this source in perpetuity, even though each encoded character
carries a small cost for implementers.

However, when it comes to legacy characters, there's an additional cost that
is imposed, and that is based on the fact that characters that are encoded
solely for compatibility will usually violate one or more of the other
encoding principles, something that incrementally complicates the standard.
Even for people who never intend to use that character.

Therefore, the SEW is on solid ground when it demands not only a hypothetical
scenario, but evidence of actual impact on actual users. Not only whether some
application could invoke an API, but whether such applications exist and are
used today to access documents encoded using the legacy characters in a way
that is compromised irreparably by
not having an encoding for them.

A./

On 5/4/2026 10:24 AM, [email protected] via Unicode wrote:

In UTC 187 Minutes, "Asmus Freytag noted that the fact that lists of things
existed in the past does not make these things plain text. Ned Holbrook
pointed out that the purported issue occurs in a closed system, not in public
interchange." However, the arguments in the proposal do not merely hinge on
the encodings being lists of characters. They specifically point out methods
of interchanging text, including an example of copying terminal output and
pasting it into Notepad, where the copying invokes the mapping of the current
terminal codepage to UCS-2 (which is compatible with CHAR_INFO) and the
pasting writes it into plain text. Win32 is also not a closed system: it can
capture the tiles of the output of Windows 3.1 Arabic DOS/Win16 programs and
Windows 95/98/ME Arabic DOS/Win16/Win32 programs, and it can also interact
with public text interchange systems by reading from and writing to files and
the network. I'm not saying that Unicode absolutely must include those
characters, but those kinds of misleading claims are causing users to
misunderstand what the proposal is about, and I don't want Unicode to be
relying on uninformed decisions to evaluate proposals.

On 18 April 2026 13:36, [email protected] via Unicode <[email protected]> wrote:

The SEW subsequently explained that the actual reason is due
to insufficient evidence of a user community that would need to use the
resulting mapping. Despite Win32 being a highly popular platform with plenty
of backwards compatibility and native UCS-2 terminal support, the specific use
cases of installing codepages into Windows NT and using terminal tiles from
Windows 3.1/95/98/ME are not sufficiently documented, making it difficult for
any user communities to form around them. So it seems like the idea of
standardizing legacy Arabic terminal BMP mappings is a dead end for now.

On 17 April 2026 22:59, [email protected] via Unicode <[email protected]> wrote:

The Recommendations in L2/26-100 claim that Microsoft's
documentation of legacy Arabic encodings is available at
https://learn.microsoft.com/en-us/typography/legacy/legacy_arabic_fonts .
However, that article only demonstrates two encodings of TrueType fonts, which
are used in Windows 3.1 but are completely different from the eight terminal
encodings. Unlike the TrueType encodings, which represent internal shaping
mappings and are not used for text interchange, the terminal encodings are
used directly in text interchange through int 10h and
ReadConsoleOutputA/WriteConsoleOutputA, as already demonstrated in L2/26-077.
The Recommendations also claim that the proposal does not demonstrate any need
for interchange or encoding, but the proposal did demonstrate such a need: the
Win32 terminal API extends logically to the functions
ReadConsoleOutputW/WriteConsoleOutputW, which exist in Windows NT and may be
used on the output of previously run programs (including those that used the
legacy Arabic terminal encodings), which, given the CHAR_INFO structure,
implies a need for all the tiles to map to the BMP for interchange. I'm not
objecting to the SEW's conclusion of "Users are expected to use PUA.", which
can indeed be used to provide a mapping even if not standardized, but the
reasoning given was flawed.

On 9 January 2026 17:25, [email protected] <[email protected]> wrote:

The
following Win32 C code writes all 256 byte values in the system console
codepage into the character grid, captures those character tiles as UCS-2 if
possible, and then prints the current console codepage number.

    #include <windows.h>
    #include <stdio.h>

    int main(void){
        HANDLE hConsole=GetStdHandle(STD_OUTPUT_HANDLE);
        CHAR_INFO screen[256];
        COORD size={16,16};
        COORD pos={0,0};
        SMALL_RECT rect={0,0,15,15};
        /* One tile per byte value 0x00..0xFF, black on white. */
        for(int i=0;i<256;i++){
            screen[i].Attributes=0xF0;
            screen[i].Char.AsciiChar=(CHAR)i;
        }
        WriteConsoleOutputA(hConsole,screen,size,pos,&rect);
        /* Read the same 16x16 block back as 16-bit code units. */
        CHAR_INFO screenu[256];
        if(ReadConsoleOutputW(hConsole,screenu,size,pos,&rect)){
            for(int i=0;i<256;i++) printf("%04X ",screenu[i].Char.UnicodeChar);
        }
        else{
            printf("error %08lX\n",GetLastError());
        }
        printf("codepage %u",GetConsoleOutputCP());
        return 0;
    }

In most cases,
whenever a legacy Win32 codepage is used, the application can be run on
Windows NT to capture the UCS-2 mapping of those character cells to the BMP
(although CJK codepages would need a more complex setup because of their
thousands of fullwidth characters with 2-byte sequences).

However, in Arabic versions of Windows 9x (95/98/ME) the resulting character
set has many presentation forms that are not in Unicode. This is the result
when running on Windows ME: https://i.imgur.com/QFm3SkI.png in the 10×20 font,
https://i.imgur.com/KUbLQ0A.png in the 10×18 font (the same result also
appears in Windows 95/98). 5×12, 7×12, 8×12, 10×18, 10×20, and 12×16 bitmap
fonts have been attested with that character set (VGAOEM.FON, 8514OEM.FON,
DOSAPP.FON). The 10×20 font has a slightly different mapping from the other
sizes: 0x93 is ö instead of ô, and 0x97 is missing (causing the following
characters on the same line to be drawn at the wrong position). The console
also claims to be using codepage 720, but many characters differ from their
CP720 mappings, including the bundled CP_720.NLS mappings (for example, ـ
(U+0640 ARABIC TATWEEL) is 0x95 in CP720, but in the console 0x95 is ش
instead, and the tatweel is at 0xFF). On Windows 9x, ReadConsoleOutputW is not
supported, so the UCS-2 mappings of the console character tiles cannot be
captured (error 0x00000078 ERROR_CALL_NOT_IMPLEMENTED).

When that program runs on Arabic versions of
Windows NT, the visual output is the CP437 character set if one of the bundled
bitmap fonts is used (https://i.imgur.com/RxjtxMH.png), or the CP720 set if
Lucida Console is used, with the Arabic letters shown either through glitchy
font substitution (NT 4.0, NT 5.0/2000) or as the .notdef glyph (NT 5.1/XP and
up). In fact, it seems that the only Arabic bitmap fonts that occur in Windows
NT are CP1256 fonts, which are not used in terminals. So this appears to be
one of those permanent Windows compatibility regressions that occurred when
Windows 9x ended, where the terminals can no longer render legacy Arabic text.
Even if the user managed to use registry hacks to set the font to Courier New
or Simplified Arabic Fixed, it would still use the CP720 mapping, which is not
compatible with the Windows 9x set.

It appears that in the
Windows 9x Arabic terminal character set, 244 characters ( 
ﺀﺁﺂﺃﺄﺅﺇﺈﺊﺋﺍﺎﺏﺑﺓ►◄↕ﺕ¶§ﺗﺙ↑↓→←ﺛﹰ▲▼ 
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ﺝﺟﺡéâﺣàﺥçêëèïîﺧﺩﺫﺭﺯôﺳûùﺷﺻ£ﺿﻁﻅﻉﻊﻋﻌ
 are already in Unicode, but 12 characters are not in Unicode:

• 6 of them are pieces of lam-alef ligatures (0xDD, 0xDE, 0xF9, 0xFB, 0xFC,
  0xFD)
• 2 of them are shadda with fathatan ligatures, without or with tatweel (0xD0,
  0xD1)
  — in some legacy Microsoft fonts, shadda with fathatan is mapped to private
    use U+E818
• 4 of them are disunifications of seen/sheen/sad/dad, occurring either with
  or without tail
  — ﹳ (U+FE73 ARABIC TAIL FRAGMENT) was originally encoded in Unicode 3.2 for
    CP864 compatibility; in that codepage, the forms of seen/sheen/sad/dad
    attach to the tail fragment
  — forms with included tail: 0x92, 0x95, 0x98, 0x8A
  — forms without tail (attaching to tail fragment like in CP864): 0xF3, 0xF4,
    0xF5, 0xF6

If someone wrote a Win32 console implementation that supported both the
Windows 9x Arabic terminal character set and the wide-string API
(ReadConsoleOutputW) at the same time, they would run into the problem that
there is currently no standardized mapping for that scenario. What should
Windows 9x Arabic console-compatible implementations do in that case?