RE: Odp: Pd: Missing legacy Arabic encoding

[email protected] via Unicode Thu, 07 May 2026 12:47:14 -0700
On 7 May 2026 at 20:46, Peter Constable via Unicode <[email protected]> wrote:

> This is not of any interest, because the current Microsoft company is not
> even close ideologically to the past version of Microsoft that originally
> made the Arabic terminals; in fact, they're not even ideologically
> compatible with each other.

Of course it might not be of interest to you. You don't seem interested in
others' reasoning about this unless it aligns to how you are thinking about it.

I didn't think that the evidence of user community in the proposal was
insufficient at first, but once that was explained to me, I accepted that
reasoning. In fact, I evaluate users' feedback all the time when I develop
fonts, other software, and proposals, and I make changes to fit new evidence.
When you brought up a hypothetical Microsoft voting result, I didn't think you
were making any argument at all, because back then you hadn't explained why
specifically Microsoft would vote that way or how that specifically relates to
the contents of the proposal, making it not a constructive claim at first.

One point about
current MS not being even close "ideologically" to past MS (very long past)
that _I_ think is of interest is that, for UTC to decide to add new legacy
characters, one consideration is what the new characters will entail for
product support and interoperability. There's a cost / benefit analysis to be
done. In the 1980s, there was a reason why legacy implementations made sense.
But today, the cost / benefit analysis doesn't weigh at all in favour of
encoding new legacy characters: Microsoft has been supporting Unicode in
products for over 30 years, and in all that time there hasn't been any
identifiable customer need for encoding these additional legacy Arabic text
elements as separate characters, i.e., benefits are extremely low to zero.
But the costs for the new characters would certainly not be zero.

You might think this is a false argument. Others might think otherwise.

By false arguments I meant things like claiming that the characters can be
represented with ZWJ/ZWNJ sequences, or the claim that Microsoft's
documentation of legacy Arabic encodings is available at
https://learn.microsoft.com/en-us/typography/legacy/legacy_arabic_fonts , which
I had already debunked. Now that you explained the reasoning involving cost to
benefit ratio, you've just expressed a legitimate argument, which is
similar to what Asmus Freytag had said.

Peter

From: [email protected] <[email protected]>
Sent: Thursday, May 7, 2026 9:03 AM
To: Peter Constable via Unicode <[email protected]>; Peter Constable
<[email protected]>; Philippe Verdy <[email protected]>
Subject: RE: Odp: Pd: Missing legacy Arabic encoding

This is not of any interest, because
the current Microsoft company is not even close ideologically to the past
version of Microsoft that originally made the Arabic terminals; in fact,
they're not even ideologically compatible with each other. There's no
need to speculate on hypothetical voting scenarios involving giant companies.
What matters are logical arguments involving encoding policy, and the
prevailing reason is that there is insufficient evidence of a user community
that would need to interchange text from those platforms into UCS-2 terminals.
It is still necessary to debunk false arguments in order to prevent future
decisions from being made incorrectly, because even if they wouldn't affect
the outcome of this proposal, they could still improperly influence the
evaluation of future proposals.

On 7 May 2026 at 17:36, Peter Constable via Unicode <[email protected]> wrote:

Just in case it might be of interest, if a motion on this proposal were raised
in a UTC meeting, I suspect Microsoft would vote against encoding.

Peter

From: Unicode <[email protected]> On Behalf Of [email protected] via Unicode
Sent: May 6, 2026 11:09 AM
To: Philippe Verdy via Unicode <[email protected]>; Philippe Verdy
<[email protected]>
Subject: Re: Odp: Pd: Missing legacy Arabic encoding

The ReadConsoleOutputW function, by definition, captures the tiles into the 
lpBuffer, which is a random access array of CHAR_INFO structure, whose 
horizontal and vertical size is specified in dwBufferSize. Since lpBuffer is 
random access, this implies one CHAR_INFO structure per character tile. The API 
therefore fundamentally imposes a strict memory layout that cannot be violated. 
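
As a rough illustration of that layout, here is a minimal sketch in portable C. It assumes the row-major indexing that ReadConsoleOutputW documents for lpBuffer. The Tile type is a hypothetical stand-in for the 4-byte CHAR_INFO structure (so the snippet compiles without windows.h), and tile_index and cells_needed are illustrative helper names, not part of any Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CHAR_INFO: one 16-bit character code plus
   one 16-bit attribute word, 4 bytes per character tile. */
typedef struct {
    uint16_t ch;    /* plays the role of Char.UnicodeChar */
    uint16_t attr;  /* plays the role of Attributes */
} Tile;

static_assert(sizeof(Tile) == 4, "one tile must occupy exactly 4 bytes");

/* Row-major offset of tile (x, y) in a buffer that is `width` tiles wide. */
static size_t tile_index(size_t x, size_t y, size_t width) {
    return y * width + x;
}

/* Number of 16-bit cells that would be needed if every tile expanded to
   `units_per_tile` UTF-16 code units. */
static size_t cells_needed(size_t width, size_t height, size_t units_per_tile) {
    return width * height * units_per_tile;
}
```

For an 80×25 screen, tile_index(79, 24, 80) gives 1999, the last of 2000 slots, while cells_needed(80, 25, 3) gives 6000. That is the overflow described elsewhere in this thread for three-code-unit sequences.
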
The fact that some Unix-like terminals such as Windows Terminal may support 
features outside the scope of the CHAR_INFO structure for compatibility with 
ANSI escape codes or WSL programs does not invalidate the compatibility 
considerations for legacy DOS/Win16/Win32 programs that require all character 
tiles to fit in the CHAR_INFO structure for random access, because 4 byte 
CHAR_INFO structure of Win32 is intended to be a fully backwards compatible 
extension of the 2 byte VGA text mode tile structure of DOS/Win16.

On 6 May 2026 at 19:58, Philippe Verdy via Unicode <[email protected]> wrote:

Windows can use other ways to map 16-bit codes in its *legacy* Console buffer
(using the old CHAR_INFO structure); it can perfectly well use compatibility
characters, or PUAs of the BMP, internally and still present an API that
exposes conforming sequences. You're talking about an old implementation
that was built long before the Arabic script was extended (and before newer
scripts using contextual joining behaviors that have never been part of the
BMP, such as Adlam, and other scripts like Mongolian that also may need such
sequences with ZWJ/ZWNJ controls, or with other formatting characters like
those specific to Mongolian like FVS1...FVS4 and MVS, or those common to many
Brahmic scripts, that the *legacy* Console did not support).

The *legacy* console was not built to support more than one plane (including
many CJK characters). The newer console can!

On Wed, 6 May 2026 at 18:37, [email protected] <[email protected]> wrote:

Have you read
the L2/26-077 proposal? Using ZWJ or ZWNJ would not work for the compatibility 
purposes at all as already explained in the proposal. This is because ZWJ or 
ZWNJ would take the space of one character tile in the CHAR_INFO structure. 
Suppose that you're trying to map 0xD0 from FP164 to a sequence of U+FE7C
U+200D U+064B ( ﹼ‍ً ). The legacy application fills the 80×25 screen with all 
0xD0 tiles. You subsequently try to capture the tiles with a Win32 program by 
using ReadConsoleOutputA into an 80×25 buffer of 2000 tiles. This succeeds and 
captures 0xD0 into all the tiles. You then try to capture the tiles using 
ReadConsoleOutputW into an 80×25 buffer. Each sequence U+FE7C U+200D U+064B 
would take a sequence of three CHAR_INFO structures to store, meaning 6000 such 
structures for the whole screen. But the 80×25 buffer has only room for 2000 
instances of the structure (one per character tile). Since CHAR_INFO stores 
16-bit character code, by that same logic the compatibility characters would 
have to be in BMP for it to work. In Windows 95 Vietnamese and Windows 95 Thai, 
there are instances where one character tile takes multiple CHAR_INFO 
structures, causing visual width to be smaller than logical width, and when 
that happens, the remaining space at the end of the line is left blank, 
allowing for CP1258/CP874 combining characters in those systems to map 1:1 to 
their Unicode equivalents. Windows 3.1/95/98/ME Arabic don't work that way
and don't use combining characters or ZWJ sequences, so visual width is
always equivalent to logical width, each character tile maps 1:1 to a CHAR_INFO 
structure and all characters may fill the entire line, which would be 
impossible if some of those characters were mapped to composition sequences or 
non-BMP characters. Since there is currently no sufficient evidence of user 
community that would need to use those mappings, there are no plans for those 
characters to be added to Unicode, and therefore the only solution for the 
ReadConsoleOutputW to work properly in this case is to use agreed upon private 
use mappings for those compatibility characters.

On 6 May 2026 at 18:06, Philippe Verdy via Unicode <[email protected]> wrote:

You actually don't need any new compatibility characters for Arabic
contextual forms, or for other contextual forms in other joining scripts (like
Adlam, or even Mongolian, which is an LTR script).

You just have to prepend or append a ZWJ or ZWNJ formatting control to the
unified letter if you want to override its default contextual presentation
form.

On Tue, 5 May 2026 at 00:48, Asmus Freytag via Unicode <[email protected]>
wrote:

The
issue at hand is the distinction between a theoretical gap and real-life 
problem.  You have demonstrated that there are specifications that, if chained 
in the right way, can lead to ambiguities or gaps in interchange.  What we 
don't have is an actual use case with real-life consequences for a set of
existing users, not hypothetical ones.  When it comes to encoding decisions 
based on existing documents, there is a strong presumption that once 
sufficiently many documents exist that contain a character, that this character 
will be needed in digitizing these documents, whether immediately, or 
eventually (e.g. in the case of future scholarly studies). Also, the texts 
themselves exist, barring accidents, in permanence. Therefore, it is justified 
to consider irrevocably allocating a character that will map to this source in 
perpetuity, even though each encoded character carries a small cost for 
implementers.

However, when it comes to legacy characters, there's an
additional cost that is imposed, and that is based on the fact that characters 
that are encoded solely for compatibility will usually violate one or more of 
the other encoding principles, something that incrementally complicates the 
standard. Even for people who never intend to use that character.   Therefore, 
the SEW is on solid ground when it demands not only a hypothetical scenario, 
but evidence of actual impact on actual users. Not only whether some 
application could invoke an API, but whether such applications exist and are 
used today to access documents encoded using the legacy characters in a way 
that is compromised irreparably by not having an encoding for them.

A./

On 5/4/2026 10:24 AM, [email protected] via Unicode wrote:

In UTC 187 Minutes, "Asmus Freytag noted that the fact that lists of things
existed in the past does not make these things plain text. Ned Holbrook
pointed out that the purported issue occurs in a closed system, not in public
interchange." However, the arguments in the proposal do not merely hinge on
the encodings being lists of characters, but specifically point out methods to
interchange text, including an example of copying terminal output and pasting
to Notepad, where the copying invokes the mapping of the current terminal
codepage to UCS-2 (which is compatible with CHAR_INFO) and the pasting writes
it into plain text. Win32 is also not a closed system: Win32 can capture the
tiles of the output of Windows 3.1 Arabic DOS/Win16 programs and Windows
95/98/ME Arabic DOS/Win16/Win32 programs, and Win32 can also interact with
public text interchange systems by reading and writing to files and the
network. I'm not saying that Unicode absolutely must include those characters,
but those kinds of misleading claims are causing users to misunderstand what
the proposal is about, and I don't want Unicode to be relying on uninformed
decisions to evaluate proposals.

On 18 April 2026 at 13:36, [email protected] via Unicode <[email protected]>
wrote:

The SEW subsequently
explained that the actual reason is due to insufficient evidence of user 
community that would need to use the resulting mapping. Despite Win32 being a 
highly popular platform with plenty of backwards compatibility and native UCS-2 
terminal support, the specific use cases of installing codepages into Windows 
NT and using terminal tiles from Windows 3.1/95/98/ME are not sufficiently 
documented, making it difficult for any user communities to form around it. So 
it seems like the idea of standardizing legacy Arabic terminal BMP mappings is 
a dead end for now.

On 17 April 2026 at 22:59, [email protected] via Unicode <[email protected]>
wrote:

The Recommendations in L2/26-100 claim that Microsoft's documentation of
legacy Arabic encodings is available at
https://learn.microsoft.com/en-us/typography/legacy/legacy_arabic_fonts .
However, that article only demonstrates two encodings of TrueType fonts, which 
are used in Windows 3.1 but are completely different from the eight terminal 
encodings. Unlike the TrueType encodings which represent internal shaping 
mappings and are not used for text interchange, the terminal encodings have 
been demonstrated to be directly used in text interchange through int 10h and 
ReadConsoleOutputA/WriteConsoleOutputA as already demonstrated in L2/26-077. 
The Recommendations also claim that the proposal does not demonstrate any need 
for interchange or encoding, but the proposal actually demonstrated such a need 
due to the logical extension of the Win32 terminal API to the functions 
ReadConsoleOutputW/WriteConsoleOutputW, which are in Windows NT and may be used 
on the output of previously run programs (including those that used the legacy
Arabic terminal encodings), which, given the CHAR_INFO structure, therefore
implies a need for all the tiles to map to the BMP for interchange. I'm not
objecting to the SEW's conclusion of "Users are expected to use PUA.", which
can indeed be used to provide a mapping even if not standardized, but the
reasoning given was flawed.

On 9 January 2026 at 17:25, [email protected] <[email protected]> wrote:

The following
Win32 C code will output 256 characters in system console codepage into the 
character grid, capture those character tiles in UCS-2 if possible, and then 
output the current console codepage number.

    #include <windows.h>
    #include <stdio.h>

    int main(){
        HANDLE hConsole=GetStdHandle(STD_OUTPUT_HANDLE);
        CHAR_INFO screen[256];
        COORD size={16,16,};
        COORD pos={0,0,};
        SMALL_RECT rect={0,0,15,15,};
        /* Fill a 16x16 grid with bytes 0x00..0xFF in the console codepage. */
        for(int i=0;i<256;i++){
            screen[i].Attributes=0xF0;
            screen[i].Char.AsciiChar=i;
        }
        WriteConsoleOutputA(hConsole,screen,size,pos,&rect);
        /* Read the same tiles back as 16-bit (UCS-2) codes. */
        CHAR_INFO screenu[256];
        if(ReadConsoleOutputW(hConsole,screenu,size,pos,&rect)){
            for(int i=0;i<256;i++) printf("%04X ",screenu[i].Char.UnicodeChar);
        }
        else{
            /* GetLastError returns a DWORD, so print it as unsigned long. */
            printf("error %08lX\n",GetLastError());
        }
        printf("codepage %u",GetConsoleOutputCP());
    }

In most cases, whenever
a legacy Win32 codepage is used, the application can run on Windows NT to 
capture the UCS-2 mapping of those character cells to the BMP (although for CJK 
codepages a more complex setup would be necessary due to thousands of fullwidth 
characters with 2-byte sequences).  However, in Arabic versions of Windows 9x 
(95/98/ME) the resulting character set has many presentation forms that are not 
in Unicode. This is the result when running on Windows ME:
https://i.imgur.com/QFm3SkI.png in 10×20 font,
https://i.imgur.com/KUbLQ0A.png in 10×18 font (same result also appears in
Windows 95/98). 5×12, 7×12, 8×12, 10×18, 10×20, and 12×16 bitmap fonts have 
been attested with that character set (VGAOEM.FON, 8514OEM.FON, DOSAPP.FON). 
The 10×20 font has slightly different mapping than the other sizes: 0x93 is ö 
instead of ô, and 0x97 is missing (causing the following characters on the same 
line to be drawn at the wrong position). It also claims to be using codepage 
720, but many characters differ from their CP720 mappings, including the 
bundled CP_720.NLS mappings (for example,  ـ (U+0640 ARABIC TATWEEL) is 0x95 in 
CP720, but in the console 0x95 is  ش instead, and the tatweel is at 0xFF). On 
Windows 9x, ReadConsoleOutputW is not supported so the UCS-2 mappings of the 
console character tiles cannot be captured (error 0x00000078 
ERROR_CALL_NOT_IMPLEMENTED).  When that program runs on Arabic versions of 
Windows NT, the visual output is of the CP437 character set if one of the 
bundled bitmap fonts is used (https://i.imgur.com/RxjtxMH.png),
or the CP720 set if Lucida Console is used, with the Arabic letters either 
having glitchy font substitution (NT 4.0, NT 5.0/2000) or the .notdef glyph (NT 
5.1/XP and up). In fact, it seems that the only Arabic bitmap fonts that occur 
in Windows NT are CP1256 fonts, which are not used in terminals. So this 
appears to be one of those permanent Windows compatibility regressions that 
occurred when Windows 9x ended, where the terminals can no longer render legacy
Arabic text. Even if the user managed to use registry hacks to set the font to 
Courier New or Simplified Arabic Fixed, it would still use the CP720 mapping 
which is not compatible with the Windows 9x set.  It appears that in the 
Windows 9x Arabic terminal character set, 244 characters (   
ﺀﺁﺂﺃﺄﺅﺇﺈﺊﺋﺍﺎﺏﺑﺓ►◄↕ﺕ¶§ﺗﺙ↑↓→←ﺛﹰ ▲▼  
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
 ﺝﺟﺡ éâ ﺣ à ﺥ çêëèïî ﺧﺩﺫﺭﺯ ô ﺳ ûù 
ﺷﺻ£ﺿﻁﻅﻉﻊﻋﻌﻍﻎﻏﻐﻑﻓﻕﻗﻙﻛ«»ﹱ▒ﹲ│┤ﹴﹶﹷﹸ٠١٢٣ﹹﹺ┐└┴┬├─┼ﹻﹾ٤٥٦٧٨٩،ﹿﱞﱟﱠﳲﱡﳳﱢ┘┌؛؟� µ 
ﻩﻫﻬﻭﻯﻰﻱﻲﻳﳴﹼﹽﺱﺵﺹﺽﹳ°·■ـ ) are already in Unicode, but 12 characters are not in 
Unicode:
• 6 of them are pieces of lam-alef ligatures (0xDD, 0xDE, 0xF9, 0xFB, 0xFC,
  0xFD)
• 2 of them are shadda with fathatan ligatures without or with tatweel (0xD0,
  0xD1); in some legacy Microsoft fonts, shadda with fathatan is mapped to
  private use U+E818
• 4 of them are disunifications of seen/sheen/sad/dad occurring either with or
  without tail; ﹳ (U+FE73 ARABIC TAIL FRAGMENT) was originally encoded in
  Unicode 3.2 for CP864 compatibility; in that codepage, the forms of
  seen/sheen/sad/dad attach to the tail fragment
  - forms with included tail: 0x92, 0x95, 0x98, 0x8A
  - forms without tail (attaching to tail fragment like in CP864): 0xF3, 0xF4,
    0xF5, 0xF6

If someone
tried to make a Win32 console implementation and tried to implement both 
Windows 9x Arabic terminal character set compatibility and wide string API 
(ReadConsoleOutputW) compatibility simultaneously, then they would run into the 
issue that there is currently no standardized mapping to handle that scenario. 
What should Windows 9x Arabic console compatible implementations do in that 
case?
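
One possible answer, in line with the SEW's "Users are expected to use PUA" conclusion quoted in this thread, is a privately agreed mapping table. The sketch below is only an illustration under stated assumptions: the function name win9x_arabic_to_pua and the scheme of placing each byte at U+E000 plus the byte value are invented for this example and are not an existing or proposed standard (it does not try to match the U+E818 assignment used by some legacy Microsoft fonts). It covers only the lam-alef pieces and the shadda with fathatan ligatures; the seen/sheen/sad/dad forms are left out of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Any byte value 0x00..0xFF added to U+E000 stays inside the BMP Private
   Use Area (U+E000..U+F8FF), so each result fits one 16-bit cell. */
static_assert(0xE000 + 0xFF <= 0xF8FF, "mapping must stay inside the BMP PUA");

/* Hypothetical private-use mapping for console bytes that have no Unicode
   character. Returns a BMP PUA code unit, or 0 if the byte is not in this
   illustrative table. */
static uint16_t win9x_arabic_to_pua(uint8_t b) {
    switch (b) {
    case 0xDD: case 0xDE: case 0xF9:
    case 0xFB: case 0xFC: case 0xFD:  /* pieces of lam-alef ligatures */
    case 0xD0: case 0xD1:             /* shadda with fathatan ligatures */
        return (uint16_t)(0xE000u + b);
    default:
        return 0;                     /* not handled by this sketch */
    }
}
```

Because every result is a single 16-bit code unit, a tile mapped this way still occupies exactly one CHAR_INFO slot, which is the property the ZWJ/ZWNJ sequences lack.
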