Re: Odp: Pd: Missing legacy Arabic encoding

Asmus Freytag via Unicode Mon, 04 May 2026 15:49:54 -0700

The issue at hand is the distinction between a theoretical gap and a real-life problem.

You have demonstrated that there are specifications that, if chained in the right way, can lead to ambiguities or gaps in interchange.

What we don't have is an actual use case with real-life consequences for a set of existing users, not hypothetical ones.

When it comes to encoding decisions based on existing documents, there is a strong presumption that once sufficiently many documents containing a character exist, that character will be needed in digitizing these documents, whether immediately or eventually (e.g. in the case of future scholarly studies). Also, the texts themselves exist, barring accidents, in permanence. Therefore, it is justified to consider irrevocably allocating a character that will map to this source in perpetuity, even though each encoded character carries a small cost for implementers.

However, when it comes to legacy characters, there is an additional cost: characters that are encoded solely for compatibility will usually violate one or more of the other encoding principles, which incrementally complicates the standard, even for people who never intend to use that character.

Therefore, the SEW is on solid ground when it demands not only a hypothetical scenario, but evidence of actual impact on actual users. The question is not only whether some application could invoke an API, but whether such applications exist and are used today to access documents encoded using the legacy characters in a way that is compromised irreparably by not having an encoding for them.


A./


On 5/4/2026 10:24 AM, [email protected] via Unicode wrote:

In UTC 187 Minutes, "Asmus Freytag noted that the fact that lists of things existed in the past does not make these things plain text. Ned Holbrook pointed out that the purported issue occurs in a closed system, not in public interchange." However, the arguments in the proposal do not merely hinge on the encodings being lists of characters, but specifically point out methods to interchange text, including an example of copying terminal output and pasting to Notepad, where the copying invokes the mapping of the current terminal codepage to UCS-2 (as is CHAR_INFO compatible) and the pasting writes it into plain text. Win32 is also not a closed system: Win32 can capture the tiles of the output of Windows 3.1 Arabic DOS/Win16 programs and Windows 95/98/ME Arabic DOS/Win16/Win32 programs, and it can also interact with public text interchange systems by reading and writing to files and the network. I'm not saying that Unicode absolutely must include those characters, but those kinds of misleading claims are causing users to misunderstand what the proposal is about, and I don't want Unicode to be relying on uninformed decisions to evaluate proposals.
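
The CHAR_INFO constraint referred to above can be sketched in portable C. This is only an illustration under stated assumptions: `cell_info` is a hypothetical stand-in for the Win32 CHAR_INFO layout (a 16-bit character slot plus a 16-bit attribute word), and `fits_in_cell` is a hypothetical helper, not part of any Windows API. It shows why a console cell read back through the wide-character API can only carry a single BMP code point.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for the Win32 CHAR_INFO layout:
   one 16-bit character slot plus a 16-bit attribute word. */
typedef struct {
    union {
        uint16_t unicode_char; /* plays the role of Char.UnicodeChar */
        char     ascii_char;   /* plays the role of Char.AsciiChar */
    } ch;
    uint16_t attributes;
} cell_info;

/* Returns 1 if the code point can be stored in one cell, i.e. it lies in
   the BMP and is not a surrogate code point; returns 0 otherwise. */
static int fits_in_cell(uint32_t code_point)
{
    if (code_point > 0xFFFF) return 0;                        /* beyond the BMP */
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return 0; /* surrogate */
    return 1;
}
```

Under this model, U+0640 and a private-use code point such as U+E818 both fit in a cell, while any supplementary-plane code point does not.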



    *On 18 April 2026 at 13:36* [email protected] via Unicode
    <mailto:[email protected]>< [email protected] >
    wrote:

    The SEW subsequently explained that the actual reason is
    insufficient evidence of a user community that would need to use the
    resulting mapping. Despite Win32 being a highly popular platform
    with plenty of backwards compatibility and native UCS-2 terminal
    support, the specific use cases of installing codepages into
    Windows NT and using terminal tiles from Windows 3.1/95/98/ME are
    not sufficiently documented, making it difficult for any user
    communities to form around it. So it seems like the idea of
    standardizing legacy Arabic terminal BMP mappings is a dead end
    for now.


        *On 17 April 2026 at 22:59* [email protected] via Unicode
        <mailto:[email protected]>< [email protected] >
        wrote:

        The Recommendations in L2/26-100 claim that Microsoft's
        documentation of legacy Arabic encodings is available at
        https://learn.microsoft.com/en-us/typography/legacy/legacy_arabic_fonts.
        However, that article only demonstrates two encodings of
        TrueType fonts, which are used in Windows 3.1 but are
        completely different from the eight terminal encodings. Unlike
        the TrueType encodings which represent internal shaping
        mappings and are not used for text interchange, the terminal
        encodings have been demonstrated to be directly used in text
        interchange through int 10h and
        ReadConsoleOutputA/WriteConsoleOutputA as already demonstrated
        in L2/26-077. The Recommendations also claim that the proposal
        does not demonstrate any need for interchange or encoding, but
        the proposal actually demonstrated such a need due to the
        logical extension of the Win32 terminal API to the functions
        ReadConsoleOutputW/WriteConsoleOutputW, which are in Windows
        NT and may be used on the output of previously run programs
        (including those that used the legacy Arabic terminal
        encodings). Given the CHAR_INFO structure, this implies a
        need for all the tiles to map to the BMP for interchange. I'm
        not objecting to the SEW's conclusion of
        "Users are expected to use PUA.", which can indeed be used to
        provide a mapping even if not standardized, but the reasoning
        given was flawed.


            *On 9 January 2026 at 17:25* [email protected] <
            [email protected] > wrote:

            The following Win32 C code will output 256 characters in
            the system console codepage into the character grid, capture
            those character tiles in UCS-2 if possible, and then
            output the current console codepage number.


            #include <windows.h>
            #include <stdio.h>

            int main(void){
                HANDLE hConsole=GetStdHandle(STD_OUTPUT_HANDLE);
                CHAR_INFO screen[256];
                COORD size={16,16};            /* 16x16 grid of cells */
                COORD pos={0,0};
                SMALL_RECT rect={0,0,15,15};
                /* Fill the grid with bytes 0x00..0xFF in the console codepage. */
                for(int i=0;i<256;i++){
                    screen[i].Attributes=0xF0;
                    screen[i].Char.AsciiChar=(CHAR)i;
                }
                WriteConsoleOutputA(hConsole,screen,size,pos,&rect);
                /* Read the same cells back as UCS-2 code units. */
                CHAR_INFO screenu[256];
                if(ReadConsoleOutputW(hConsole,screenu,size,pos,&rect)){
                    for(int i=0;i<256;i++)
                        printf("%04X ",(unsigned)screenu[i].Char.UnicodeChar);
                }
                else{
                    printf("error %08lX\n",(unsigned long)GetLastError());
                }
                printf("codepage %u",GetConsoleOutputCP());
                return 0;
            }

            In most cases, whenever a legacy Win32 codepage is used,
            the application can run on Windows NT to capture the UCS-2
            mapping of those character cells to the BMP (although for
            CJK codepages a more complex setup would be necessary due
            to thousands of fullwidth characters with 2-byte sequences).


            However, in Arabic versions of Windows 9x (95/98/ME) the
            resulting character set has many presentation forms that
            are not in Unicode. This is the result when running on
            Windows ME: https://i.imgur.com/QFm3SkI.png in 10×20 font,
            https://i.imgur.com/KUbLQ0A.png in 10×18 font (same result
            also appears in Windows 95/98). 5×12, 7×12, 8×12, 10×18,
            10×20, and 12×16 bitmap fonts have been attested with that
            character set (VGAOEM.FON, 8514OEM.FON, DOSAPP.FON). The
            10×20 font has slightly different mapping than the other
            sizes: 0x93 is ö instead of ô, and 0x97 is missing
            (causing the following characters on the same line to be
            drawn at the wrong position). It also claims to be using
            codepage 720, but many characters differ from their CP720
            mappings, including the bundled CP_720.NLS mappings (for
            example, ـ (U+0640 ARABIC TATWEEL) is 0x95 in CP720, but
            in the console 0x95 is ش instead, and the tatweel is at
            0xFF). On Windows 9x, ReadConsoleOutputW is not supported
            so the UCS-2 mappings of the console character tiles
            cannot be captured (error 0x00000078
            ERROR_CALL_NOT_IMPLEMENTED).


            When that program runs on Arabic versions of Windows NT,
            the visual output is of the CP437 character set if one of
            the bundled bitmap fonts is used
            (https://i.imgur.com/RxjtxMH.png), or the CP720 set if
            Lucida Console is used, with the Arabic letters either
            having glitchy font substitution (NT 4.0, NT 5.0/2000) or
            the .notdef glyph (NT 5.1/XP and up). In fact, it seems
            that the only Arabic bitmap fonts that occur in Windows NT
            are CP1256 fonts, which are not used in terminals. So this
            appears to be one of those permanent Windows compatibility
            regressions that occurred when Windows 9x ended, where the
            terminals can no longer render legacy Arabic text. Even if
            the user managed to use registry hacks to set the font to
            Courier New or Simplified Arabic Fixed, it would still use
            the CP720 mapping which is not compatible with the Windows
            9x set.


            It appears that in the Windows 9x Arabic terminal
            character set, 244 characters
            ( ﺀﺁﺂﺃﺄﺅﺇﺈﺊﺋﺍﺎﺏﺑﺓ►◄↕ﺕ¶§ﺗﺙ↑↓→←ﺛﹰ▲▼
            
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ﺝﺟﺡéâﺣàﺥçêëèïîﺧﺩﺫﺭﺯôﺳûùﺷﺻ£ﺿﻁﻅﻉﻊﻋﻌﻍﻎﻏﻐﻑﻓﻕﻗﻙﻛ«»ﹱ▒ﹲ│┤ﹴﹶﹷﹸ٠١٢٣ﹹﹺ┐└┴┬├─┼ﹻﹾ٤٥٦٧٨٩،ﹿﱞﱟﱠﳲﱡﳳﱢ┘┌؛؟¤ﻝﻟﻡﻣﻥﻧµﻩﻫﻬﻭﻯﻰﻱﻲﻳﳴﹼﹽﺱﺵﺹﺽﹳ°·■ـ)
            are already in Unicode, but 12 characters are not in Unicode:

            • 6 of them are pieces of lam-alef ligatures (0xDD, 0xDE,
            0xF9, 0xFB, 0xFC, 0xFD)

            • 2 of them are shadda with fathatan ligatures without or
            with tatweel (0xD0, 0xD1)

            — in some legacy Microsoft fonts, shadda with fathatan is
            mapped to private use U+E818

            • 4 of them are disunifications of seen/sheen/sad/dad
            occurring either with or without tail

            — ﹳ (U+FE73 ARABIC TAIL FRAGMENT) was originally encoded
            in Unicode 3.2 for CP864 compatibility; in that codepage,
            the forms of seen/sheen/sad/dad attach to the tail fragment

            — forms with included tail: 0x92, 0x95, 0x98, 0x8A

            — forms without tail (attaching to tail fragment like in
            CP864): 0xF3, 0xF4, 0xF5, 0xF6


            If someone tried to make a Win32 console implementation
            and tried to implement both Windows 9x Arabic terminal
            character set compatibility and wide string API
            (ReadConsoleOutputW) compatibility simultaneously, then
            they would run into the issue that there is currently no
            standardized mapping to handle that scenario. What should
            Windows 9x Arabic console compatible implementations do in
            that case?
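
One possible answer, in line with the "Users are expected to use PUA." conclusion quoted earlier in the thread, is a private fallback table. The following is a minimal sketch, not a standardized mapping: `PUA_BASE`, `unmapped_bytes`, and `pua_fallback` are hypothetical names, and the choice of U+E000 as the base is an assumption made only for illustration. It covers just the eight byte values listed explicitly above for the lam-alef pieces and the shadda-with-fathatan ligatures; the four seen/sheen/sad/dad forms are left out because the message does not pin down which byte values lack a Unicode equivalent.

```c
#include <stdint.h>
#include <stddef.h>

/* Byte values named in the message as having no Unicode equivalent:
   six lam-alef pieces, then two shadda-with-fathatan ligatures. */
static const uint8_t unmapped_bytes[] = {
    0xDD, 0xDE, 0xF9, 0xFB, 0xFC, 0xFD, 0xD0, 0xD1
};

/* Hypothetical base of a private agreement; any free PUA range would do. */
#define PUA_BASE 0xE000u

/* Returns a PUA code point for a listed byte, or 0 if the byte is not in
   the list and should go through the ordinary mapping table instead. */
static uint16_t pua_fallback(uint8_t byte)
{
    size_t n = sizeof unmapped_bytes / sizeof unmapped_bytes[0];
    for (size_t i = 0; i < n; i++) {
        if (unmapped_bytes[i] == byte)
            return (uint16_t)(PUA_BASE + i);
    }
    return 0;
}
```

Because every result is a single BMP code point, such a table would stay compatible with the 16-bit character slot of CHAR_INFO, but text produced this way is only interchangeable between parties that share the same private agreement.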
