On 28 April 2025 12:00 Giacomo Catenazzi via Unicode
<unicode@corp.unicode.org> wrote:

> On 2025-04-26 12:51, Eli Zaretskii via Unicode wrote:
>>> From: Dilyan Palauzov <b...@bapha.be>
>>> Cc: unicode@corp.unicode.org
>>> Date: Sat, 26 Apr 2025 12:00:50 +0300
>>>
>>>> From: Eli Zaretskii <e...@gnu.org>
>>>> Subject: Re: Is emoji +VC15, +VC16, without VC one or two columns with monospace font?🏝️
>>>> Date: 26/04/25 10:02:48
>>>>
>>>> I think you have a very outdated mental model of how the Windows console
>>>> works and how it represents and encodes characters. In particular, the
>>>> width of a character is NOT determined by the length of its byte
>>>> sequence, but by the font glyphs used to display those characters.
>>>
>>> I am confused. Does the width of an emoji/a character depend on the font
>>> (thus font designers decide this), or does it depend on
>>> EastAsianWidth.txt?
>>
>> It depends on the font, but the font is supposed to go by what
>> EastAsianWidth.txt says.
>
> It is worse. We are discussing monospaced fonts, and so terminals may
> select where to display each character (skipping all font hints),
> overwriting part of a character.

The
concept of a character grid, as originally implemented in legacy systems, 
fundamentally implies non-overlapping columns/rows of equal width/height, with 
each character cell having its character code, as well as background and 
foreground colors if supported. Therefore, some terminals will only allow 
monospaced fonts to be used (otherwise the character cell width is undefined 
and therefore the character grid cannot be rendered), and as a special case 
they may allow duospaced fonts to be used for CJK codepage compatibility. Some 
terminals may allow proportional fonts, but they will distort the font to fit 
the character grid, not the other way around.

> Also note: I do not like the Unix/non-Unix division. "Unix" terminals have
> had different interpretations. E.g. if we look at the initial Unicode
> support of xterm (the mother of many "Unix pseudoterminals"), we learn
> that it supported only "Unicode level 1" (an obsolete term from old
> versions of the Unicode standard, or just from ISO). So it interpreted
> each code point independently (so no combining code points).

Perhaps it might be better to use another term, such as
'non-random-access terminals' or 'variable-memory cell terminals', to
refer to the terminals detached from the original concept that each
character cell has a constant amount of memory associated with it.
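
To make the "constant amount of memory per cell" idea concrete, here is a minimal sketch of such a grid. The 4-byte cell layout (a 16-bit character code plus 16 bits of color attributes) and the names CELL and CharGrid are my own assumptions for illustration; they are only loosely modeled on the shape of a CHAR_INFO structure and are not taken from any particular terminal.

```python
import struct

# Assumed cell layout (hypothetical): 16-bit character code + 16-bit color attributes.
CELL = struct.Struct("<HH")


class CharGrid:
    """Character grid in which every cell occupies exactly CELL.size bytes."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.buf = bytearray(CELL.size * rows * cols)
        # Fill the grid with spaces using an arbitrary default attribute value.
        for i in range(rows * cols):
            CELL.pack_into(self.buf, i * CELL.size, 0x20, 0x07)

    def _offset(self, row, col):
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("cell outside the grid")
        # Random access: the byte offset depends only on the coordinates.
        return (row * self.cols + col) * CELL.size

    def put(self, row, col, code, attr=0x07):
        CELL.pack_into(self.buf, self._offset(row, col), code, attr)

    def get(self, row, col):
        return CELL.unpack_from(self.buf, self._offset(row, col))


grid = CharGrid(rows=25, cols=80)
grid.put(3, 10, ord("A"), 0x1F)
```

Because the offset of a cell is a pure function of its row and column, any cell can be read or overwritten without looking at its neighbors. A terminal that stores a whole grapheme cluster per cell cannot use a fixed-size layout like this one, which is the distinction the proposed terms are meant to capture.
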
> Also, good documentation on the width of characters in terminals
> (problems, solutions, and the interpretation of width in many
> implementations), from Ghostty (the new kid on the block):
> https://mitchellh.com/writing/grapheme-clusters-in-terminals
>
> giacomo

The
metric compatibility of terminals is generally a matter of backwards 
compatibility, so the behavior of legacy platforms is relevant. The use of 
wcwidth is specific to Unix-like environments. The actual origin of the 
fullwidth characters is in legacy CJK encodings, where two consecutive bytes 
are placed in two consecutive character cells. On legacy non-Unix platforms the 
width precedent is set by legacy codepages, not by the wcwidth function. For 
example, ¨ (U+00A8) is 0xF9 in CP850 (single byte, which is halfwidth), but ¨ 
(U+00A8) is 0x81 0x4E in Shift JIS/CP932 (double byte, which is fullwidth). 
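
The two encodings of U+00A8 can be checked with Python's standard cp850 and cp932 codecs. This is only a sketch: the helper name legacy_cell_width is hypothetical, and the rule "one byte per cell" is the legacy-codepage convention described above, not something that holds for arbitrary encodings such as UTF-8.

```python
DIAERESIS = "\u00a8"


def legacy_cell_width(ch, codec):
    """Cells occupied under the legacy rule of one character cell per encoded byte.

    Only meaningful for single-byte and double-byte legacy codepages.
    """
    return len(ch.encode(codec))


for codec in ("cp850", "cp932"):
    encoded = DIAERESIS.encode(codec)
    print(codec, encoded.hex(" "), legacy_cell_width(DIAERESIS, codec))
```

The same code point therefore comes out as one cell under CP850 and two cells under CP932.
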
Therefore a backwards compatible Unicode extrapolation of a legacy terminal 
would still have to vary its behavior depending on the system locale/codepage 
to remain compatible. Win32 console seems to apply codepage-specific 
compatibility: in non-CJK codepages it simply maps each non-control character 
(or UCS-2 code unit when using Unicode text) directly to its corresponding 
character cell, but in CJK codepages it uses the appropriate CJK fonts and maps 
the codepage's fullwidth characters to two consecutive character cells to 
maintain compatibility (and for bidirectional codepages it might be doing 
something else entirely). However, the string "🧑‍🌾" maps to the UTF-16
code units 0xD83E 0xDDD1 0x200D 0xD83C 0xDF3E (which the console handles as
independent UCS-2 code units), and since none of those code units has any
CJK codepage compatibility precedent, they are written directly into five
character cells regardless of the system codepage. The content of those
cells can be retrieved with the ReadConsoleOutput function (resulting in a
random-access array of CHAR_INFO structures), and this itself sets a
precedent for Win32 console compatibility. This is of course very different
from wcwidth compatibility or mode 2027 compatibility, and yet it is not
mentioned in that article. That article therefore describes the widths only
in Unix-like contexts.
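
For reference, the five 16-bit values quoted above can be reproduced with a few lines of Python. This sketch only derives the UTF-16 code units of the string; it does not call any Win32 API, so the claim that each unit lands in its own character cell rests on the console behavior described above, not on this code. The helper name utf16_code_units is hypothetical.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


# U+1F9D1 (person), U+200D (zero width joiner), U+1F33E (sheaf of rice)
farmer = "\U0001F9D1\u200D\U0001F33E"
units = utf16_code_units(farmer)
print([f"0x{u:04X}" for u in units])
```

Two of the three code points lie outside the BMP and are split into surrogate pairs, which is how three code points become five code units.
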
