Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22
tags 826256 = upstream
forwarded 826256 https://sourceware.org/bugzilla/show_bug.cgi?id=21750
thanks

OK, I’ve forwarded this upstream and refreshed the researched data:
https://sourceware.org/bugzilla/show_bug.cgi?id=21750

Thanks,
//mirabilos
-- 
tarent solutions GmbH
Rochusstraße 2-4, D-53123 Bonn • http://www.tarent.de/
Tel: +49 228 54881-393 • Fax: +49 228 54881-235
HRB 5168 (AG Bonn) • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg
Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22
Aurelien Jarno dixit:

>EastAsian.txt explicitly lists the hexagrams as neutral width, so I don't

Yes it does, but neutral does NOT always mean 1. I even looked it up
today, as I was not familiar enough with neutral yet.

>Looking at the behaviour from other systems, freebsd and netbsd both
>return -1 here, while openbsd returns 1. None of them returns 2.

-1 is utterly wrong; it’s what is returned for a control character…
so this is likely missing locale support. The last time I looked at
OpenBSD, they did not have any support for anything resembling UTF-8
either, but from what I’ve heard, they’re working on changing it.

>Therefore, can you please give a pointer explaining why the width
>should be 2 instead of 1?

I can give two pointers.

The first is the presence of 4DC0 (et al.) in src:unifont (= 1:8.0.01-1)
in font/plane00/unifont-base.hex as a fullwidth character (“wide”, in
Unicode speak; as I learnt today, UAX #11 “fullwidth” is a subset of
“wide” that only applies when it has a “<wide> decomposition”), i.e. one
with 64 nybbles, like 3000, and unlike 0041.

The second is src:xterm (= 324-2) wcwidth.c, where the function
mk_wcwidth treats codepoints 0x2E80‥0xA4CF, excluding 0x303F, as wide.

The xterm argument is actually extremely strong – it’s t̲h̲e̲ single most
widely used wcwidth() implementation, copied into lots of code that
doesn’t (or can’t) rely on the system’s implementation.

Similarly, GNU libutf8 uses (relevant part of the if construct):

	|| (c >= 0x2e80 && c < 0xa4d0 /* CJK ... Yi */
	    && !((c & ~0x0011) == 0x300a || c == 0x303f))

This is almost the same as xterm, with the additional exception of
300A, 300B, 301A, 301B added (which I contest: they are W(ide) in a
random EastAsianWidth file I’ve got lying around, but that’s a different
topic and correct in glibc already anyway).

AFAICT glibc currently has Unicode 7.0.0 data in use. When I run my
script on UCD+EAW 7.0.0, I get the following output. The format is:
bsearch form, i.e.
a list of (low, high) tuples; the code first checks for NUL, DEL, C0
and C1 control characters, then bsearches mb_ucs_combining, then
mb_ucs_fullwidth, and if it’s still not found, the width is 1
(UAX #11 “ambiguous” is assumed narrow):

static const struct mb_ucsrange mb_ucs_combining[] = {
	{ 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD }, { 0x05BF, 0x05BF },
	{ 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 }, { 0x05C7, 0x05C7 }, { 0x0600, 0x0605 },
	{ 0x0610, 0x061A }, { 0x061C, 0x061C }, { 0x064B, 0x065F }, { 0x0670, 0x0670 },
	{ 0x06D6, 0x06DD }, { 0x06DF, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
	{ 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A }, { 0x07A6, 0x07B0 },
	{ 0x07EB, 0x07F3 }, { 0x0816, 0x0819 }, { 0x081B, 0x0823 }, { 0x0825, 0x0827 },
	{ 0x0829, 0x082D }, { 0x0859, 0x085B }, { 0x08E4, 0x0902 }, { 0x093A, 0x093A },
	{ 0x093C, 0x093C }, { 0x0941, 0x0948 }, { 0x094D, 0x094D }, { 0x0951, 0x0957 },
	{ 0x0962, 0x0963 }, { 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
	{ 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 }, { 0x0A3C, 0x0A3C },
	{ 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 }, { 0x0A4B, 0x0A4D }, { 0x0A51, 0x0A51 },
	{ 0x0A70, 0x0A71 }, { 0x0A75, 0x0A75 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
	{ 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD }, { 0x0AE2, 0x0AE3 },
	{ 0x0B01, 0x0B01 }, { 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B44 },
	{ 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B62, 0x0B63 }, { 0x0B82, 0x0B82 },
	{ 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C00, 0x0C00 }, { 0x0C3E, 0x0C40 },
	{ 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 }, { 0x0C62, 0x0C63 },
	{ 0x0C81, 0x0C81 }, { 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 },
	{ 0x0CCC, 0x0CCD }, { 0x0CE2, 0x0CE3 }, { 0x0D01, 0x0D01 }, { 0x0D41, 0x0D44 },
	{ 0x0D4D, 0x0D4D }, { 0x0D62, 0x0D63 }, { 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 },
	{ 0x0DD6, 0x0DD6 }, { 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
	{ 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC }, { 0x0EC8, 0x0ECD },
	{ 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 }, { 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 },
	{ 0x0F71, 0x0F7E }, { 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F8D, 0x0F97 },
	{ 0x0F99, 0x0FBC }, {
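
The lookup order described in this message (control characters first, then the combining table, then the fullwidth table, otherwise width 1) could be sketched roughly as below. This is only an illustration under stated assumptions, not the MirBSD code: the names `sk_combining`, `sk_wide`, `in_ranges` and `sketch_wcwidth` and the struct field names are hypothetical, the combining table is cut down to the first five entries listed above, and the wide table simply encodes the xterm range 2E80‥A4CF minus 303F quoted earlier in the message.

```c
#include <stddef.h>
#include <stdlib.h>

/* field names are an assumption; the mail only shows (low, high) pairs */
struct mb_ucsrange {
	unsigned int lo;
	unsigned int hi;
};

/* excerpt only: the first five combining ranges from the mail */
static const struct mb_ucsrange sk_combining[] = {
	{ 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
	{ 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }
};

/* assumption: xterm's wide range 2E80..A4CF with 303F left out */
static const struct mb_ucsrange sk_wide[] = {
	{ 0x2E80, 0x303E }, { 0x3040, 0xA4CF }
};

#define NELEM(a) (sizeof(a) / sizeof((a)[0]))

/* bsearch comparator: key is a code point, element is a range */
static int
cmp_range(const void *key, const void *elem)
{
	unsigned int c = *(const unsigned int *)key;
	const struct mb_ucsrange *r = elem;

	if (c < r->lo)
		return (-1);
	if (c > r->hi)
		return (1);
	return (0);
}

/* returns nonzero if c falls into one of the sorted, disjoint ranges */
static int
in_ranges(unsigned int c, const struct mb_ucsrange *tab, size_t n)
{
	return (bsearch(&c, tab, n, sizeof(tab[0]), cmp_range) != NULL);
}

int
sketch_wcwidth(unsigned int c)
{
	if (c == 0)				/* NUL */
		return (0);
	if (c < 0x20 || (c >= 0x7F && c < 0xA0))	/* C0, DEL, C1 */
		return (-1);
	if (in_ranges(c, sk_combining, NELEM(sk_combining)))
		return (0);
	if (in_ranges(c, sk_wide, NELEM(sk_wide)))
		return (2);
	return (1);				/* incl. UAX #11 ambiguous */
}
```

With these excerpt tables, the five code points from the test program in the bug report (0041, 0300, 3000, 4DC0, FFFD) come out as 1, 0, 2, 2, 1, which matches the output recorded for locales 2.21-9.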
Processed: Re: Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22
Processing control commands:

> tag -1 + moreinfo
Bug #826256 [locales] locales: wrong width for hexagrams (and possibly others) in 2.22
Added tag(s) moreinfo.

-- 
826256: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22
control: tag -1 + moreinfo

On 2016-06-03 19:29, Thorsten Glaser wrote:
> Package: locales
> Version: 2.22-0experimental0
> Severity: normal
> Tags: upstream
>
> Starting with locales 2.22-0experimental0, some chars have the wrong
> width; downgrading locales to 2.21-9 fixes the bugs.
>
> Test program:
>
> tglase@tglase:~ $ cat x.c
> #define _XOPEN_SOURCE
> #include <locale.h>
> #include <stdio.h>
> #include <wchar.h>
>
> #define D(x) printf("%04X %d\n",(x),wcwidth(x))
>
> int
> main(void)
> {
> 	setlocale(LC_ALL, "");
>
> 	D(0x41);
> 	D(0x0300);
> 	D(0x3000);
> 	D(0x4DC0);
> 	D(0xFFFD);
> 	return (0);
> }
> tglase@tglase:~ $ gcc x.c
> tglase@tglase:~ $ rm -rf tloc; mkdir tloc
> tglase@tglase:~ $ localedef -i en_US -c -f UTF-8 tloc/en_US.UTF-8
> tglase@tglase:~ $ LOCPATH=$PWD/tloc LC_ALL=en_US.UTF-8 ./a.out
> 0041 1
> 0300 0
> 3000 2
> 4DC0 1
> FFFD 1
>
> Output while locales_2.21-9_all.deb was installed during localedef:
>
> tglase@tglase:~ $ LOCPATH=$PWD/tlocx LC_ALL=en_US.UTF-8 ./a.out
> 0041 1
> 0300 0
> 3000 2
> 4DC0 2
> FFFD 1
>
> This is because /usr/share/i18n/charmaps/UTF-8.gz now lacks
> entries for 4DC0‥4FFF.
>
> According to my own code implementing Unicode in another operating
> system, with focus on wcwidth(3), after parsing EastAsianWidth.txt
> special handling is needed to set widths of 0xFF00, 0x3248‥0x324F,
> and 0x4DC0‥0x4DFF to “wide”, as they’re “neutral” normally – which

EastAsian.txt explicitly lists the hexagrams as neutral width, so I
don't think there is a bug there. Versions from Unicode 3.0 and earlier
didn't specify those characters; the behaviour of glibc 2.21 probably
comes from there and is probably wrong.

Looking at the behaviour from other systems, freebsd and netbsd both
return -1 here, while openbsd returns 1. None of them returns 2.

Therefore, can you please give a pointer explaining why the width
should be 2 instead of 1?

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net
Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22
Package: locales
Version: 2.22-0experimental0
Severity: normal
Tags: upstream

Starting with locales 2.22-0experimental0, some chars have the wrong
width; downgrading locales to 2.21-9 fixes the bugs.

Test program:

tglase@tglase:~ $ cat x.c
#define _XOPEN_SOURCE
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define D(x) printf("%04X %d\n",(x),wcwidth(x))

int
main(void)
{
	setlocale(LC_ALL, "");

	D(0x41);
	D(0x0300);
	D(0x3000);
	D(0x4DC0);
	D(0xFFFD);
	return (0);
}
tglase@tglase:~ $ gcc x.c
tglase@tglase:~ $ rm -rf tloc; mkdir tloc
tglase@tglase:~ $ localedef -i en_US -c -f UTF-8 tloc/en_US.UTF-8
tglase@tglase:~ $ LOCPATH=$PWD/tloc LC_ALL=en_US.UTF-8 ./a.out
0041 1
0300 0
3000 2
4DC0 1
FFFD 1

Output while locales_2.21-9_all.deb was installed during localedef:

tglase@tglase:~ $ LOCPATH=$PWD/tlocx LC_ALL=en_US.UTF-8 ./a.out
0041 1
0300 0
3000 2
4DC0 2
FFFD 1

This is because /usr/share/i18n/charmaps/UTF-8.gz now lacks
entries for 4DC0‥4FFF.

According to my own code implementing Unicode in another operating
system, with focus on wcwidth(3), after parsing EastAsianWidth.txt
special handling is needed to set widths of 0xFF00, 0x3248‥0x324F,
and 0x4DC0‥0x4DFF to “wide”, as they’re “neutral” normally – which
can be either – but display on a fixed-width terminal is otherwise
impossible. (Chars outside the BMP were not considered – there may
be others needing such handling… personally, I’d consider that at
least all emoji need to be fullwidth, but there’s no standard backing
it yet.)

Rationale here: compatibility with wcwidth(3) implementations such
as the one in xterm. (I’ve written the code in MirBSD that generates
the data for my new wcwidth(3) implementation carefully so that – when
using the same Unicode version as Markus Kuhn did – both
implementations return the same width for all characters.)

This is especially important as I happen to use ䷐ (U+4DD0) for UI
elements, and now all I get is a half-width replacement character,
due to X11 font selection choosing the half-width font part, for
a full-width character cell with an empty right half.

-- System Information:
Debian Release: stretch/sid
  APT prefers unreleased
  APT policy: (500, 'unreleased'), (500, 'buildd-unstable'), (500, 'unstable')
Architecture: x32 (x86_64)
Foreign Architectures: i386, amd64

Kernel: Linux 4.5.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=C, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/lksh
Init: sysvinit (via /sbin/init)

Versions of packages locales depends on:
ii  debconf [debconf-2.0]  1.5.59
ii  libc-bin               2.22-10
ii  libc-l10n              2.22-10

locales recommends no packages.

locales suggests no packages.

-- debconf information:
  locales/locales_to_be_generated:
  locales/default_environment_locale: None
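
The special handling described in this report (derive the width from the EastAsianWidth class, then force 0xFF00, 0x3248‥0x324F and 0x4DC0‥0x4DFF to wide) might look roughly like the sketch below. It is a guess at the shape of such post-processing, not the actual MirBSD generator: `enum eaw_class`, `is_forced_wide` and `column_width` are hypothetical names, the EastAsianWidth property is assumed to have been parsed already, and control or combining characters are deliberately left out.

```c
/* East Asian Width classes from UAX #11 (enum names are hypothetical) */
enum eaw_class {
	EAW_N,		/* neutral */
	EAW_NA,		/* narrow */
	EAW_H,		/* halfwidth */
	EAW_A,		/* ambiguous */
	EAW_W,		/* wide */
	EAW_F		/* fullwidth */
};

/*
 * Code points the report says need an override to "wide" even though
 * the parsed data does not classify them as W or F.
 */
static int
is_forced_wide(unsigned int c)
{
	return (c == 0xFF00 ||
	    (c >= 0x3248 && c <= 0x324F) ||
	    (c >= 0x4DC0 && c <= 0x4DFF));
}

/*
 * Column width for a printable, non-combining code point, given its
 * already-parsed EastAsianWidth class: W and F take two columns, the
 * listed exceptions are forced to two, everything else takes one.
 */
int
column_width(unsigned int c, enum eaw_class eaw)
{
	if (eaw == EAW_W || eaw == EAW_F || is_forced_wide(c))
		return (2);
	return (1);
}
```

For example, `column_width(0x4DD0, EAW_N)` yields 2 under this sketch, whereas a table built from the class alone would give 1, which is the difference this report is about.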