Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22

2017-07-11 Thread Thorsten Glaser
tags 826256 = upstream
forwarded 826256 https://sourceware.org/bugzilla/show_bug.cgi?id=21750
thanks

OK, I’ve forwarded this upstream and refreshed the researched data:
https://sourceware.org/bugzilla/show_bug.cgi?id=21750

Thanks,
//mirabilos



Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22

2016-06-03 Thread Thorsten Glaser
Aurelien Jarno dixit:

>EastAsianWidth.txt explicitly lists the hexagrams as neutral width, so I don't

Yes it does, but neutral does NOT always mean 1. I even looked
it up today, as I was not familiar enough with neutral yet.

>Looking at the behaviour from other systems, FreeBSD and NetBSD both
>return -1 here, while OpenBSD returns 1. None of them returns 2.

-1 is utterly wrong, it’s returned for a control character… likely
missing locale support. The last time I looked at OpenBSD, they did
not have any support for anything resembling UTF-8 either, but from
what I’ve heard, they’re working on changing it.

>Therefore, can you please give a pointer explaining why the width
>should be 2 instead of 1?

I can give two pointers.

One being the presence of 4DC0 (et al.) in src:unifont (= 1:8.0.01-1)
in font/plane00/unifont-base.hex as a fullwidth (“wide”, in Unicode
speak; as I learnt today, UAX #11 “fullwidth” is a subset of “wide”
that only applies when it has a “<wide> decomposition”) character,
i.e. one with 64 nybbles, like 3000, and unlike 0041.
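
To illustrate the nybble count: a minimal sketch, assuming the usual
unifont .hex line layout of “codepoint:hexdigits”, where 32 hex digits
encode an 8×16 glyph and 64 hex digits a 16×16 one. The function name
hex_line_cells is made up for this example and is not part of unifont.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from unifont): return the number of
 * terminal cells implied by one .hex line, or -1 if the line does
 * not have the expected shape.
 */
int
hex_line_cells(const char *line)
{
	const char *p = strchr(line, ':');
	size_t n = 0;

	if (p == NULL)
		return (-1);
	/* count the hex digits (nybbles) after the colon */
	for (++p; isxdigit((unsigned char)*p); ++p)
		++n;
	/* only a newline or the end of the string may follow */
	if (*p != '\0' && *p != '\n')
		return (-1);
	if (n == 32)
		return (1);	/* 8x16 glyph: one cell */
	if (n == 64)
		return (2);	/* 16x16 glyph: two cells */
	return (-1);
}
```

Fed the unifont-base.hex lines for 0041, 3000 and 4DC0, such a check
should report one cell for the first and two cells for the other two,
if the file is as described above.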

Two being src:xterm (= 324-2) wcwidth.c, where the function mk_wcwidth
treats codepoints 0x2E80‥0xA4CF, excluding 0x303F, as wide.

The xterm argument is actually extremely strong – it’s t̲h̲e̲ single
most widely used wcwidth() implementation, copied into lots
of code that doesn’t (or can’t) rely on the system’s implementation.

Similarly, GNU libutf8 uses (relevant part of the if construct):
|| (c >= 0x2e80 && c < 0xa4d0  /* CJK ... Yi */
&& !((c & ~0x0011) == 0x300a || c == 0x303f))

This is almost the same as xterm, except that 300A, 300B, 301A and
301B are additionally excluded (which I contest: they are W(ide) in
an EastAsianWidth file I’ve got lying around, but that’s a different
topic, and glibc already gets those right anyway).
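
For comparison, a minimal sketch of the two range tests side by side.
The first function transcribes the xterm range as described in this
mail, the second wraps only the libutf8 fragment quoted above (the
rest of its if construct is not shown here). Both function names are
made up for this example.

```c
#include <stdbool.h>

/* xterm's range as described above: 2E80..A4CF, minus 303F */
bool
xterm_range_is_wide(unsigned int c)
{
	return (c >= 0x2e80 && c <= 0xa4cf && c != 0x303f);
}

/* the libutf8 fragment quoted above, wrapped in a function */
bool
libutf8_fragment_is_wide(unsigned int c)
{
	/*
	 * Clearing bits 0 and 4 maps 300A, 300B, 301A and 301B
	 * all onto 300A, so one comparison excludes all four.
	 */
	return (c >= 0x2e80 && c < 0xa4d0 &&
	    !((c & ~0x0011U) == 0x300a || c == 0x303f));
}
```

Both tests agree that 4DC0 falls into the wide range; they differ only
for the four bracket characters.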

AFAICT glibc currently has Unicode 7.0.0 data in use. When I run
my script on UCD+EAW 7.0.0, I get the following output. The format
is: bsearch form, i.e. a list of (low, high) tuples; the code first
checks for NUL, DEL, C0 and C1 control characters, then bsearches
mb_ucs_combining, then mb_ucs_fullwidth, and if it’s still not found,
the width is 1 (UAX #11 “ambiguous” is assumed narrow):

static const struct mb_ucsrange mb_ucs_combining[] = {
{ 0x0300, 0x036F },
{ 0x0483, 0x0489 },
{ 0x0591, 0x05BD },
{ 0x05BF, 0x05BF },
{ 0x05C1, 0x05C2 },
{ 0x05C4, 0x05C5 },
{ 0x05C7, 0x05C7 },
{ 0x0600, 0x0605 },
{ 0x0610, 0x061A },
{ 0x061C, 0x061C },
{ 0x064B, 0x065F },
{ 0x0670, 0x0670 },
{ 0x06D6, 0x06DD },
{ 0x06DF, 0x06E4 },
{ 0x06E7, 0x06E8 },
{ 0x06EA, 0x06ED },
{ 0x070F, 0x070F },
{ 0x0711, 0x0711 },
{ 0x0730, 0x074A },
{ 0x07A6, 0x07B0 },
{ 0x07EB, 0x07F3 },
{ 0x0816, 0x0819 },
{ 0x081B, 0x0823 },
{ 0x0825, 0x0827 },
{ 0x0829, 0x082D },
{ 0x0859, 0x085B },
{ 0x08E4, 0x0902 },
{ 0x093A, 0x093A },
{ 0x093C, 0x093C },
{ 0x0941, 0x0948 },
{ 0x094D, 0x094D },
{ 0x0951, 0x0957 },
{ 0x0962, 0x0963 },
{ 0x0981, 0x0981 },
{ 0x09BC, 0x09BC },
{ 0x09C1, 0x09C4 },
{ 0x09CD, 0x09CD },
{ 0x09E2, 0x09E3 },
{ 0x0A01, 0x0A02 },
{ 0x0A3C, 0x0A3C },
{ 0x0A41, 0x0A42 },
{ 0x0A47, 0x0A48 },
{ 0x0A4B, 0x0A4D },
{ 0x0A51, 0x0A51 },
{ 0x0A70, 0x0A71 },
{ 0x0A75, 0x0A75 },
{ 0x0A81, 0x0A82 },
{ 0x0ABC, 0x0ABC },
{ 0x0AC1, 0x0AC5 },
{ 0x0AC7, 0x0AC8 },
{ 0x0ACD, 0x0ACD },
{ 0x0AE2, 0x0AE3 },
{ 0x0B01, 0x0B01 },
{ 0x0B3C, 0x0B3C },
{ 0x0B3F, 0x0B3F },
{ 0x0B41, 0x0B44 },
{ 0x0B4D, 0x0B4D },
{ 0x0B56, 0x0B56 },
{ 0x0B62, 0x0B63 },
{ 0x0B82, 0x0B82 },
{ 0x0BC0, 0x0BC0 },
{ 0x0BCD, 0x0BCD },
{ 0x0C00, 0x0C00 },
{ 0x0C3E, 0x0C40 },
{ 0x0C46, 0x0C48 },
{ 0x0C4A, 0x0C4D },
{ 0x0C55, 0x0C56 },
{ 0x0C62, 0x0C63 },
{ 0x0C81, 0x0C81 },
{ 0x0CBC, 0x0CBC },
{ 0x0CBF, 0x0CBF },
{ 0x0CC6, 0x0CC6 },
{ 0x0CCC, 0x0CCD },
{ 0x0CE2, 0x0CE3 },
{ 0x0D01, 0x0D01 },
{ 0x0D41, 0x0D44 },
{ 0x0D4D, 0x0D4D },
{ 0x0D62, 0x0D63 },
{ 0x0DCA, 0x0DCA },
{ 0x0DD2, 0x0DD4 },
{ 0x0DD6, 0x0DD6 },
{ 0x0E31, 0x0E31 },
{ 0x0E34, 0x0E3A },
{ 0x0E47, 0x0E4E },
{ 0x0EB1, 0x0EB1 },
{ 0x0EB4, 0x0EB9 },
{ 0x0EBB, 0x0EBC },
{ 0x0EC8, 0x0ECD },
{ 0x0F18, 0x0F19 },
{ 0x0F35, 0x0F35 },
{ 0x0F37, 0x0F37 },
{ 0x0F39, 0x0F39 },
{ 0x0F71, 0x0F7E },
{ 0x0F80, 0x0F84 },
{ 0x0F86, 0x0F87 },
{ 0x0F8D, 0x0F97 },
{ 0x0F99, 0x0FBC },
{ 
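
The lookup order described before the table (controls first, then the
combining table, then the fullwidth table, else 1) can be sketched as
follows. This is only an illustration under assumptions: the member
names of struct mb_ucsrange are guessed, the combining excerpt is the
first three rows of the table above, and the fullwidth excerpt is made
up from the widths this report argues for, since the real table is cut
off in this mail.

```c
#include <stddef.h>
#include <stdlib.h>

/* member names are an assumption; the mail only shows initialisers */
struct mb_ucsrange {
	unsigned int beg;
	unsigned int end;
};

/* first rows of the combining table shown above */
static const struct mb_ucsrange combining_excerpt[] = {
	{ 0x0300, 0x036F },
	{ 0x0483, 0x0489 },
	{ 0x0591, 0x05BD },
};

/* illustrative only, not the generated data */
static const struct mb_ucsrange fullwidth_excerpt[] = {
	{ 0x3000, 0x3000 },
	{ 0x4DC0, 0x4DFF },
};

/* bsearch comparator: key is a codepoint, element is a range */
static int
range_cmp(const void *key, const void *elem)
{
	unsigned int c = *(const unsigned int *)key;
	const struct mb_ucsrange *r = elem;

	if (c < r->beg)
		return (-1);
	if (c > r->end)
		return (1);
	return (0);
}

static int
in_table(unsigned int c, const struct mb_ucsrange *t, size_t n)
{
	return (bsearch(&c, t, n, sizeof(t[0]), range_cmp) != NULL);
}

int
sketch_wcwidth(unsigned int c)
{
	if (c == 0)
		return (0);		/* NUL */
	if (c < 0x20 || (c >= 0x7F && c < 0xA0))
		return (-1);		/* C0, DEL, C1 */
	if (in_table(c, combining_excerpt,
	    sizeof(combining_excerpt) / sizeof(combining_excerpt[0])))
		return (0);
	if (in_table(c, fullwidth_excerpt,
	    sizeof(fullwidth_excerpt) / sizeof(fullwidth_excerpt[0])))
		return (2);
	return (1);			/* everything else, incl. ambiguous */
}
```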



Processed: Re: Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22

2016-06-03 Thread Debian Bug Tracking System
Processing control commands:

> tag -1 + moreinfo
Bug #826256 [locales] locales: wrong width for hexagrams (and possibly others) in 2.22
Added tag(s) moreinfo.

-- 
826256: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22

2016-06-03 Thread Aurelien Jarno
control: tag -1 + moreinfo

On 2016-06-03 19:29, Thorsten Glaser wrote:
> Package: locales
> Version: 2.22-0experimental0
> Severity: normal
> Tags: upstream
> 
> Starting with locales 2.22-0experimental0, some chars have the wrong
> width; downgrading locales to 2.21-9 fixes the bugs.
> 
> Test program:
> 
> tglase@tglase:~ $ cat x.c
> #define _XOPEN_SOURCE
> #include <locale.h>
> #include <stdio.h>
> #include <wchar.h>
> 
> #define D(x) printf("%04X %d\n",(x),wcwidth(x))
> 
> int
> main(void)
> {
>   setlocale(LC_ALL, "");
> 
>   D(0x41);
>   D(0x0300);
>   D(0x3000);
>   D(0x4DC0);
>   D(0xFFFD);
>   return (0);
> }
> tglase@tglase:~ $ gcc x.c
> tglase@tglase:~ $ rm -rf tloc; mkdir tloc
> tglase@tglase:~ $ localedef -i en_US -c -f UTF-8 tloc/en_US.UTF-8
> tglase@tglase:~ $ LOCPATH=$PWD/tloc LC_ALL=en_US.UTF-8 ./a.out
> 0041 1
> 0300 0
> 3000 2
> 4DC0 1
> FFFD 1
> 
> Output while locales_2.21-9_all.deb was installed during localedef:
> 
> tglase@tglase:~ $ LOCPATH=$PWD/tlocx LC_ALL=en_US.UTF-8 ./a.out
> 0041 1
> 0300 0
> 3000 2
> 4DC0 2
> FFFD 1
> 
> This is because /usr/share/i18n/charmaps/UTF-8.gz now lacks
> entries for 4DC0‥4DFF.
> 
> According to my own code implementing Unicode in another operating
> system, with focus on wcwidth(3), after parsing EastAsianWidth.txt
> special handling is needed to set widths of 0xFF00, 0x3248‥0x324F,
> and 0x4DC0‥0x4DFF to “wide”, as they’re “neutral” normally – which

EastAsianWidth.txt explicitly lists the hexagrams as neutral width, so I
don't think there is a bug there. Versions from Unicode 3.0 and earlier
didn't specify those characters, and the behaviour of glibc 2.21 probably
comes from there and is probably wrong.

Looking at the behaviour of other systems, FreeBSD and NetBSD both
return -1 here, while OpenBSD returns 1. None of them returns 2.

Therefore, can you please give a pointer explaining why the width
should be 2 instead of 1?

Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22

2016-06-03 Thread Thorsten Glaser
Package: locales
Version: 2.22-0experimental0
Severity: normal
Tags: upstream

Starting with locales 2.22-0experimental0, some chars have the wrong
width; downgrading locales to 2.21-9 fixes the bugs.

Test program:

tglase@tglase:~ $ cat x.c
#define _XOPEN_SOURCE
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define D(x) printf("%04X %d\n",(x),wcwidth(x))

int
main(void)
{
	setlocale(LC_ALL, "");

	D(0x41);
	D(0x0300);
	D(0x3000);
	D(0x4DC0);
	D(0xFFFD);
	return (0);
}
tglase@tglase:~ $ gcc x.c
tglase@tglase:~ $ rm -rf tloc; mkdir tloc
tglase@tglase:~ $ localedef -i en_US -c -f UTF-8 tloc/en_US.UTF-8
tglase@tglase:~ $ LOCPATH=$PWD/tloc LC_ALL=en_US.UTF-8 ./a.out
0041 1
0300 0
3000 2
4DC0 1
FFFD 1

Output while locales_2.21-9_all.deb was installed during localedef:

tglase@tglase:~ $ LOCPATH=$PWD/tlocx LC_ALL=en_US.UTF-8 ./a.out
0041 1
0300 0
3000 2
4DC0 2
FFFD 1

This is because /usr/share/i18n/charmaps/UTF-8.gz now lacks
entries for 4DC0‥4DFF.

According to my own code implementing Unicode in another operating
system, with focus on wcwidth(3), after parsing EastAsianWidth.txt
special handling is needed to set widths of 0xFF00, 0x3248‥0x324F,
and 0x4DC0‥0x4DFF to “wide”, as they’re “neutral” normally – which
can be either – but display on a fixed-width terminal is otherwise
impossible. (Chars outside the BMP were not considered – there may
be others needing such handling… personally, I’d say that at least
all emoji need to be fullwidth, but there’s no standard backing
that yet.)
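
A minimal sketch of such special handling, assuming the East Asian
Width class of each codepoint has already been parsed from
EastAsianWidth.txt. The enum and the function name are hypothetical and
not taken from MirBSD; control and combining characters are left out.

```c
/* UAX #11 classes; the enum itself is made up for this sketch */
enum eaw_class { EAW_N, EAW_A, EAW_H, EAW_NA, EAW_F, EAW_W };

int
width_from_eaw(unsigned int c, enum eaw_class k)
{
	/* the ranges listed above are forced to wide */
	if (c == 0xFF00 ||
	    (c >= 0x3248 && c <= 0x324F) ||
	    (c >= 0x4DC0 && c <= 0x4DFF))
		return (2);
	/* fullwidth and wide take two cells, everything else one */
	return ((k == EAW_F || k == EAW_W) ? 2 : 1);
}
```

With this override a hexagram such as U+4DD0 comes out as 2 even
though its class in the data file is neutral.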

Rationale here: compatibility with wcwidth(3) implementations such
as the one in xterm. (I’ve written the code in MirBSD that generates
the data for my new wcwidth(3) implementation carefully so that – when
using the same Unicode version as Markus Kuhn did – both
implementations return the same width for all characters.)
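
Such a comparison can be sketched as a loop over a codepoint range
that counts disagreements between two width functions. This is not the
MirBSD code; every name below is invented for the example, and the two
sample functions merely stand in for real implementations.

```c
#include <stddef.h>

typedef int (*width_fn)(unsigned int);

/* stand-in: treats every codepoint as narrow */
int
width_all_narrow(unsigned int c)
{
	(void)c;
	return (1);
}

/* stand-in: like the above, but hexagrams are wide */
int
width_hexagrams_wide(unsigned int c)
{
	return ((c >= 0x4DC0 && c <= 0x4DFF) ? 2 : 1);
}

/* count codepoints in [first, last] where a and b disagree */
size_t
count_width_mismatches(width_fn a, width_fn b,
    unsigned int first, unsigned int last)
{
	size_t n = 0;
	unsigned int c;

	if (first > last)
		return (0);
	for (c = first; ; ++c) {
		if (a(c) != b(c))
			++n;
		if (c == last)
			break;
	}
	return (n);
}
```

Two implementations built from the same Unicode version would be
expected to yield a count of zero over the whole BMP.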

This is especially important as I happen to use ䷐ (U+4DD0) for UI
elements, and now all I get is a half-width replacement character,
due to X11 font selection choosing the half-width font part, for a
full-width character cell with an empty right half.

-- System Information:
Debian Release: stretch/sid
  APT prefers unreleased
  APT policy: (500, 'unreleased'), (500, 'buildd-unstable'), (500, 'unstable')
Architecture: x32 (x86_64)
Foreign Architectures: i386, amd64

Kernel: Linux 4.5.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=C, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/lksh
Init: sysvinit (via /sbin/init)

Versions of packages locales depends on:
ii  debconf [debconf-2.0]  1.5.59
ii  libc-bin   2.22-10
ii  libc-l10n  2.22-10

locales recommends no packages.

locales suggests no packages.

-- debconf information:
  locales/locales_to_be_generated:
  locales/default_environment_locale: None