Re: fix wcwidth (was: Re: ls(1) multibyte support)
Stefan Sperling s...@openbsd.org wrote:

> Change number 2 is the only one that hasn't been committed yet.
> According to
> http://pubs.opengroup.org/onlinepubs/009695399/functions/wcwidth.html
> wcwidth() should return -1 for non-printable characters.
> So this change looks good to me. Anyone want to ok it?

FWIW, this fixes a bash prompt problem.  (When you have a bash prompt
that contains unprintable characters, properly bracketed by \[ \],
and are in a UTF-8 locale, bash currently becomes confused about the
width of the prompt.)

-- 
Christian "naddy" Weisgerber                          na...@mips.inka.de
Re: Update UTF-8 locale ctype data (was: Re: ls(1) multibyte support)
On Sat, Jan 15, 2011 at 12:44:51AM +0100, Stefan Sperling wrote: On Fri, Jan 14, 2011 at 05:21:46PM +0100, Stefan Sperling wrote: On Thu, Jan 06, 2011 at 07:52:19PM +0300, Alexander Polakov wrote: * Alexander Polakov polac...@gmail.com [110105 17:20]: Hi, here's an updated version. 1) en_US.UTF-8.src updates from FreeBSD Let's start with those. These changes are all fine, I checked them against Unicode 5.2. http://www.unicode.org/Public/5.2.0/charts/CodeCharts-noHan.pdf The diff below (from Alexander) brings us up to par with FreeBSD. Many updates could be made to this file to support additional characters listed in Unicode 5.2.0 (or even 6.0.0). But that can be done later. Can someone ok this? Thanks in advance. Before the ctype changes can go in, we'll need to this part from Alexander's diff to fix mklocale (caught by nicm@, thanks!) Can this go in now? Any OKs? Index: lib/libc/locale/runetype.h === RCS file: /cvs/src/lib/libc/locale/runetype.h,v retrieving revision 1.5 diff -u -p -r1.5 runetype.h --- lib/libc/locale/runetype.h 8 Oct 2007 08:17:15 - 1.5 +++ lib/libc/locale/runetype.h 14 Jan 2011 23:34:28 - @@ -69,9 +69,9 @@ typedef uint32_t _RuneType; #define_RUNETYPE_I 0x0008U /* Ideogram */ #define_RUNETYPE_T 0x0010U /* Special */ #define_RUNETYPE_Q 0x0020U /* Phonogram */ -#define_RUNETYPE_SWM 0xc000U/* Mask to get screen width data */ +#define_RUNETYPE_SWM 0xe000U /* Mask to get screen width data */ #define_RUNETYPE_SWS 30 /* Bits to shift to get width */ -#define_RUNETYPE_SW0 0xU /* 0 width character */ +#define_RUNETYPE_SW0 0x2000U /* 0 width character */ #define_RUNETYPE_SW1 0x4000U /* 1 width character */ #define_RUNETYPE_SW2 0x8000U /* 2 width character */ #define_RUNETYPE_SW3 0xc000U /* 3 width character */ Index: share/locale/ctype/en_US.UTF-8.src === RCS file: /cvs/src/share/locale/ctype/en_US.UTF-8.src,v retrieving revision 1.1 diff -u -p -r1.1 en_US.UTF-8.src --- share/locale/ctype/en_US.UTF-8.src 7 Aug 2005 10:03:45 - 1.1 +++ 
share/locale/ctype/en_US.UTF-8.src 15 Jan 2011 15:49:26 - @@ -491,9 +491,9 @@ SWIDTH1 0x02b0 - 0x02ee * U+0300 - U+036F : Combining Diacritical Marks */ -GRAPH 0x0300 - 0x034f 0x0360 - 0x036f -PRINT 0x0300 - 0x034f 0x0360 - 0x036f -SWIDTH1 0x0300 - 0x034f 0x0360 - 0x036f +GRAPH 0x0300 - 0x034e 0x0350 - 0x036f +PRINT 0x0300 - 0x034e 0x0350 - 0x036f +SWIDTH0 0x0300 - 0x034e 0x0350 - 0x036f MAPUPPER 0x0345 0x0399 @@ -583,7 +583,7 @@ LOWER 0x04b9 0x04bb 0x04bd 0x04bf LOWER 0x04c8 0x04ca 0x04cc 0x04ce 0x04d1 0x04d3 0x04d5 LOWER 0x04d7 0x04d9 0x04db 0x04dd 0x04df 0x04e1 0x04e3 LOWER 0x04e5 0x04e7 0x04e9 0x04eb 0x04ed 0x04ef 0x04f1 -LOWER 0x04f3 0x04f5 0x04f9 +LOWER 0x04f3 0x04f5 0x04f7 0x04f9 PUNCT 0x0482 UPPER 0x0400 - 0x042f 0x0460 0x0462 0x0464 0x0466 0x0468 UPPER 0x046a 0x046c 0x046e 0x0470 0x0472 0x0474 0x0476 @@ -595,9 +595,10 @@ UPPER 0x04b8 0x04ba 0x04bc 0x04be UPPER 0x04c5 0x04c7 0x04c9 0x04cb 0x04cd 0x04d0 0x04d2 UPPER 0x04d4 0x04d6 0x04d8 0x04da 0x04dc 0x04de 0x04e0 UPPER 0x04e2 0x04e4 0x04e6 0x04e8 0x04ea 0x04ec 0x04ee -UPPER 0x04f0 0x04f2 0x04f4 0x04f8 -PRINT 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 -SWIDTH1 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 +UPPER 0x04f0 0x04f2 0x04f4 0x04f6 0x04f8 +PRINT 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f9 +SWIDTH0 0x0483 - 0x0486 0x0488 - 0x0489 +SWIDTH1 0x0400 - 0x0482 0x048a - 0x04ce 0x04d0 - 0x04f9 MAPUPPER 0x0430 - 0x044f : 0x0410 MAPUPPER 0x0450 - 0x045f : 0x0400 @@ -671,6 +672,7 @@ MAPUPPER 0x04ef 0x04ee MAPUPPER 0x04f1 0x04f0 MAPUPPER 0x04f3 0x04f2 MAPUPPER 0x04f5 0x04f4 +MAPUPPER 0x04f7 0x04f6 MAPUPPER 0x04f9 0x04f8 MAPLOWER 0x0400 - 0x040f : 0x0450 MAPLOWER 0x0410 - 0x042f : 0x0430 @@ -744,6 +746,7 @@ MAPLOWER 0x04ee 0x04ef MAPLOWER 0x04f0 0x04f1 MAPLOWER 0x04f2 0x04f3 MAPLOWER 0x04f4 0x04f5 +MAPLOWER 0x04f6 0x04f7 MAPLOWER 0x04f8 0x04f9 @@ -1052,7 +1055,8 @@ DIGIT 0x0e50 - 0x0e59 GRAPH 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b PUNCT 0x0e3f 0x0e4f 0x0e5a 0x0e5b PRINT 0x0e01 - 
0x0e3a 0x0e3f - 0x0e5b -SWIDTH1 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b +SWIDTH0 0x0e31 0x0e34 - 0x0e3a 0x0e47 - 0x0e4e +SWIDTH1 0x0e01 - 0x0e30 0x0e32 - 0x0e33 0x0e3f - 0x0e46 0x0e4f - 0x0e5b TODIGIT0x0e50 - 0x0e59 : 0x @@ -1283,6 +1287,14 @@ SWIDTH1 0x1800 - 0x180d
Update UTF-8 locale ctype data (was: Re: ls(1) multibyte support)
On Thu, Jan 06, 2011 at 07:52:19PM +0300, Alexander Polakov wrote: * Alexander Polakov polac...@gmail.com [110105 17:20]: Hi, here's an updated version. 1) en_US.UTF-8.src updates from FreeBSD Let's start with those. These changes are all fine, I checked them against Unicode 5.2. http://www.unicode.org/Public/5.2.0/charts/CodeCharts-noHan.pdf The diff below (from Alexander) brings us up to par with FreeBSD. Many updates could be made to this file to support additional characters listed in Unicode 5.2.0 (or even 6.0.0). But that can be done later. Can someone ok this? Thanks in advance. Index: share/locale/ctype/en_US.UTF-8.src === RCS file: /OpenBSD/src/share/locale/ctype/en_US.UTF-8.src,v retrieving revision 1.1 diff -u -r1.1 en_US.UTF-8.src --- share/locale/ctype/en_US.UTF-8.src 7 Aug 2005 10:03:45 - 1.1 +++ share/locale/ctype/en_US.UTF-8.src 6 Jan 2011 16:24:39 - @@ -491,9 +491,9 @@ * U+0300 - U+036F : Combining Diacritical Marks */ -GRAPH 0x0300 - 0x034f 0x0360 - 0x036f -PRINT 0x0300 - 0x034f 0x0360 - 0x036f -SWIDTH1 0x0300 - 0x034f 0x0360 - 0x036f +GRAPH 0x0300 - 0x034e 0x0350 - 0x036f +PRINT 0x0300 - 0x034e 0x0350 - 0x036f +SWIDTH0 0x0300 - 0x034e 0x0350 - 0x036f MAPUPPER 0x0345 0x0399 @@ -583,7 +583,7 @@ LOWER 0x04c8 0x04ca 0x04cc 0x04ce 0x04d1 0x04d3 0x04d5 LOWER 0x04d7 0x04d9 0x04db 0x04dd 0x04df 0x04e1 0x04e3 LOWER 0x04e5 0x04e7 0x04e9 0x04eb 0x04ed 0x04ef 0x04f1 -LOWER 0x04f3 0x04f5 0x04f9 +LOWER 0x04f3 0x04f5 0x04f7 0x04f9 PUNCT 0x0482 UPPER 0x0400 - 0x042f 0x0460 0x0462 0x0464 0x0466 0x0468 UPPER 0x046a 0x046c 0x046e 0x0470 0x0472 0x0474 0x0476 @@ -595,9 +595,10 @@ UPPER 0x04c5 0x04c7 0x04c9 0x04cb 0x04cd 0x04d0 0x04d2 UPPER 0x04d4 0x04d6 0x04d8 0x04da 0x04dc 0x04de 0x04e0 UPPER 0x04e2 0x04e4 0x04e6 0x04e8 0x04ea 0x04ec 0x04ee -UPPER 0x04f0 0x04f2 0x04f4 0x04f8 -PRINT 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 -SWIDTH1 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 +UPPER 0x04f0 0x04f2 0x04f4 0x04f6 0x04f8 +PRINT 
0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f9 +SWIDTH0 0x0483 - 0x0486 0x0488 - 0x0489 +SWIDTH1 0x0400 - 0x0482 0x048a - 0x04ce 0x04d0 - 0x04f9 MAPUPPER 0x0430 - 0x044f : 0x0410 MAPUPPER 0x0450 - 0x045f : 0x0400 @@ -671,6 +672,7 @@ MAPUPPER 0x04f1 0x04f0 MAPUPPER 0x04f3 0x04f2 MAPUPPER 0x04f5 0x04f4 +MAPUPPER 0x04f7 0x04f6 MAPUPPER 0x04f9 0x04f8 MAPLOWER 0x0400 - 0x040f : 0x0450 MAPLOWER 0x0410 - 0x042f : 0x0430 @@ -744,6 +746,7 @@ MAPLOWER 0x04f0 0x04f1 MAPLOWER 0x04f2 0x04f3 MAPLOWER 0x04f4 0x04f5 +MAPLOWER 0x04f6 0x04f7 MAPLOWER 0x04f8 0x04f9 @@ -1052,7 +1055,8 @@ GRAPH 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b PUNCT 0x0e3f 0x0e4f 0x0e5a 0x0e5b PRINT 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b -SWIDTH1 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b +SWIDTH0 0x0e31 0x0e34 - 0x0e3a 0x0e47 - 0x0e4e +SWIDTH1 0x0e01 - 0x0e30 0x0e32 - 0x0e33 0x0e3f - 0x0e46 0x0e4f - 0x0e5b TODIGIT0x0e50 - 0x0e59 : 0x @@ -1283,6 +1287,14 @@ TODIGIT0x1810 - 0x1819 : 0x +/* + * U+1DC0 - U+1DFF : Combining Diacritical Marks Supplement + */ + +GRAPH 0x1DC0 - 0x1DC3 +PRINT 0x1DC0 - 0x1DC3 +SWIDTH0 0x1DC0 - 0x1DC3 + /* * U+1E00 - U+1EFF : Latin Extended Additional @@ -1672,7 +1684,8 @@ BLANK 0x2000 - 0x200b 0x202f 0x205f PRINT 0x2000 - 0x200b 0x2010 - 0x2029 0x202f - 0x2052 0x2057 PRINT 0x205f -SWIDTH1 0x2000 - 0x200b 0x2010 - 0x2029 0x202f - 0x2052 0x2057 +SWIDTH1 0x2000 - 0x200a 0x2010 - 0x2029 0x202f - 0x2052 0x2057 +SWIDTH0 0x200b - 0x200d SWIDTH1 0x205f @@ -1707,9 +1720,9 @@ * U+20D0 - U+20FF : Combining Diacritical Marks for Symbols */ -GRAPH 0x20d0 - 0x20ea -PRINT 0x20d0 - 0x20ea -SWIDTH1 0x20d0 - 0x20ea +GRAPH 0x20d0 - 0x20eb +PRINT 0x20d0 - 0x20eb +SWIDTH0 0x20d0 - 0x20eb /* @@ -1987,7 +2000,8 @@ PUNCT 0x309b 0x309c PRINT 0x3041 - 0x3096 0x3099 - 0x309f PHONOGRAM 0x3041 - 0x3096 0x309f -SWIDTH2 0x3041 - 0x3096 0x3099 - 0x309f +SWIDTH0 0x3099 - 0x309a +SWIDTH2 0x3041 - 0x3096 0x309b - 0x309f /* @@ -2211,7 +2225,7 @@ GRAPH 0xfe20 - 0xfe23 PRINT 0xfe20 - 0xfe23 -SWIDTH1 0xfe20 - 0xfe23 +SWIDTH0 0xfe20 - 0xfe23 /* @@ 
-2333,8 +2347,13 @@ GRAPH 0x1d100 - 0x1d126 0x1d12a - 0x1d172 0x1d17b - 0x1d1dd PUNCT 0x1d100 - 0x1d126 0x1d12a - 0x1d164 0x1d16a - 0x1d16c PUNCT 0x1d183 0x1d184 0x1d18c - 0x1d1a9 0x1d1ae - 0x1d1dd -PRINT 0x1d100 - 0x1d126 0x1d12a - 0x1d172 0x1d17b - 0x1d1dd -SWIDTH1 0x1d100 - 0x1d126 0x1d12a - 0x1d172 0x1d17b - 0x1d1dd +PRINT 0x1d100 - 0x1d126 0x1d12a - 0x1d158 0x1d15a - 0x1d172 +PRINT 0x1d17b - 0x1d1dd +SWIDTH0 0x1d165
Re: Update UTF-8 locale ctype data (was: Re: ls(1) multibyte support)
On Fri, Jan 14, 2011 at 05:21:46PM +0100, Stefan Sperling wrote:
> On Thu, Jan 06, 2011 at 07:52:19PM +0300, Alexander Polakov wrote:
> > * Alexander Polakov polac...@gmail.com [110105 17:20]:
> > > Hi, here's an updated version.
> > >
> > > 1) en_US.UTF-8.src updates from FreeBSD
>
> Let's start with those.
> These changes are all fine, I checked them against Unicode 5.2.
> http://www.unicode.org/Public/5.2.0/charts/CodeCharts-noHan.pdf
>
> The diff below (from Alexander) brings us up to par with FreeBSD.
> Many updates could be made to this file to support additional characters
> listed in Unicode 5.2.0 (or even 6.0.0). But that can be done later.
>
> Can someone ok this? Thanks in advance.

Before the ctype changes can go in, we'll need this part from
Alexander's diff to fix mklocale (caught by nicm@, thanks!)

These symbols are internal to libc, with the exception of mklocale.
Can this go in during ABI lock?

Index: lib/libc/locale/runetype.h
===================================================================
RCS file: /OpenBSD/src/lib/libc/locale/runetype.h,v
retrieving revision 1.5
diff -u -r1.5 runetype.h
--- lib/libc/locale/runetype.h	8 Oct 2007 08:17:15 -0000	1.5
+++ lib/libc/locale/runetype.h	6 Jan 2011 16:24:20 -0000
@@ -69,9 +69,9 @@
 #define	_RUNETYPE_I	0x00080000U	/* Ideogram */
 #define	_RUNETYPE_T	0x00100000U	/* Special */
 #define	_RUNETYPE_Q	0x00200000U	/* Phonogram */
-#define	_RUNETYPE_SWM	0xc0000000U	/* Mask to get screen width data */
+#define	_RUNETYPE_SWM	0xe0000000U	/* Mask to get screen width data */
 #define	_RUNETYPE_SWS	30		/* Bits to shift to get width */
-#define	_RUNETYPE_SW0	0x00000000U	/* 0 width character */
+#define	_RUNETYPE_SW0	0x20000000U	/* 0 width character */
 #define	_RUNETYPE_SW1	0x40000000U	/* 1 width character */
 #define	_RUNETYPE_SW2	0x80000000U	/* 2 width character */
 #define	_RUNETYPE_SW3	0xc0000000U	/* 3 width character */
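For readers following along, here is a small sketch of why the mask and the zero-width flag have to change together. It assumes 32-bit rune type words with the width stored in the top bits, as the shift of 30 suggests. The SKETCH_* names and the helper functions are made up for this illustration; this is not the libc or mklocale code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical copies of the constants, for illustration only. */
#define SKETCH_SWS	30		/* bits to shift to get width */
#define SKETCH_OLD_SWM	0xc0000000U	/* old mask */
#define SKETCH_OLD_SW0	0x00000000U	/* old "0 width" flag */
#define SKETCH_NEW_SWM	0xe0000000U	/* new mask, one extra bit */
#define SKETCH_NEW_SW0	0x20000000U	/* new "0 width" flag, non-zero */
#define SKETCH_SW1	0x40000000U
#define SKETCH_SW2	0x80000000U

/* Width stored in a rune type word. */
int
sketch_width(uint32_t r, uint32_t mask)
{
	return (int)((r & mask) >> SKETCH_SWS);
}

/* Has any width flag been recorded in this rune type word? */
int
sketch_width_is_set(uint32_t r, uint32_t mask)
{
	return (r & mask) != 0;
}

/* Checks the properties discussed in the thread. */
void
sketch_selfcheck(void)
{
	/* Old layout: "width 0" looks exactly like "no width recorded". */
	assert(!sketch_width_is_set(SKETCH_OLD_SW0, SKETCH_OLD_SWM));
	/* New layout: "width 0" is a distinct, non-zero flag ... */
	assert(sketch_width_is_set(SKETCH_NEW_SW0, SKETCH_NEW_SWM));
	/* ... which still decodes to 0 after the shift. */
	assert(sketch_width(SKETCH_NEW_SW0, SKETCH_NEW_SWM) == 0);
	/* The old mask would drop the new flag bit entirely. */
	assert((SKETCH_NEW_SW0 & SKETCH_OLD_SWM) != SKETCH_NEW_SW0);
}
```

With the wider mask a comparison of the masked value against the zero-width flag can succeed, while the shift by 30 still yields 0, 1, 2 or 3 for the four width flags.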
Re: ls(1) multibyte support
* Alexander Polakov polac...@gmail.com [110105 17:20]: Hi, here's an updated version. 1) en_US.UTF-8.src updates from FreeBSD 2) wcwidth() changed to use the same code as iswprint() * maybe just use iswprint() itself? 3) _RUNETYPE_SW0 changed to be !0 (and match FreeBSD). 0 value is used in mklocale to perform additional checks required for MAPLOWER and MAPUPPER, but not SWIDTHx. 4) _RUNETYPE_SWM changed to make (r_RUNETYPE_SWM) == _RUNETYPE_SW0 work Index: lib/libc/locale/iswctype.c === RCS file: /OpenBSD/src/lib/libc/locale/iswctype.c,v retrieving revision 1.1 diff -u -r1.1 iswctype.c --- lib/libc/locale/iswctype.c 7 Aug 2005 10:16:23 - 1.1 +++ lib/libc/locale/iswctype.c 6 Jan 2011 16:24:20 - @@ -170,7 +170,9 @@ int wcwidth(wchar_t c) { -return (((unsigned)__runetype_w(c) _CTYPE_SWM) _CTYPE_SWS); + if (__isctype_w((c), _CTYPE_R)) + return (((unsigned)__runetype_w(c) _CTYPE_SWM) _CTYPE_SWS); + return -1; } wctrans_t Index: lib/libc/locale/runetype.h === RCS file: /OpenBSD/src/lib/libc/locale/runetype.h,v retrieving revision 1.5 diff -u -r1.5 runetype.h --- lib/libc/locale/runetype.h 8 Oct 2007 08:17:15 - 1.5 +++ lib/libc/locale/runetype.h 6 Jan 2011 16:24:20 - @@ -69,9 +69,9 @@ #define_RUNETYPE_I 0x0008U /* Ideogram */ #define_RUNETYPE_T 0x0010U /* Special */ #define_RUNETYPE_Q 0x0020U /* Phonogram */ -#define_RUNETYPE_SWM 0xc000U/* Mask to get screen width data */ +#define_RUNETYPE_SWM 0xe000U /* Mask to get screen width data */ #define_RUNETYPE_SWS 30 /* Bits to shift to get width */ -#define_RUNETYPE_SW0 0xU /* 0 width character */ +#define_RUNETYPE_SW0 0x2000U /* 0 width character */ #define_RUNETYPE_SW1 0x4000U /* 1 width character */ #define_RUNETYPE_SW2 0x8000U /* 2 width character */ #define_RUNETYPE_SW3 0xc000U /* 3 width character */ Index: share/locale/ctype/en_US.UTF-8.src === RCS file: /OpenBSD/src/share/locale/ctype/en_US.UTF-8.src,v retrieving revision 1.1 diff -u -r1.1 en_US.UTF-8.src --- share/locale/ctype/en_US.UTF-8.src 7 Aug 2005 10:03:45 - 1.1 
+++ share/locale/ctype/en_US.UTF-8.src 6 Jan 2011 16:24:39 - @@ -491,9 +491,9 @@ * U+0300 - U+036F : Combining Diacritical Marks */ -GRAPH 0x0300 - 0x034f 0x0360 - 0x036f -PRINT 0x0300 - 0x034f 0x0360 - 0x036f -SWIDTH1 0x0300 - 0x034f 0x0360 - 0x036f +GRAPH 0x0300 - 0x034e 0x0350 - 0x036f +PRINT 0x0300 - 0x034e 0x0350 - 0x036f +SWIDTH0 0x0300 - 0x034e 0x0350 - 0x036f MAPUPPER 0x0345 0x0399 @@ -583,7 +583,7 @@ LOWER 0x04c8 0x04ca 0x04cc 0x04ce 0x04d1 0x04d3 0x04d5 LOWER 0x04d7 0x04d9 0x04db 0x04dd 0x04df 0x04e1 0x04e3 LOWER 0x04e5 0x04e7 0x04e9 0x04eb 0x04ed 0x04ef 0x04f1 -LOWER 0x04f3 0x04f5 0x04f9 +LOWER 0x04f3 0x04f5 0x04f7 0x04f9 PUNCT 0x0482 UPPER 0x0400 - 0x042f 0x0460 0x0462 0x0464 0x0466 0x0468 UPPER 0x046a 0x046c 0x046e 0x0470 0x0472 0x0474 0x0476 @@ -595,9 +595,10 @@ UPPER 0x04c5 0x04c7 0x04c9 0x04cb 0x04cd 0x04d0 0x04d2 UPPER 0x04d4 0x04d6 0x04d8 0x04da 0x04dc 0x04de 0x04e0 UPPER 0x04e2 0x04e4 0x04e6 0x04e8 0x04ea 0x04ec 0x04ee -UPPER 0x04f0 0x04f2 0x04f4 0x04f8 -PRINT 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 -SWIDTH1 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 +UPPER 0x04f0 0x04f2 0x04f4 0x04f6 0x04f8 +PRINT 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f9 +SWIDTH0 0x0483 - 0x0486 0x0488 - 0x0489 +SWIDTH1 0x0400 - 0x0482 0x048a - 0x04ce 0x04d0 - 0x04f9 MAPUPPER 0x0430 - 0x044f : 0x0410 MAPUPPER 0x0450 - 0x045f : 0x0400 @@ -671,6 +672,7 @@ MAPUPPER 0x04f1 0x04f0 MAPUPPER 0x04f3 0x04f2 MAPUPPER 0x04f5 0x04f4 +MAPUPPER 0x04f7 0x04f6 MAPUPPER 0x04f9 0x04f8 MAPLOWER 0x0400 - 0x040f : 0x0450 MAPLOWER 0x0410 - 0x042f : 0x0430 @@ -744,6 +746,7 @@ MAPLOWER 0x04f0 0x04f1 MAPLOWER 0x04f2 0x04f3 MAPLOWER 0x04f4 0x04f5 +MAPLOWER 0x04f6 0x04f7 MAPLOWER 0x04f8 0x04f9 @@ -1052,7 +1055,8 @@ GRAPH 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b PUNCT 0x0e3f 0x0e4f 0x0e5a 0x0e5b PRINT 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b -SWIDTH1 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b +SWIDTH0 0x0e31 0x0e34 - 0x0e3a 0x0e47 - 0x0e4e +SWIDTH1 0x0e01 - 0x0e30 0x0e32 - 0x0e33 0x0e3f 
- 0x0e46 0x0e4f - 0x0e5b TODIGIT0x0e50 - 0x0e59 : 0x @@ -1283,6 +1287,14 @@ TODIGIT0x1810 - 0x1819 : 0x +/* + * U+1DC0 - U+1DFF : Combining Diacritical Marks
Re: ls(1) multibyte support
Hi, here's an updated version. 1) en_US.UTF-8.src updates from FreeBSD 2) wcwidth() changed to use the same code as iswprint() * maybe just use iswprint() itself? 3) _RUNETYPE_SW0 changed to be !0 (and match FreeBSD). 0 value is used in mklocale to perform additional checks required for MAPLOWER and MAPUPPER, but not SWIDTHx. --- share/locale/ctype/en_US.UTF-8.src Wed Jan 5 12:37:22 2011 +++ share/locale/ctype/en_US.UTF-8.src Wed Jan 5 09:47:56 2011 @@ -491,9 +491,9 @@ * U+0300 - U+036F : Combining Diacritical Marks */ -GRAPH 0x0300 - 0x034f 0x0360 - 0x036f -PRINT 0x0300 - 0x034f 0x0360 - 0x036f -SWIDTH1 0x0300 - 0x034f 0x0360 - 0x036f +GRAPH 0x0300 - 0x034e 0x0350 - 0x036f +PRINT 0x0300 - 0x034e 0x0350 - 0x036f +SWIDTH0 0x0300 - 0x034e 0x0350 - 0x036f MAPUPPER 0x0345 0x0399 @@ -583,7 +583,7 @@ LOWER 0x04c8 0x04ca 0x04cc 0x04ce 0x04d1 0x04d3 0x04d5 LOWER 0x04d7 0x04d9 0x04db 0x04dd 0x04df 0x04e1 0x04e3 LOWER 0x04e5 0x04e7 0x04e9 0x04eb 0x04ed 0x04ef 0x04f1 -LOWER 0x04f3 0x04f5 0x04f9 +LOWER 0x04f3 0x04f5 0x04f7 0x04f9 PUNCT 0x0482 UPPER 0x0400 - 0x042f 0x0460 0x0462 0x0464 0x0466 0x0468 UPPER 0x046a 0x046c 0x046e 0x0470 0x0472 0x0474 0x0476 @@ -595,9 +595,10 @@ UPPER 0x04c5 0x04c7 0x04c9 0x04cb 0x04cd 0x04d0 0x04d2 UPPER 0x04d4 0x04d6 0x04d8 0x04da 0x04dc 0x04de 0x04e0 UPPER 0x04e2 0x04e4 0x04e6 0x04e8 0x04ea 0x04ec 0x04ee -UPPER 0x04f0 0x04f2 0x04f4 0x04f8 -PRINT 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 -SWIDTH1 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f5 0x04f8 0x04f9 +UPPER 0x04f0 0x04f2 0x04f4 0x04f6 0x04f8 +PRINT 0x0400 - 0x0486 0x0488 - 0x04ce 0x04d0 - 0x04f9 +SWIDTH0 0x0483 - 0x0486 0x0488 - 0x0489 +SWIDTH1 0x0400 - 0x0482 0x048a - 0x04ce 0x04d0 - 0x04f9 MAPUPPER 0x0430 - 0x044f : 0x0410 MAPUPPER 0x0450 - 0x045f : 0x0400 @@ -671,6 +672,7 @@ MAPUPPER 0x04f1 0x04f0 MAPUPPER 0x04f3 0x04f2 MAPUPPER 0x04f5 0x04f4 +MAPUPPER 0x04f7 0x04f6 MAPUPPER 0x04f9 0x04f8 MAPLOWER 0x0400 - 0x040f : 0x0450 MAPLOWER 0x0410 - 0x042f : 0x0430 @@ -744,6 
+746,7 @@ MAPLOWER 0x04f0 0x04f1 MAPLOWER 0x04f2 0x04f3 MAPLOWER 0x04f4 0x04f5 +MAPLOWER 0x04f6 0x04f7 MAPLOWER 0x04f8 0x04f9 @@ -1052,7 +1055,8 @@ GRAPH 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b PUNCT 0x0e3f 0x0e4f 0x0e5a 0x0e5b PRINT 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b -SWIDTH1 0x0e01 - 0x0e3a 0x0e3f - 0x0e5b +SWIDTH0 0x0e31 0x0e34 - 0x0e3a 0x0e47 - 0x0e4e +SWIDTH1 0x0e01 - 0x0e30 0x0e32 - 0x0e33 0x0e3f - 0x0e46 0x0e4f - 0x0e5b TODIGIT0x0e50 - 0x0e59 : 0x @@ -1283,7 +1287,15 @@ TODIGIT0x1810 - 0x1819 : 0x +/* + * U+1DC0 - U+1DFF : Combining Diacritical Marks Supplement + */ +GRAPH 0x1DC0 - 0x1DC3 +PRINT 0x1DC0 - 0x1DC3 +SWIDTH0 0x1DC0 - 0x1DC3 + + /* * U+1E00 - U+1EFF : Latin Extended Additional */ @@ -1672,7 +1684,8 @@ BLANK 0x2000 - 0x200b 0x202f 0x205f PRINT 0x2000 - 0x200b 0x2010 - 0x2029 0x202f - 0x2052 0x2057 PRINT 0x205f -SWIDTH1 0x2000 - 0x200b 0x2010 - 0x2029 0x202f - 0x2052 0x2057 +SWIDTH1 0x2000 - 0x200a 0x2010 - 0x2029 0x202f - 0x2052 0x2057 +SWIDTH0 0x200b - 0x200d SWIDTH1 0x205f @@ -1707,9 +1720,9 @@ * U+20D0 - U+20FF : Combining Diacritical Marks for Symbols */ -GRAPH 0x20d0 - 0x20ea -PRINT 0x20d0 - 0x20ea -SWIDTH1 0x20d0 - 0x20ea +GRAPH 0x20d0 - 0x20eb +PRINT 0x20d0 - 0x20eb +SWIDTH0 0x20d0 - 0x20eb /* @@ -1987,7 +2000,8 @@ PUNCT 0x309b 0x309c PRINT 0x3041 - 0x3096 0x3099 - 0x309f PHONOGRAM 0x3041 - 0x3096 0x309f -SWIDTH2 0x3041 - 0x3096 0x3099 - 0x309f +SWIDTH0 0x3099 - 0x309a +SWIDTH2 0x3041 - 0x3096 0x309b - 0x309f /* @@ -2211,7 +2225,7 @@ GRAPH 0xfe20 - 0xfe23 PRINT 0xfe20 - 0xfe23 -SWIDTH1 0xfe20 - 0xfe23 +SWIDTH0 0xfe20 - 0xfe23 /* @@ -2333,8 +2347,13 @@ GRAPH 0x1d100 - 0x1d126 0x1d12a - 0x1d172 0x1d17b - 0x1d1dd PUNCT 0x1d100 - 0x1d126 0x1d12a - 0x1d164 0x1d16a - 0x1d16c PUNCT 0x1d183 0x1d184 0x1d18c - 0x1d1a9 0x1d1ae - 0x1d1dd -PRINT 0x1d100 - 0x1d126 0x1d12a - 0x1d172 0x1d17b - 0x1d1dd -SWIDTH1 0x1d100 - 0x1d126 0x1d12a - 0x1d172 0x1d17b - 0x1d1dd +PRINT 0x1d100 - 0x1d126 0x1d12a - 0x1d158 0x1d15a - 0x1d172 +PRINT 0x1d17b - 0x1d1dd +SWIDTH0 0x1d165 - 
0x1d169 0x1d16d - 0x1d172 0x1d17b - 0x1d182 +SWIDTH0 0x1d185 - 0x1d18b 0x1d1aa - 0x1d1ad +SWIDTH1 0x1d100 - 0x1d126 0x1d12a - 0x1d158 0x1d15a - 0x1d164 +SWIDTH1 0x1d16a - 0x1d16c 0x1d183 0x1d184 0x1d18c - 0x1d1a9 +SWIDTH1 0x1d1ae - 0x1d1dd /* --- lib/libc/locale/iswctype.c.orig Tue Jan 4 23:12:23 2011 +++ lib/libc/locale/iswctype.c Wed Jan 5 10:02:36 2011 @@ -170,7 +170,9 @@ int wcwidth(wchar_t c) { -return (((unsigned)__runetype_w(c) _CTYPE_SWM)
Re: ls(1) multibyte support
On Tue, Jan 04, 2011 at 09:14:51PM +0300, Alexander Polakov wrote:
> Hi,
> I wonder if there are any plans on adding multibyte support for ls(1)?
> Or maybe there's a reason why it's not a great idea (which I am not
> aware of)?
> Anyway, here's a patch I have. It's based on DragonFlyBSD's ls.

Any locale stuff added to applications that are used on the ramdisk
(bsd.rd) must be inside #ifndef SMALL. The ls binary is linked statically
so we need to prevent it from wasting space by pulling citrus stuff
onto the ramdisk.

More importantly, there is an alleged bug in our wcwidth() implementation.
I haven't had time to investigate, but it has been pointed out on
separate occasions, by Jordi Beltran Creix and by n...@.

Test program (from Jordi):

#include <stdio.h>
#include <locale.h>
#include <wchar.h>	/* wcwidth() */

int
main(void)
{
	setlocale(LC_ALL, "");
	printf("%d %d %d %d\n", wcwidth(0x53DA), wcwidth('A'),
	    wcwidth(0x200B), wcwidth(0x1F));
	return 0;
}

Output is 2, 1, 1, 0, should be 2, 1, 0, -1 (according to Jordi).

We should make sure that wcwidth() is working properly before changing
applications to use it. We also need a wcwidth() man page.

FWIW, below is a diff that Jordi sent me some time ago to fix ls(1).
It also depends on wcwidth().

Index: ls.c
===================================================================
RCS file: /cvs/src/bin/ls/ls.c,v
retrieving revision 1.35
diff -u -p -r1.35 ls.c
--- ls.c	27 Oct 2009 23:59:21 -0000	1.35
+++ ls.c	7 Aug 2010 09:16:03 -0000
@@ -48,6 +48,8 @@
 #include <string.h>
 #include <unistd.h>
 #include <util.h>
+#include <locale.h>
+#include <wchar.h>
 
 #include "ls.h"
 #include "extern.h"
@@ -102,6 +104,10 @@ ls_main(int argc, char *argv[])
 	int kflag = 0;
 	char *p;
 
+#ifndef SMALL
+	setlocale(LC_ALL, "");
+
+#endif
 	/* Terminal defaults to -Cq, non-terminal defaults to -1. */
 	if (isatty(STDOUT_FILENO)) {
 		if ((p = getenv("COLUMNS")) != NULL)
@@ -396,6 +402,32 @@ traverse(int argc, char *argv[], int opt
 		err(1, "fts_read");
 }
 
+#ifndef SMALL
+static int
+mbswidth(const char *s)
+{
+	wchar_t wc;
+	size_t wclen;
+	mbstate_t mbs;
+	int width = 0;
+
+	bzero(&mbs, sizeof(mbs));
+
+	while (*s) {
+		wclen = mbrtowc(&wc, s, MB_CUR_MAX, &mbs);
+		if (wclen < 0 || !iswprint(wc)) {
+			if (wclen < 0)
+				wclen = 1;
+			width++;
+		} else {
+			width += wcwidth(wc);
+		}
+		s += wclen;
+	}
+	return width;
+}
+#endif
+
 /*
  * Display() takes a linked list of FTSENT structures and passes the list
  * along with any other necessary information to the print function.  P
@@ -458,8 +490,13 @@ display(FTSENT *p, FTSENT *list)
 				continue;
 			}
 		}
+#ifndef SMALL
+		if (mbswidth(cur->fts_name) > maxlen)
+			maxlen = mbswidth(cur->fts_name);
+#else
 		if (cur->fts_namelen > maxlen)
 			maxlen = cur->fts_namelen;
+#endif
 		if (needstats) {
 			sp = cur->fts_statp;
 			if (sp->st_blocks > maxblock)
Index: util.c
===================================================================
RCS file: /cvs/src/bin/ls/util.c,v
retrieving revision 1.14
diff -u -p -r1.14 util.c
--- util.c	27 Oct 2009 23:59:21 -0000	1.14
+++ util.c	7 Aug 2010 09:16:03 -0000
@@ -41,6 +41,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <wchar.h>
 
 #include "ls.h"
 #include "extern.h"
@@ -49,9 +50,26 @@ int
 putname(char *name)
 {
 	int len;
-
+#ifndef SMALL
+	size_t wclen;
+	wchar_t wc;
+	mbstate_t mbs;
+
+	bzero(&mbs, sizeof(mbs));
+	for (len = 0; *name; len += wcwidth(wc), name += wclen) {
+		wclen=mbrtowc(&wc, name, MB_CUR_MAX, &mbs);
+		if (wclen < 0) {
+			wclen = 1;
+			wc = '?';
+		} else {
+			wc = (!iswprint(wc) && f_nonprint) ? '?' : wc;
+		}
+		putwchar(wc);
+	}
+#else
 	for (len = 0; *name; len++, name++)
 		putchar((!isprint(*name) && f_nonprint) ? '?' : *name);
+#endif
 	return len;
 }
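One thing worth keeping in mind when reviewing the mbswidth() part of the diff above: mbrtowc() returns a size_t and reports errors as (size_t)-1 and (size_t)-2, so a check that compares the result against 0 as if it were signed cannot fire. Below is a minimal sketch of the same column-counting loop with explicit error checks. It is not Jordi's code; sketch_mbswidth(), width_fn and ascii_width() are hypothetical names, and the width function is passed in as a callback so the loop does not depend on the system wcwidth() that is under discussion here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>	/* MB_CUR_MAX */
#include <string.h>	/* memset() */
#include <wchar.h>	/* mbrtowc(), mbstate_t */

/* Hypothetical callback: columns used by wc, or -1 if it is unprintable. */
typedef int (*width_fn)(wchar_t);

/* Sample callback: printable ASCII is one column, everything else -1. */
int
ascii_width(wchar_t wc)
{
	return (wc >= 0x20 && wc < 0x7f) ? 1 : -1;
}

/* Count display columns of a multibyte string in the current locale. */
int
sketch_mbswidth(const char *s, width_fn width)
{
	mbstate_t mbs;
	wchar_t wc;
	size_t n;
	int w, total = 0;

	assert(s != NULL && width != NULL);
	memset(&mbs, 0, sizeof(mbs));
	while (*s != '\0') {
		n = mbrtowc(&wc, s, MB_CUR_MAX, &mbs);
		if (n == (size_t)-1 || n == (size_t)-2) {
			/*
			 * Invalid or incomplete sequence: count the byte as
			 * one column, reset the state and skip one byte.
			 */
			memset(&mbs, 0, sizeof(mbs));
			total++;
			s++;
			continue;
		}
		w = width(wc);
		/* Unprintable characters take one column, as in the diff. */
		total += (w < 0) ? 1 : w;
		s += n;
	}
	return total;
}
```

In real use the callback would be wcwidth() itself, once it returns -1 for unprintable characters; the sample callback only exists so the loop can be exercised without any locale setup.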
Re: ls(1) multibyte support
* Stefan Sperling s...@stsp.name [110104 23:12]:
> On Tue, Jan 04, 2011 at 09:14:51PM +0300, Alexander Polakov wrote:
> > Hi,
> > I wonder if there are any plans on adding multibyte support for ls(1)?
> > Or maybe there's a reason why it's not a great idea (which I am not
> > aware of)?
> > Anyway, here's a patch I have. It's based on DragonFlyBSD's ls.
>
> Any locale stuff added to applications that are used on the ramdisk
> (bsd.rd) must be inside #ifndef SMALL. The ls binary is linked statically
> so we need to prevent it from wasting space by pulling citrus stuff
> onto the ramdisk.

Sure.

> More importantly, there is an alleged bug in our wcwidth() implementation.
> I haven't had time to investigate, but it has been pointed out on
> separate occasions, by Jordi Beltran Creix and by n...@.
>
> Test program (from Jordi):
>
> #include <stdio.h>
> #include <locale.h>
> #include <wchar.h>	/* wcwidth() */
>
> int
> main(void)
> {
> 	setlocale(LC_ALL, "");
> 	printf("%d %d %d %d\n", wcwidth(0x53DA), wcwidth('A'),
> 	    wcwidth(0x200B), wcwidth(0x1F));
> 	return 0;
> }
>
> Output is 2, 1, 1, 0, should be 2, 1, 0, -1 (according to Jordi).
>
> We should make sure that wcwidth() is working properly before changing
> applications to use it. We also need a wcwidth() man page.

I think there are 2 separate bugs and I have 2 fixes (neither one tested).

1) wcwidth(0x200B)

This is from http://unicode.org/Public/UNIDATA/ :

200B;ZERO WIDTH SPACE;Cf;0;BN;N;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;N;
200D;ZERO WIDTH JOINER;Cf;0;BN;N;

--- share/locale/ctype/en_US.UTF-8.src.orig	Tue Jan  4 22:49:22 2011
+++ share/locale/ctype/en_US.UTF-8.src	Tue Jan  4 22:50:55 2011
@@ -1672,7 +1672,8 @@
 BLANK	0x2000 - 0x200b  0x202f  0x205f
 PRINT	0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
 PRINT	0x205f
-SWIDTH1	0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH1	0x2000 - 0x200c  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH0	0x200b - 0x200d
 SWIDTH1	0x205f

2) wcwidth(0x1f)

DragonFly's man page for wcwidth(3) says that the function returns -1 if
the character is not printable. _RUNETYPE_R is the flag to check.

--- lib/libc/locale/iswctype.c.orig	Tue Jan  4 23:12:23 2011
+++ lib/libc/locale/iswctype.c	Tue Jan  4 23:02:37 2011
@@ -170,7 +170,11 @@
 int
 wcwidth(wchar_t c)
 {
-	return (((unsigned)__runetype_w(c) & _CTYPE_SWM) >> _CTYPE_SWS);
+	_RuneType r;
+	r = __runetype_w(c);
+	if (r & _RUNETYPE_R)
+		return (((unsigned)r & _CTYPE_SWM) >> _CTYPE_SWS);
+	return -1;
 }
 
 wctrans_t

Again, I don't have hardware at hand to build libc so this is completely
untested.
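The second fix boils down to: look up the rune type once, return -1 unless the print flag is set, otherwise decode the width bits. A self-contained sketch of that logic follows. It assumes the post-patch flag layout discussed in this thread (non-zero zero-width flag, shift of 30) and a print flag value of 0x00040000; toy_runetype() is a hypothetical stand-in for __runetype_w() that knows only the four code points from Jordi's test, and all TOY_* names are invented for the example.

```c
#include <stdint.h>
#include <wchar.h>

/* Assumed flag layout, for illustration only. */
#define TOY_R	0x00040000U	/* print flag */
#define TOY_SWM	0xe0000000U	/* mask to get screen width data */
#define TOY_SWS	30		/* bits to shift to get width */
#define TOY_SW0	0x20000000U	/* 0 width character */
#define TOY_SW1	0x40000000U	/* 1 width character */
#define TOY_SW2	0x80000000U	/* 2 width character */

/* Hypothetical stand-in for the locale's rune type lookup. */
uint32_t
toy_runetype(wchar_t c)
{
	switch (c) {
	case 0x53DA:		/* CJK ideograph: printable, wide */
		return TOY_R | TOY_SW2;
	case L'A':		/* printable, narrow */
		return TOY_R | TOY_SW1;
	case 0x200B:		/* ZERO WIDTH SPACE: printable, zero width */
		return TOY_R | TOY_SW0;
	default:		/* e.g. 0x1F: no print flag at all */
		return 0;
	}
}

/* Same shape as the proposed wcwidth(): -1 for anything unprintable. */
int
toy_wcwidth(wchar_t c)
{
	uint32_t r = toy_runetype(c);

	if (r & TOY_R)
		return (int)((r & TOY_SWM) >> TOY_SWS);
	return -1;
}
```

Fed the four code points from the test program, this sketch gives the values Jordi says are expected, provided the locale data marks U+200B as zero width, which is what fix 1) is about.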
Re: ls(1) multibyte support
2011/1/5 Alexander Polakov polac...@gmail.com:
> 1) wcwidth(0x200B)
>
> This is from http://unicode.org/Public/UNIDATA/ :
>
> 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
> 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;N;
> 200D;ZERO WIDTH JOINER;Cf;0;BN;N;
>
> --- share/locale/ctype/en_US.UTF-8.src.orig	Tue Jan  4 22:49:22 2011
> +++ share/locale/ctype/en_US.UTF-8.src	Tue Jan  4 22:50:55 2011
> @@ -1672,7 +1672,8 @@
>  BLANK	0x2000 - 0x200b  0x202f  0x205f
>  PRINT	0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
>  PRINT	0x205f
> -SWIDTH1	0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
> +SWIDTH1	0x2000 - 0x200c  0x2010 - 0x2029  0x202f - 0x2052  0x2057
> +SWIDTH0	0x200b - 0x200d
>  SWIDTH1	0x205f

That only solves the test case. All combining characters (diacritical
marks), including 0x300, should be 0 width as well. The accepted
interpretation of the Unicode rules appears to be that the Cf, Me and Mn
categories, give or take a few characters, are to be 0-spaced; see the
comments in:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

That file also happens to be in xenocara/app/xterm/wcwidth.c, so that was
the behavior in xterm until (I assume) it started using the system
version.

The database file in OpenBSD is just too old; the same problem was fixed
in the FreeBSD file in 2006, see:
http://code.bsd64.org/cvsweb/freebsd/src/share/mklocale/UTF-8.src
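As a rough illustration of the table-driven approach described above (control characters give -1, combining and format characters give 0, wide characters give 2, everything else 1), here is a minimal sketch. The zero-width table is a tiny, incomplete sample built from ranges mentioned in this thread, the wide check covers only the CJK Unified Ideographs block, and all names are hypothetical; it is not a copy of any existing wcwidth.c.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
	uint32_t first;
	uint32_t last;
};

/* Incomplete sample of zero-width ranges; sorted and non-overlapping. */
static const struct sketch_range sketch_zero_width[] = {
	{ 0x0300, 0x034e },	/* Combining Diacritical Marks */
	{ 0x0350, 0x036f },
	{ 0x0483, 0x0486 },	/* Cyrillic combining marks */
	{ 0x0488, 0x0489 },
	{ 0x200b, 0x200d },	/* ZWSP, ZWNJ, ZWJ */
	{ 0x20d0, 0x20eb },	/* Combining marks for symbols */
	{ 0x3099, 0x309a },	/* Combining kana voicing marks */
	{ 0xfe20, 0xfe23 },	/* Combining half marks */
};

/* Binary search for c in a sorted table of inclusive ranges. */
static int
sketch_in_table(uint32_t c, const struct sketch_range *t, size_t n)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (c > t[mid].last)
			lo = mid + 1;
		else if (c < t[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* Column width of code point c, or -1 for control characters. */
int
sketch_wcwidth(uint32_t c)
{
	if (c == 0)
		return 0;
	if (c < 0x20 || (c >= 0x7f && c < 0xa0))
		return -1;	/* C0/C1 controls and DEL */
	if (sketch_in_table(c, sketch_zero_width,
	    sizeof(sketch_zero_width) / sizeof(sketch_zero_width[0])))
		return 0;
	if (c >= 0x4e00 && c <= 0x9fff)
		return 2;	/* CJK Unified Ideographs only, as a sample */
	return 1;
}
```

The point of the sketch is only the order of the checks and the sorted-range lookup; a usable table would have to be generated from the Unicode data files rather than written by hand.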