Re: fix wcwidth (was: Re: ls(1) multibyte support)

2011-04-04 Thread Christian Weisgerber
Stefan Sperling s...@openbsd.org wrote:

 Change number 2 is the only one that hasn't been committed yet.
 According to
 http://pubs.opengroup.org/onlinepubs/009695399/functions/wcwidth.html
 wcwidth() should return -1 for non-printable characters.
 So this change looks good to me. Anyone want to ok it?

FWIW, this fixes a bash prompt problem.

(When you have a bash prompt that contains unprintable characters,
properly bracketed by \[ \], and are in a UTF-8 locale, bash currently
becomes confused about the width of the prompt.)

-- 
Christian naddy Weisgerber  na...@mips.inka.de



Re: Update UTF-8 locale ctype data (was: Re: ls(1) multibyte support)

2011-03-04 Thread Stefan Sperling
On Sat, Jan 15, 2011 at 12:44:51AM +0100, Stefan Sperling wrote:
 On Fri, Jan 14, 2011 at 05:21:46PM +0100, Stefan Sperling wrote:
  On Thu, Jan 06, 2011 at 07:52:19PM +0300, Alexander Polakov wrote:
   * Alexander Polakov polac...@gmail.com [110105 17:20]:
Hi,

here's an updated version.

1) en_US.UTF-8.src updates from FreeBSD
  
  Let's start with those.
  
  These changes are all fine, I checked them against Unicode 5.2.
  http://www.unicode.org/Public/5.2.0/charts/CodeCharts-noHan.pdf
  
  The diff below (from Alexander) brings us up to par with FreeBSD.
  Many updates could be made to this file to support additional
  characters listed in Unicode 5.2.0 (or even 6.0.0).
  But that can be done later.
  
  Can someone ok this? Thanks in advance.
 
 Before the ctype changes can go in, we'll need this part from
 Alexander's diff to fix mklocale (caught by nicm@, thanks!)

Can this go in now?
Any OKs?

Index: lib/libc/locale/runetype.h
===
RCS file: /cvs/src/lib/libc/locale/runetype.h,v
retrieving revision 1.5
diff -u -p -r1.5 runetype.h
--- lib/libc/locale/runetype.h  8 Oct 2007 08:17:15 -   1.5
+++ lib/libc/locale/runetype.h  14 Jan 2011 23:34:28 -
@@ -69,9 +69,9 @@ typedef uint32_t _RuneType;
 #define	_RUNETYPE_I	0x00080000U	/* Ideogram */
 #define	_RUNETYPE_T	0x00100000U	/* Special */
 #define	_RUNETYPE_Q	0x00200000U	/* Phonogram */
-#define	_RUNETYPE_SWM	0xc0000000U/* Mask to get screen width data */
+#define	_RUNETYPE_SWM	0xe0000000U	/* Mask to get screen width data */
 #define	_RUNETYPE_SWS	30		/* Bits to shift to get width */
-#define	_RUNETYPE_SW0	0x00000000U	/* 0 width character */
+#define	_RUNETYPE_SW0	0x20000000U	/* 0 width character */
 #define	_RUNETYPE_SW1	0x40000000U	/* 1 width character */
 #define	_RUNETYPE_SW2	0x80000000U	/* 2 width character */
 #define	_RUNETYPE_SW3	0xc0000000U	/* 3 width character */
Index: share/locale/ctype/en_US.UTF-8.src
===
RCS file: /cvs/src/share/locale/ctype/en_US.UTF-8.src,v
retrieving revision 1.1
diff -u -p -r1.1 en_US.UTF-8.src
--- share/locale/ctype/en_US.UTF-8.src  7 Aug 2005 10:03:45 -   1.1
+++ share/locale/ctype/en_US.UTF-8.src  15 Jan 2011 15:49:26 -
@@ -491,9 +491,9 @@ SWIDTH1   0x02b0 - 0x02ee
  * U+0300 - U+036F : Combining Diacritical Marks
  */
 
-GRAPH 0x0300 - 0x034f  0x0360 - 0x036f
-PRINT 0x0300 - 0x034f  0x0360 - 0x036f
-SWIDTH1   0x0300 - 0x034f  0x0360 - 0x036f
+GRAPH 0x0300 - 0x034e  0x0350 - 0x036f
+PRINT 0x0300 - 0x034e  0x0350 - 0x036f
+SWIDTH0   0x0300 - 0x034e  0x0350 - 0x036f
 
 MAPUPPER   0x0345 0x0399 
 
@@ -583,7 +583,7 @@ LOWER 0x04b9  0x04bb  0x04bd  0x04bf
 LOWER 0x04c8  0x04ca  0x04cc  0x04ce  0x04d1  0x04d3  0x04d5
 LOWER 0x04d7  0x04d9  0x04db  0x04dd  0x04df  0x04e1  0x04e3
 LOWER 0x04e5  0x04e7  0x04e9  0x04eb  0x04ed  0x04ef  0x04f1
-LOWER 0x04f3  0x04f5  0x04f9
+LOWER 0x04f3  0x04f5  0x04f7  0x04f9
 PUNCT 0x0482
 UPPER 0x0400 - 0x042f  0x0460  0x0462  0x0464  0x0466  0x0468
 UPPER 0x046a  0x046c  0x046e  0x0470  0x0472  0x0474  0x0476
@@ -595,9 +595,10 @@ UPPER 0x04b8  0x04ba  0x04bc  0x04be
 UPPER 0x04c5  0x04c7  0x04c9  0x04cb  0x04cd  0x04d0  0x04d2
 UPPER 0x04d4  0x04d6  0x04d8  0x04da  0x04dc  0x04de  0x04e0
 UPPER 0x04e2  0x04e4  0x04e6  0x04e8  0x04ea  0x04ec  0x04ee
-UPPER 0x04f0  0x04f2  0x04f4  0x04f8
-PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
-SWIDTH1   0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
+UPPER 0x04f0  0x04f2  0x04f4  0x04f6  0x04f8
+PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f9
+SWIDTH0   0x0483 - 0x0486  0x0488 - 0x0489
+SWIDTH1   0x0400 - 0x0482  0x048a - 0x04ce  0x04d0 - 0x04f9
 
 MAPUPPER   0x0430 - 0x044f : 0x0410 
 MAPUPPER   0x0450 - 0x045f : 0x0400 
@@ -671,6 +672,7 @@ MAPUPPER   0x04ef 0x04ee 
 MAPUPPER   0x04f1 0x04f0 
 MAPUPPER   0x04f3 0x04f2 
 MAPUPPER   0x04f5 0x04f4 
+MAPUPPER   0x04f7 0x04f6 
 MAPUPPER   0x04f9 0x04f8 
 MAPLOWER   0x0400 - 0x040f : 0x0450 
 MAPLOWER   0x0410 - 0x042f : 0x0430 
@@ -744,6 +746,7 @@ MAPLOWER   0x04ee 0x04ef 
 MAPLOWER   0x04f0 0x04f1 
 MAPLOWER   0x04f2 0x04f3 
 MAPLOWER   0x04f4 0x04f5 
+MAPLOWER   0x04f6 0x04f7 
 MAPLOWER   0x04f8 0x04f9 
 
 
@@ -1052,7 +1055,8 @@ DIGIT 0x0e50 - 0x0e59
 GRAPH 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
 PUNCT 0x0e3f  0x0e4f  0x0e5a  0x0e5b
 PRINT 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
-SWIDTH1   0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
+SWIDTH0   0x0e31   0x0e34 - 0x0e3a  0x0e47 - 0x0e4e
+SWIDTH1   0x0e01 - 0x0e30  0x0e32 - 0x0e33  0x0e3f - 0x0e46  0x0e4f - 0x0e5b
 
 TODIGIT0x0e50 - 0x0e59 : 0x 
 
@@ -1283,6 +1287,14 @@ SWIDTH1   0x1800 - 0x180d  

Update UTF-8 locale ctype data (was: Re: ls(1) multibyte support)

2011-01-14 Thread Stefan Sperling
On Thu, Jan 06, 2011 at 07:52:19PM +0300, Alexander Polakov wrote:
 * Alexander Polakov polac...@gmail.com [110105 17:20]:
  Hi,
  
  here's an updated version.
  
  1) en_US.UTF-8.src updates from FreeBSD

Let's start with those.

These changes are all fine, I checked them against Unicode 5.2.
http://www.unicode.org/Public/5.2.0/charts/CodeCharts-noHan.pdf

The diff below (from Alexander) brings us up to par with FreeBSD.
Many updates could be made to this file to support additional
characters listed in Unicode 5.2.0 (or even 6.0.0).
But that can be done later.

Can someone ok this? Thanks in advance.

Index: share/locale/ctype/en_US.UTF-8.src
===
RCS file: /OpenBSD/src/share/locale/ctype/en_US.UTF-8.src,v
retrieving revision 1.1
diff -u -r1.1 en_US.UTF-8.src
--- share/locale/ctype/en_US.UTF-8.src  7 Aug 2005 10:03:45 -   1.1
+++ share/locale/ctype/en_US.UTF-8.src  6 Jan 2011 16:24:39 -
@@ -491,9 +491,9 @@
  * U+0300 - U+036F : Combining Diacritical Marks
  */
 
-GRAPH 0x0300 - 0x034f  0x0360 - 0x036f
-PRINT 0x0300 - 0x034f  0x0360 - 0x036f
-SWIDTH1   0x0300 - 0x034f  0x0360 - 0x036f
+GRAPH 0x0300 - 0x034e  0x0350 - 0x036f
+PRINT 0x0300 - 0x034e  0x0350 - 0x036f
+SWIDTH0   0x0300 - 0x034e  0x0350 - 0x036f
 
 MAPUPPER   0x0345 0x0399 
 
@@ -583,7 +583,7 @@
 LOWER 0x04c8  0x04ca  0x04cc  0x04ce  0x04d1  0x04d3  0x04d5
 LOWER 0x04d7  0x04d9  0x04db  0x04dd  0x04df  0x04e1  0x04e3
 LOWER 0x04e5  0x04e7  0x04e9  0x04eb  0x04ed  0x04ef  0x04f1
-LOWER 0x04f3  0x04f5  0x04f9
+LOWER 0x04f3  0x04f5  0x04f7  0x04f9
 PUNCT 0x0482
 UPPER 0x0400 - 0x042f  0x0460  0x0462  0x0464  0x0466  0x0468
 UPPER 0x046a  0x046c  0x046e  0x0470  0x0472  0x0474  0x0476
@@ -595,9 +595,10 @@
 UPPER 0x04c5  0x04c7  0x04c9  0x04cb  0x04cd  0x04d0  0x04d2
 UPPER 0x04d4  0x04d6  0x04d8  0x04da  0x04dc  0x04de  0x04e0
 UPPER 0x04e2  0x04e4  0x04e6  0x04e8  0x04ea  0x04ec  0x04ee
-UPPER 0x04f0  0x04f2  0x04f4  0x04f8
-PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
-SWIDTH1   0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
+UPPER 0x04f0  0x04f2  0x04f4  0x04f6  0x04f8
+PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f9
+SWIDTH0   0x0483 - 0x0486  0x0488 - 0x0489
+SWIDTH1   0x0400 - 0x0482  0x048a - 0x04ce  0x04d0 - 0x04f9
 
 MAPUPPER   0x0430 - 0x044f : 0x0410 
 MAPUPPER   0x0450 - 0x045f : 0x0400 
@@ -671,6 +672,7 @@
 MAPUPPER   0x04f1 0x04f0 
 MAPUPPER   0x04f3 0x04f2 
 MAPUPPER   0x04f5 0x04f4 
+MAPUPPER   0x04f7 0x04f6 
 MAPUPPER   0x04f9 0x04f8 
 MAPLOWER   0x0400 - 0x040f : 0x0450 
 MAPLOWER   0x0410 - 0x042f : 0x0430 
@@ -744,6 +746,7 @@
 MAPLOWER   0x04f0 0x04f1 
 MAPLOWER   0x04f2 0x04f3 
 MAPLOWER   0x04f4 0x04f5 
+MAPLOWER   0x04f6 0x04f7 
 MAPLOWER   0x04f8 0x04f9 
 
 
@@ -1052,7 +1055,8 @@
 GRAPH 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
 PUNCT 0x0e3f  0x0e4f  0x0e5a  0x0e5b
 PRINT 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
-SWIDTH1   0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
+SWIDTH0   0x0e31   0x0e34 - 0x0e3a  0x0e47 - 0x0e4e
+SWIDTH1   0x0e01 - 0x0e30  0x0e32 - 0x0e33  0x0e3f - 0x0e46  0x0e4f - 0x0e5b
 
 TODIGIT0x0e50 - 0x0e59 : 0x 
 
@@ -1283,6 +1287,14 @@
 
 TODIGIT0x1810 - 0x1819 : 0x 
 
+/*
+ * U+1DC0 - U+1DFF : Combining Diacritical Marks Supplement
+ */
+
+GRAPH 0x1DC0 - 0x1DC3
+PRINT 0x1DC0 - 0x1DC3
+SWIDTH0   0x1DC0 - 0x1DC3
+
 
 /*
  * U+1E00 - U+1EFF : Latin Extended Additional
@@ -1672,7 +1684,8 @@
 BLANK 0x2000 - 0x200b  0x202f  0x205f
 PRINT 0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
 PRINT 0x205f
-SWIDTH1   0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH1   0x2000 - 0x200a  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH0   0x200b - 0x200d
 SWIDTH1   0x205f
 
 
@@ -1707,9 +1720,9 @@
  * U+20D0 - U+20FF : Combining Diacritical Marks for Symbols
  */
 
-GRAPH 0x20d0 - 0x20ea
-PRINT 0x20d0 - 0x20ea
-SWIDTH1   0x20d0 - 0x20ea
+GRAPH 0x20d0 - 0x20eb
+PRINT 0x20d0 - 0x20eb
+SWIDTH0   0x20d0 - 0x20eb
 
 
 /*
@@ -1987,7 +2000,8 @@
 PUNCT 0x309b  0x309c
 PRINT 0x3041 - 0x3096  0x3099 - 0x309f
 PHONOGRAM 0x3041 - 0x3096  0x309f
-SWIDTH2   0x3041 - 0x3096  0x3099 - 0x309f
+SWIDTH0   0x3099 - 0x309a
+SWIDTH2   0x3041 - 0x3096  0x309b - 0x309f
 
 
 /*
@@ -2211,7 +2225,7 @@
 
 GRAPH 0xfe20 - 0xfe23
 PRINT 0xfe20 - 0xfe23
-SWIDTH1   0xfe20 - 0xfe23
+SWIDTH0   0xfe20 - 0xfe23
 
 
 /*
@@ -2333,8 +2347,13 @@
 GRAPH 0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
 PUNCT 0x1d100 - 0x1d126  0x1d12a - 0x1d164  0x1d16a - 0x1d16c
 PUNCT 0x1d183  0x1d184  0x1d18c - 0x1d1a9  0x1d1ae - 0x1d1dd
-PRINT 0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
-SWIDTH1   0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
+PRINT 0x1d100 - 0x1d126  0x1d12a - 0x1d158  0x1d15a - 0x1d172
+PRINT 0x1d17b - 0x1d1dd
+SWIDTH0   0x1d165 

Re: Update UTF-8 locale ctype data (was: Re: ls(1) multibyte support)

2011-01-14 Thread Stefan Sperling
On Fri, Jan 14, 2011 at 05:21:46PM +0100, Stefan Sperling wrote:
 On Thu, Jan 06, 2011 at 07:52:19PM +0300, Alexander Polakov wrote:
  * Alexander Polakov polac...@gmail.com [110105 17:20]:
   Hi,
   
   here's an updated version.
   
   1) en_US.UTF-8.src updates from FreeBSD
 
 Let's start with those.
 
 These changes are all fine, I checked them against Unicode 5.2.
 http://www.unicode.org/Public/5.2.0/charts/CodeCharts-noHan.pdf
 
 The diff below (from Alexander) brings us up to par with FreeBSD.
 Many updates could be made to this file to support additional
 characters listed in Unicode 5.2.0 (or even 6.0.0).
 But that can be done later.
 
 Can someone ok this? Thanks in advance.

Before the ctype changes can go in, we'll need this part from
Alexander's diff to fix mklocale (caught by nicm@, thanks!)

These symbols are internal to libc, with exception of mklocale.
Can this go in during ABI lock?

Index: lib/libc/locale/runetype.h
===
RCS file: /OpenBSD/src/lib/libc/locale/runetype.h,v
retrieving revision 1.5
diff -u -r1.5 runetype.h
--- lib/libc/locale/runetype.h  8 Oct 2007 08:17:15 -   1.5
+++ lib/libc/locale/runetype.h  6 Jan 2011 16:24:20 -
@@ -69,9 +69,9 @@
 #define	_RUNETYPE_I	0x00080000U	/* Ideogram */
 #define	_RUNETYPE_T	0x00100000U	/* Special */
 #define	_RUNETYPE_Q	0x00200000U	/* Phonogram */
-#define	_RUNETYPE_SWM	0xc0000000U/* Mask to get screen width data */
+#define	_RUNETYPE_SWM	0xe0000000U	/* Mask to get screen width data */
 #define	_RUNETYPE_SWS	30		/* Bits to shift to get width */
-#define	_RUNETYPE_SW0	0x00000000U	/* 0 width character */
+#define	_RUNETYPE_SW0	0x20000000U	/* 0 width character */
 #define	_RUNETYPE_SW1	0x40000000U	/* 1 width character */
 #define	_RUNETYPE_SW2	0x80000000U	/* 2 width character */
 #define	_RUNETYPE_SW3	0xc0000000U	/* 3 width character */



Re: ls(1) multibyte support

2011-01-06 Thread Alexander Polakov
* Alexander Polakov polac...@gmail.com [110105 17:20]:
 Hi,
 
 here's an updated version.
 
 1) en_US.UTF-8.src updates from FreeBSD
 2) wcwidth() changed to use the same code as iswprint()
    * maybe just use iswprint() itself?
 3) _RUNETYPE_SW0 changed to be !0 (and match FreeBSD). 0 value is used in
    mklocale to perform additional checks required for MAPLOWER and
    MAPUPPER, but not SWIDTHx.

 4) _RUNETYPE_SWM changed to make (r & _RUNETYPE_SWM) == _RUNETYPE_SW0
    work
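
The effect of 3) and 4) can be sketched in isolation. This is only an illustration, not the libc code: the names are shortened stand-ins, the 32-bit values assume the FreeBSD-style layout that the runetype.h change moves to, and RT_PRINT is a hypothetical bit standing in for _RUNETYPE_R.

```c
#include <stdint.h>

/* Illustrative stand-ins, assumed to mirror the patched runetype.h. */
#define SW_MASK		0xe0000000U	/* three bits: SW0 differs from "unset" */
#define SW_SHIFT	30
#define SW0		0x20000000U
#define SW1		0x40000000U
#define SW2		0x80000000U
#define SW3		0xc0000000U
#define RT_PRINT	0x00040000U	/* hypothetical "printable" bit */

/* Non-zero if the entry carries an explicit SWIDTHx class, SWIDTH0 included. */
int
has_width_class(uint32_t r)
{
	return (r & SW_MASK) != 0;
}

/* Width in columns, or -1 for an entry that is not printable. */
int
width_of(uint32_t r)
{
	if ((r & RT_PRINT) == 0)
		return -1;
	/* SW0 sits below bit 30, so it shifts out and yields 0. */
	return (int)((r & SW_MASK) >> SW_SHIFT);
}
```

With an all-zero SW0, has_width_class() could not tell an explicit SWIDTH0 from an entry with no width at all, which appears to be the distinction mklocale needs.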


Index: lib/libc/locale/iswctype.c
===
RCS file: /OpenBSD/src/lib/libc/locale/iswctype.c,v
retrieving revision 1.1
diff -u -r1.1 iswctype.c
--- lib/libc/locale/iswctype.c  7 Aug 2005 10:16:23 -   1.1
+++ lib/libc/locale/iswctype.c  6 Jan 2011 16:24:20 -
@@ -170,7 +170,9 @@
 int
 wcwidth(wchar_t c)
 {
-	return (((unsigned)__runetype_w(c) & _CTYPE_SWM) >> _CTYPE_SWS);
+	if (__isctype_w((c), _CTYPE_R))
+		return (((unsigned)__runetype_w(c) & _CTYPE_SWM) >> _CTYPE_SWS);
+	return -1;
 }
 
 wctrans_t
Index: lib/libc/locale/runetype.h
===
RCS file: /OpenBSD/src/lib/libc/locale/runetype.h,v
retrieving revision 1.5
diff -u -r1.5 runetype.h
--- lib/libc/locale/runetype.h  8 Oct 2007 08:17:15 -   1.5
+++ lib/libc/locale/runetype.h  6 Jan 2011 16:24:20 -
@@ -69,9 +69,9 @@
 #define	_RUNETYPE_I	0x00080000U	/* Ideogram */
 #define	_RUNETYPE_T	0x00100000U	/* Special */
 #define	_RUNETYPE_Q	0x00200000U	/* Phonogram */
-#define	_RUNETYPE_SWM	0xc0000000U/* Mask to get screen width data */
+#define	_RUNETYPE_SWM	0xe0000000U	/* Mask to get screen width data */
 #define	_RUNETYPE_SWS	30		/* Bits to shift to get width */
-#define	_RUNETYPE_SW0	0x00000000U	/* 0 width character */
+#define	_RUNETYPE_SW0	0x20000000U	/* 0 width character */
 #define	_RUNETYPE_SW1	0x40000000U	/* 1 width character */
 #define	_RUNETYPE_SW2	0x80000000U	/* 2 width character */
 #define	_RUNETYPE_SW3	0xc0000000U	/* 3 width character */
Index: share/locale/ctype/en_US.UTF-8.src
===
RCS file: /OpenBSD/src/share/locale/ctype/en_US.UTF-8.src,v
retrieving revision 1.1
diff -u -r1.1 en_US.UTF-8.src
--- share/locale/ctype/en_US.UTF-8.src  7 Aug 2005 10:03:45 -   1.1
+++ share/locale/ctype/en_US.UTF-8.src  6 Jan 2011 16:24:39 -
@@ -491,9 +491,9 @@
  * U+0300 - U+036F : Combining Diacritical Marks
  */
 
-GRAPH 0x0300 - 0x034f  0x0360 - 0x036f
-PRINT 0x0300 - 0x034f  0x0360 - 0x036f
-SWIDTH1   0x0300 - 0x034f  0x0360 - 0x036f
+GRAPH 0x0300 - 0x034e  0x0350 - 0x036f
+PRINT 0x0300 - 0x034e  0x0350 - 0x036f
+SWIDTH0   0x0300 - 0x034e  0x0350 - 0x036f
 
 MAPUPPER   0x0345 0x0399 
 
@@ -583,7 +583,7 @@
 LOWER 0x04c8  0x04ca  0x04cc  0x04ce  0x04d1  0x04d3  0x04d5
 LOWER 0x04d7  0x04d9  0x04db  0x04dd  0x04df  0x04e1  0x04e3
 LOWER 0x04e5  0x04e7  0x04e9  0x04eb  0x04ed  0x04ef  0x04f1
-LOWER 0x04f3  0x04f5  0x04f9
+LOWER 0x04f3  0x04f5  0x04f7  0x04f9
 PUNCT 0x0482
 UPPER 0x0400 - 0x042f  0x0460  0x0462  0x0464  0x0466  0x0468
 UPPER 0x046a  0x046c  0x046e  0x0470  0x0472  0x0474  0x0476
@@ -595,9 +595,10 @@
 UPPER 0x04c5  0x04c7  0x04c9  0x04cb  0x04cd  0x04d0  0x04d2
 UPPER 0x04d4  0x04d6  0x04d8  0x04da  0x04dc  0x04de  0x04e0
 UPPER 0x04e2  0x04e4  0x04e6  0x04e8  0x04ea  0x04ec  0x04ee
-UPPER 0x04f0  0x04f2  0x04f4  0x04f8
-PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
-SWIDTH1   0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
+UPPER 0x04f0  0x04f2  0x04f4  0x04f6  0x04f8
+PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f9
+SWIDTH0   0x0483 - 0x0486  0x0488 - 0x0489
+SWIDTH1   0x0400 - 0x0482  0x048a - 0x04ce  0x04d0 - 0x04f9
 
 MAPUPPER   0x0430 - 0x044f : 0x0410 
 MAPUPPER   0x0450 - 0x045f : 0x0400 
@@ -671,6 +672,7 @@
 MAPUPPER   0x04f1 0x04f0 
 MAPUPPER   0x04f3 0x04f2 
 MAPUPPER   0x04f5 0x04f4 
+MAPUPPER   0x04f7 0x04f6 
 MAPUPPER   0x04f9 0x04f8 
 MAPLOWER   0x0400 - 0x040f : 0x0450 
 MAPLOWER   0x0410 - 0x042f : 0x0430 
@@ -744,6 +746,7 @@
 MAPLOWER   0x04f0 0x04f1 
 MAPLOWER   0x04f2 0x04f3 
 MAPLOWER   0x04f4 0x04f5 
+MAPLOWER   0x04f6 0x04f7 
 MAPLOWER   0x04f8 0x04f9 
 
 
@@ -1052,7 +1055,8 @@
 GRAPH 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
 PUNCT 0x0e3f  0x0e4f  0x0e5a  0x0e5b
 PRINT 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
-SWIDTH1   0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
+SWIDTH0   0x0e31   0x0e34 - 0x0e3a  0x0e47 - 0x0e4e
+SWIDTH1   0x0e01 - 0x0e30  0x0e32 - 0x0e33  0x0e3f - 0x0e46  0x0e4f - 0x0e5b
 
 TODIGIT0x0e50 - 0x0e59 : 0x 
 
@@ -1283,6 +1287,14 @@
 
 TODIGIT0x1810 - 0x1819 : 0x 
 
+/*
+ * U+1DC0 - U+1DFF : Combining Diacritical Marks 

Re: ls(1) multibyte support

2011-01-05 Thread Alexander Polakov
Hi,

here's an updated version.

1) en_US.UTF-8.src updates from FreeBSD
2) wcwidth() changed to use the same code as iswprint()
   * maybe just use iswprint() itself?
3) _RUNETYPE_SW0 changed to be !0 (and match FreeBSD). 0 value is used in
   mklocale to perform additional checks required for MAPLOWER and
   MAPUPPER, but not SWIDTHx.

--- share/locale/ctype/en_US.UTF-8.src  Wed Jan  5 12:37:22 2011
+++ share/locale/ctype/en_US.UTF-8.src  Wed Jan  5 09:47:56 2011
@@ -491,9 +491,9 @@
  * U+0300 - U+036F : Combining Diacritical Marks
  */
 
-GRAPH 0x0300 - 0x034f  0x0360 - 0x036f
-PRINT 0x0300 - 0x034f  0x0360 - 0x036f
-SWIDTH1   0x0300 - 0x034f  0x0360 - 0x036f
+GRAPH 0x0300 - 0x034e  0x0350 - 0x036f
+PRINT 0x0300 - 0x034e  0x0350 - 0x036f
+SWIDTH0   0x0300 - 0x034e  0x0350 - 0x036f
 
 MAPUPPER   0x0345 0x0399 
 
@@ -583,7 +583,7 @@
 LOWER 0x04c8  0x04ca  0x04cc  0x04ce  0x04d1  0x04d3  0x04d5
 LOWER 0x04d7  0x04d9  0x04db  0x04dd  0x04df  0x04e1  0x04e3
 LOWER 0x04e5  0x04e7  0x04e9  0x04eb  0x04ed  0x04ef  0x04f1
-LOWER 0x04f3  0x04f5  0x04f9
+LOWER 0x04f3  0x04f5  0x04f7  0x04f9
 PUNCT 0x0482
 UPPER 0x0400 - 0x042f  0x0460  0x0462  0x0464  0x0466  0x0468
 UPPER 0x046a  0x046c  0x046e  0x0470  0x0472  0x0474  0x0476
@@ -595,9 +595,10 @@
 UPPER 0x04c5  0x04c7  0x04c9  0x04cb  0x04cd  0x04d0  0x04d2
 UPPER 0x04d4  0x04d6  0x04d8  0x04da  0x04dc  0x04de  0x04e0
 UPPER 0x04e2  0x04e4  0x04e6  0x04e8  0x04ea  0x04ec  0x04ee
-UPPER 0x04f0  0x04f2  0x04f4  0x04f8
-PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
-SWIDTH1   0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
+UPPER 0x04f0  0x04f2  0x04f4  0x04f6  0x04f8
+PRINT 0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f9
+SWIDTH0   0x0483 - 0x0486  0x0488 - 0x0489
+SWIDTH1   0x0400 - 0x0482  0x048a - 0x04ce  0x04d0 - 0x04f9
 
 MAPUPPER   0x0430 - 0x044f : 0x0410 
 MAPUPPER   0x0450 - 0x045f : 0x0400 
@@ -671,6 +672,7 @@
 MAPUPPER   0x04f1 0x04f0 
 MAPUPPER   0x04f3 0x04f2 
 MAPUPPER   0x04f5 0x04f4 
+MAPUPPER   0x04f7 0x04f6 
 MAPUPPER   0x04f9 0x04f8 
 MAPLOWER   0x0400 - 0x040f : 0x0450 
 MAPLOWER   0x0410 - 0x042f : 0x0430 
@@ -744,6 +746,7 @@
 MAPLOWER   0x04f0 0x04f1 
 MAPLOWER   0x04f2 0x04f3 
 MAPLOWER   0x04f4 0x04f5 
+MAPLOWER   0x04f6 0x04f7 
 MAPLOWER   0x04f8 0x04f9 
 
 
@@ -1052,7 +1055,8 @@
 GRAPH 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
 PUNCT 0x0e3f  0x0e4f  0x0e5a  0x0e5b
 PRINT 0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
-SWIDTH1   0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
+SWIDTH0   0x0e31   0x0e34 - 0x0e3a  0x0e47 - 0x0e4e
+SWIDTH1   0x0e01 - 0x0e30  0x0e32 - 0x0e33  0x0e3f - 0x0e46  0x0e4f - 0x0e5b
 
 TODIGIT0x0e50 - 0x0e59 : 0x 
 
@@ -1283,7 +1287,15 @@
 
 TODIGIT0x1810 - 0x1819 : 0x 
 
+/*
+ * U+1DC0 - U+1DFF : Combining Diacritical Marks Supplement
+ */
 
+GRAPH 0x1DC0 - 0x1DC3
+PRINT 0x1DC0 - 0x1DC3
+SWIDTH0   0x1DC0 - 0x1DC3
+
+
 /*
  * U+1E00 - U+1EFF : Latin Extended Additional
  */
@@ -1672,7 +1684,8 @@
 BLANK 0x2000 - 0x200b  0x202f  0x205f
 PRINT 0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
 PRINT 0x205f
-SWIDTH1   0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH1   0x2000 - 0x200a  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH0   0x200b - 0x200d
 SWIDTH1   0x205f
 
 
@@ -1707,9 +1720,9 @@
  * U+20D0 - U+20FF : Combining Diacritical Marks for Symbols
  */
 
-GRAPH 0x20d0 - 0x20ea
-PRINT 0x20d0 - 0x20ea
-SWIDTH1   0x20d0 - 0x20ea
+GRAPH 0x20d0 - 0x20eb
+PRINT 0x20d0 - 0x20eb
+SWIDTH0   0x20d0 - 0x20eb
 
 
 /*
@@ -1987,7 +2000,8 @@
 PUNCT 0x309b  0x309c
 PRINT 0x3041 - 0x3096  0x3099 - 0x309f
 PHONOGRAM 0x3041 - 0x3096  0x309f
-SWIDTH2   0x3041 - 0x3096  0x3099 - 0x309f
+SWIDTH0   0x3099 - 0x309a
+SWIDTH2   0x3041 - 0x3096  0x309b - 0x309f
 
 
 /*
@@ -2211,7 +2225,7 @@
 
 GRAPH 0xfe20 - 0xfe23
 PRINT 0xfe20 - 0xfe23
-SWIDTH1   0xfe20 - 0xfe23
+SWIDTH0   0xfe20 - 0xfe23
 
 
 /*
@@ -2333,8 +2347,13 @@
 GRAPH 0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
 PUNCT 0x1d100 - 0x1d126  0x1d12a - 0x1d164  0x1d16a - 0x1d16c
 PUNCT 0x1d183  0x1d184  0x1d18c - 0x1d1a9  0x1d1ae - 0x1d1dd
-PRINT 0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
-SWIDTH1   0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
+PRINT 0x1d100 - 0x1d126  0x1d12a - 0x1d158  0x1d15a - 0x1d172
+PRINT 0x1d17b - 0x1d1dd
+SWIDTH0   0x1d165 - 0x1d169  0x1d16d - 0x1d172  0x1d17b - 0x1d182
+SWIDTH0   0x1d185 - 0x1d18b  0x1d1aa - 0x1d1ad
+SWIDTH1   0x1d100 - 0x1d126  0x1d12a - 0x1d158  0x1d15a - 0x1d164
+SWIDTH1   0x1d16a - 0x1d16c  0x1d183   0x1d184  0x1d18c - 0x1d1a9
+SWIDTH1   0x1d1ae - 0x1d1dd
 
 
 /*
--- lib/libc/locale/iswctype.c.orig Tue Jan  4 23:12:23 2011
+++ lib/libc/locale/iswctype.c  Wed Jan  5 10:02:36 2011
@@ -170,7 +170,9 @@
 int
 wcwidth(wchar_t c)
 {
-	return (((unsigned)__runetype_w(c) & _CTYPE_SWM) >> 

Re: ls(1) multibyte support

2011-01-04 Thread Stefan Sperling
On Tue, Jan 04, 2011 at 09:14:51PM +0300, Alexander Polakov wrote:
 Hi,
 
 I wonder if there are any plans to add multibyte support to ls(1)?
 Or maybe there's a reason why it's not a great idea (which I am not
 aware of)?
 Anyway, here's a patch I have. It's based on DragonFlyBSD's ls.
 

Any locale stuff added to applications that are used on the ramdisk
(bsd.rd) must be inside #ifndef SMALL.
The ls binary is linked statically so we need to prevent it from wasting
space by pulling citrus stuff onto the ramdisk.

More importantly, there is an alleged bug in our wcwidth() implementation.
I haven't had time to investigate, but it has been pointed out on separate
occasions, by Jordi Beltran Creix and by n...@.
Test program (from Jordi):

  #include <stdio.h>
  #include <locale.h>
  #include <wchar.h>

  int
  main(void)
  {
        setlocale(LC_ALL, "");
        printf("%d %d %d %d\n", wcwidth(0x53DA), wcwidth('A'),
            wcwidth(0x200B), wcwidth(0x1F));
        return 0;
  }
  
Output is 2, 1, 1, 0; it should be 2, 1, 0, -1 (according to Jordi).

We should make sure that wcwidth() is working properly before changing
applications to use it. We also need a wcwidth() man page.
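
One way to keep an eye on this is a small table-driven check. The sketch below is only an illustration: struct width_case and count_width_mismatches() are made-up names, and the expected values are whatever the caller believes to be correct for the locale in use.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* some libcs hide wcwidth() without it */
#endif
#include <stddef.h>
#include <wchar.h>

int wcwidth(wchar_t);	/* POSIX prototype; harmless if already declared */

/* One code point and the width the caller expects for it. */
struct width_case {
	wchar_t	wc;
	int	expected;
};

/* Return how many cases disagree with the libc wcwidth(). */
size_t
count_width_mismatches(const struct width_case *cases, size_t n)
{
	size_t i, bad = 0;

	for (i = 0; i < n; i++)
		if (wcwidth(cases[i].wc) != cases[i].expected)
			bad++;
	return bad;
}
```

Feeding it the four code points from the test program with the expected values 2, 1, 0 and -1, after setlocale(LC_ALL, "") in a UTF-8 locale, should give 0 once wcwidth() behaves.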

FWIW, below is a diff that Jordi sent me some time ago to fix ls(1).
It also depends on wcwidth().

Index: ls.c
===
RCS file: /cvs/src/bin/ls/ls.c,v
retrieving revision 1.35
diff -u -p -r1.35 ls.c
--- ls.c27 Oct 2009 23:59:21 -  1.35
+++ ls.c7 Aug 2010 09:16:03 -
@@ -48,6 +48,8 @@
 #include <string.h>
 #include <unistd.h>
 #include <util.h>
+#include <locale.h>
+#include <wchar.h>
 
 #include "ls.h"
 #include "extern.h"
@@ -102,6 +104,10 @@ ls_main(int argc, char *argv[])
 	int kflag = 0;
 	char *p;
 
+#ifndef SMALL
+	setlocale(LC_ALL, "");
+
+#endif
 	/* Terminal defaults to -Cq, non-terminal defaults to -1. */
 	if (isatty(STDOUT_FILENO)) {
 		if ((p = getenv("COLUMNS")) != NULL)
@@ -396,6 +402,32 @@ traverse(int argc, char *argv[], int opt
 		err(1, "fts_read");
 }
 
+#ifndef SMALL
+static int
+mbswidth(const char *s)
+{
+	wchar_t wc;
+	size_t wclen;
+	mbstate_t mbs;
+	int width = 0;
+
+	bzero(&mbs, sizeof(mbs));
+
+	while (*s) {
+		wclen = mbrtowc(&wc, s, MB_CUR_MAX, &mbs);
+		if (wclen == (size_t)-1 || wclen == (size_t)-2 || !iswprint(wc)) {
+			if (wclen == (size_t)-1 || wclen == (size_t)-2)
+				wclen = 1;
+			width++;
+		} else {
+			width += wcwidth(wc);
+		}
+		s += wclen;
+	}
+	return width;
+}
+#endif
+
 /*
  * Display() takes a linked list of FTSENT structures and passes the list
  * along with any other necessary information to the print function.  P
@@ -458,8 +490,13 @@ display(FTSENT *p, FTSENT *list)
continue;
}
}
+#ifndef SMALL
+		if (mbswidth(cur->fts_name) > maxlen)
+			maxlen = mbswidth(cur->fts_name);
+#else
 		if (cur->fts_namelen > maxlen)
 			maxlen = cur->fts_namelen;
+#endif
 		if (needstats) {
 			sp = cur->fts_statp;
 			if (sp->st_blocks > maxblock)
Index: util.c
===
RCS file: /cvs/src/bin/ls/util.c,v
retrieving revision 1.14
diff -u -p -r1.14 util.c
--- util.c  27 Oct 2009 23:59:21 -  1.14
+++ util.c  7 Aug 2010 09:16:03 -
@@ -41,6 +41,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <wchar.h>
 
 #include "ls.h"
 #include "extern.h"
@@ -49,9 +50,26 @@ int
 putname(char *name)
 {
 	int len;
-
+#ifndef SMALL
+	size_t wclen;
+	wchar_t wc;
+	mbstate_t mbs;
+
+	bzero(&mbs, sizeof(mbs));
+	for (len = 0; *name; len += wcwidth(wc), name += wclen) {
+		wclen = mbrtowc(&wc, name, MB_CUR_MAX, &mbs);
+		if (wclen == (size_t)-1 || wclen == (size_t)-2) {
+			wclen = 1;
+			wc = '?';
+		} else {
+			wc = (!iswprint(wc) && f_nonprint) ? '?' : wc;
+		}
+		putwchar(wc);
+	}
+#else
 	for (len = 0; *name; len++, name++)
 		putchar((!isprint(*name) && f_nonprint) ? '?' : *name);
+#endif
 	return len;
 }
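
For reference, the width calculation the diff is after can be sketched standalone. This is an illustration under stated assumptions, not a drop-in for ls: display_width() is a made-up name, and counting one column for each invalid byte or non-printable character assumes such bytes are shown as '?'. mbrtowc() reports errors as (size_t)-1 or (size_t)-2, so the result is compared against those values rather than tested for being negative.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* some libcs hide wcwidth() without it */
#endif
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int wcwidth(wchar_t);	/* POSIX prototype; harmless if already declared */

/*
 * Columns needed to print the multibyte string s in the current locale.
 * Invalid bytes and non-printable characters count as one column each.
 */
int
display_width(const char *s)
{
	mbstate_t mbs;
	wchar_t wc;
	size_t n;
	int w, width = 0;

	memset(&mbs, 0, sizeof(mbs));
	while (*s != '\0') {
		n = mbrtowc(&wc, s, MB_CUR_MAX, &mbs);
		if (n == (size_t)-1 || n == (size_t)-2) {
			/* Bad or truncated sequence: skip one byte, reset state. */
			memset(&mbs, 0, sizeof(mbs));
			s++;
			width++;
			continue;
		}
		if (n == 0)	/* decoded a NUL; nothing left to measure */
			break;
		w = iswprint((wint_t)wc) ? wcwidth(wc) : -1;
		width += (w < 0) ? 1 : w;
		s += n;
	}
	return width;
}
```

The caller is expected to have called setlocale(LC_ALL, "") first; without that the function measures in the "C" locale.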



Re: ls(1) multibyte support

2011-01-04 Thread Alexander Polakov
* Stefan Sperling s...@stsp.name [110104 23:12]:
 On Tue, Jan 04, 2011 at 09:14:51PM +0300, Alexander Polakov wrote:
  Hi,
  
  I wonder if there are any plans to add multibyte support to ls(1)?
  Or maybe there's a reason why it's not a great idea (which I am not
  aware of)?
  Anyway, here's a patch I have. It's based on DragonFlyBSD's ls.
  
 
 Any locale stuff added to applications that are used on the ramdisk
 (bsd.rd) must be inside #ifndef SMALL.
 The ls binary is linked statically so we need to prevent it from wasting
 space by pulling citrus stuff onto the ramdisk.

Sure.
 
 More importantly, there is an alleged bug in our wcwidth() implementation.
 I haven't had time to investigate, but it has been pointed out on separate
 occasions, by Jordi Beltran Creix and by n...@.
 Test program (from Jordi):
 
   #include <stdio.h>
   #include <locale.h>
   #include <wchar.h>

   int
   main(void)
   {
         setlocale(LC_ALL, "");
         printf("%d %d %d %d\n", wcwidth(0x53DA), wcwidth('A'),
             wcwidth(0x200B), wcwidth(0x1F));
         return 0;
   }

 Output is 2, 1, 1, 0; it should be 2, 1, 0, -1 (according to Jordi).
 
 We should make sure that wcwidth() is working properly before changing
 applications to use it. We also need a wcwidth() man page.

I think there are two separate bugs, and I have two fixes (neither one
tested).

1) wcwidth(0x200B)
This is from http://unicode.org/Public/UNIDATA/ :

200B;ZERO WIDTH SPACE;Cf;0;BN;N;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;N;
200D;ZERO WIDTH JOINER;Cf;0;BN;N;

--- share/locale/ctype/en_US.UTF-8.src.orig Tue Jan  4 22:49:22 2011
+++ share/locale/ctype/en_US.UTF-8.src  Tue Jan  4 22:50:55 2011
@@ -1672,7 +1672,8 @@
 BLANK 0x2000 - 0x200b  0x202f  0x205f
 PRINT 0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
 PRINT 0x205f
-SWIDTH1   0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH1   0x2000 - 0x200c  0x2010 - 0x2029  0x202f - 0x2052  0x2057
+SWIDTH0   0x200b - 0x200d
 SWIDTH1   0x205f
 

2) wcwidth(0x1f)

DragonFly's man page for wcwidth(3) says that the function returns -1 if
the character is not printable. _RUNETYPE_R is the flag to check.

--- lib/libc/locale/iswctype.c.orig Tue Jan  4 23:12:23 2011
+++ lib/libc/locale/iswctype.c  Tue Jan  4 23:02:37 2011
@@ -170,7 +170,11 @@
 int
 wcwidth(wchar_t c)
 {
-	return (((unsigned)__runetype_w(c) & _CTYPE_SWM) >> _CTYPE_SWS);
+	_RuneType r;
+	r = __runetype_w(c);
+	if (r & _RUNETYPE_R)
+		return (((unsigned)r & _CTYPE_SWM) >> _CTYPE_SWS);
+	return -1;
 }
 
 wctrans_t

Again, I don't have hardware at hand to build libc so this is completely
untested.
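
On the caller side, the -1 return from 2) has to be handled explicitly. A minimal sketch, assuming the caller wants the same contract as POSIX wcswidth() (give up with -1 as soon as one character is not printable); wide_width() is a made-up name, and the extra iswprint() test is only there so the sketch behaves the same on a libc without the fix.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* some libcs hide wcwidth() without it */
#endif
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

int wcwidth(wchar_t);	/* POSIX prototype; harmless if already declared */

/*
 * Total columns for up to n wide characters of ws, stopping at a NUL.
 * Returns -1 if any character is not printable.
 */
int
wide_width(const wchar_t *ws, size_t n)
{
	size_t i;
	int w, total = 0;

	for (i = 0; i < n && ws[i] != L'\0'; i++) {
		if (!iswprint((wint_t)ws[i]))
			return -1;
		w = wcwidth(ws[i]);
		if (w < 0)
			return -1;
		total += w;
	}
	return total;
}
```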



Re: ls(1) multibyte support

2011-01-04 Thread Jordi Beltran Creix
2011/1/5 Alexander Polakov polac...@gmail.com:
 1) wcwidth(0x200B)
 This is from http://unicode.org/Public/UNIDATA/ :

 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;N;
 200D;ZERO WIDTH JOINER;Cf;0;BN;N;

 --- share/locale/ctype/en_US.UTF-8.src.orig Tue Jan  4 22:49:22 2011
 +++ share/locale/ctype/en_US.UTF-8.src  Tue Jan  4 22:50:55 2011
 @@ -1672,7 +1672,8 @@
  BLANK 0x2000 - 0x200b  0x202f  0x205f
  PRINT 0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
  PRINT 0x205f
 -SWIDTH1   0x2000 - 0x200b  0x2010 - 0x2029  0x202f - 0x2052  0x2057
 +SWIDTH1   0x2000 - 0x200c  0x2010 - 0x2029  0x202f - 0x2052  0x2057
 +SWIDTH0   0x200b - 0x200d
  SWIDTH1   0x205f

That only solves the test case. All combining characters (diacritical
marks), including 0x300, should be zero-width as well.

The accepted interpretation of the Unicode rules appears to be that the
Cf, Me and Mn categories, plus or minus a few characters, are
zero-width; see the comments in:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

That file also happens to be in xenocara/app/xterm/wcwidth.c, so that
was the behavior in xterm until (I assume) it started using the system
version.

The database file in OpenBSD is just too old; the same file was fixed
in FreeBSD in 2006, see:
http://code.bsd64.org/cvsweb/freebsd/src/share/mklocale/UTF-8.src
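
As a rough illustration of the table-driven approach (a sorted list of zero-width ranges searched by bisection), here is a sketch. The ranges are just a handful of the SWIDTH0 entries posted in this thread, not a complete table, and the function and type names are made up for the example.

```c
#include <stddef.h>

/* Inclusive range of code points. */
struct interval {
	unsigned long first;
	unsigned long last;
};

/* Small excerpt of zero-width ranges; must stay sorted and non-overlapping. */
static const struct interval zero_width[] = {
	{ 0x0300, 0x034e }, { 0x0350, 0x036f },	/* combining diacritics */
	{ 0x0483, 0x0486 }, { 0x0488, 0x0489 },	/* Cyrillic combining */
	{ 0x200b, 0x200d },			/* ZWSP, ZWNJ, ZWJ */
	{ 0x20d0, 0x20eb },			/* combining marks for symbols */
	{ 0x3099, 0x309a },			/* kana voicing marks */
	{ 0xfe20, 0xfe23 },			/* combining half marks */
};

/* Binary search for ucs in a sorted interval table. */
static int
in_table(unsigned long ucs, const struct interval *t, size_t n)
{
	size_t lo = 0, hi = n, mid;

	while (lo < hi) {
		mid = lo + (hi - lo) / 2;
		if (ucs > t[mid].last)
			lo = mid + 1;
		else if (ucs < t[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* 1 if ucs falls in one of the zero-width ranges above, else 0. */
int
is_zero_width(unsigned long ucs)
{
	return in_table(ucs, zero_width,
	    sizeof(zero_width) / sizeof(zero_width[0]));
}
```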