Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-02-04 Thread Vincent Lefevre
On 2007-01-19 03:43:02 +0100, Vincent Lefevre wrote:
 On 2007-01-18 17:39:40 +0100, Bruno Haible wrote:
  Vincent, do you have time to report that to the Apple people? No need to
  mention 'ls' - a simple
  
printf 'E\xcc\x81\t2nd column\nFoo\t2nd column\n'
  
  should be all you need to demonstrate the bug. I'm not in such a good
  position to report it, since I'm using an older version of MacOS X.
 
 Done. FYI, the ID is 4940781 (but since the bug reports are not public,
 I doubt this ID is useful). However I have reported several bugs for
 more than a year, and none of them are fixed.

Fixed by Apple.

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-19 Thread Jim Meyering
Vincent Lefevre [EMAIL PROTECTED] wrote:
 On 2007-01-19 01:23:44 +0100, Bruno Haible wrote:
 Apple Terminal version 1.4.6, part of MacOS X 10.3.9, is affected.

 I forgot to say. This is still not fixed in Terminal 1.5 (133) from
 Mac OS X 10.4.8.

Thanks.
I've checked this in:

* coreutils.texi (ls: General output formatting): Mention the
workarounds to accommodate the Apple Terminal bug.

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 6fc6704..89e97d8 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -6419,6 +6419,13 @@ Assume that each tab stop is @var{cols} columns wide.  The default is 8.
 @command{ls} uses tabs where possible in the output, for efficiency.  If
 @var{cols} is zero, do not use tabs at all.
 
+@c FIXME: remove in 2009, if Apple Terminal has been fixed for long enough.
+Some terminal emulators (at least Apple Terminal 1.5 (133) from Mac OS X 10.4.8)
+do not properly align columns to the right of a TAB following a
+non-@sc{ascii} byte.  If you use such a terminal emulator, use the
+@option{-T0} option or put @code{TABSIZE=0} in your environment to tell
+@command{ls} to align using spaces, not tabs.
+
 @item -w
 @itemx [EMAIL PROTECTED]
 @opindex -w




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread sci-fi

On 2007-01-15 21:05:53 -0600, Vincent Lefevre [EMAIL PROTECTED] said:


Hi,

Under Mac OS X 10.4.8 with ls (GNU coreutils) 5.97 (installed via
MacPorts), in a 80-column terminal (uxterm), I get:

$ ls
É                               y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

instead of:

$ ls
É                                y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

Note:

$ locale
LANG=POSIX
LC_COLLATE=POSIX
LC_CTYPE=en_US.UTF-8
LC_MESSAGES=POSIX
LC_MONETARY=POSIX
LC_NUMERIC=POSIX
LC_TIME=POSIX
LC_ALL=POSIX/en_US.UTF-8/POSIX/POSIX/POSIX/POSIX

Regards,


How to reproduce, please?

Does changing the Apple Terminal Window Settings (aka Terminal Inspector)
help?  In particular, select the tab named Display, and try the first
three checkmarks under the Text Font section there.  Sometimes the
Anti-Alias setting is enough to push the width of the character cell
over so that the rest of the printed line lines up properly.  The next
two checkmarks are for wide glyphs; sometimes Terminal needs to be
fooled with these settings for accented chars anyway.

How does iTerm behave?  They've been working on some enhancements of
their own (nevermind Apple ;) ).




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Bruno Haible
Vincent Lefevre wrote:
 Hmm... I forgot that ls was an alias (the same one on all my accounts).
 So, back on Mac OS X:
 
 prunille:~/blah \ls -C --color=always | hexdump -C
 00000000  1b 5b 30 30 6d 1b 5b 30  6d 45 cc 81 1b 5b 30 30  |.[00m.[0mE�..[00|
 00000010  6d 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |m               |
 00000020  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
 00000030  1b 5b 30 6d 79 31 32 33  34 35 36 37 38 39 30 31  |.[0my12345678901|
 00000040  32 33 34 35 36 37 38 39  30 31 32 33 34 35 36 37  |2345678901234567|
 00000050  38 39 30 1b 5b 30 30 6d  0a 1b 5b 30 6d 78 31 32  |890.[00m..[0mx12|
 00000060  33 34 35 36 37 38 39 30  31 32 33 34 35 36 37 38  |3456789012345678|
 00000070  39 30 31 32 33 34 35 36  37 38 39 30 1b 5b 30 30  |901234567890.[00|
 00000080  6d 20 20 1b 5b 30 6d 7a  31 32 33 34 35 36 37 38  |m  .[0mz12345678|
 00000090  39 30 31 32 33 34 35 36  37 38 39 30 31 32 33 34  |9012345678901234|
 000000a0  35 36 37 38 39 30 1b 5b  30 30 6d 0a 1b 5b 6d     |567890.[00m..[m|
 000000af

That makes - except for the escape sequences - an E, a combining accent and
31 spaces. So it's the same bug as in ls -C -T0.

  I see that the first call to wcwidth() gives: wcwidth(0x0301) = 1.
  U+0301 is COMBINING ACUTE ACCENT. So here is the problem: MacOS'
  wcwidth is buggy for combining characters like accents.
 
 OK. Can't autoconf detect that and use another implementation?

Yes. We can do that in gnulib. I'll work on this issue in the next few weeks.
Please remind us (on the bug-gnulib mailing list) in 1 or 2 months.

And, as we have seen, the other issue is that Apple Terminal has problems
estimating the width of tabs when there are non-ASCII characters. Since
you can start a telnet/ssh session from MacOS X to any platform (Linux,
Solaris, etc.), the fix needs to be platform independent. Here is such a fix:


2007-01-18  Bruno Haible  [EMAIL PROTECTED]

Avoid problems with tabs after non-ASCII characters in some terminals.
* src/ls.c (nonascii_in_this_line): New variable.
(quote_name): Update nonascii_in_this_line.
(print_many_per_line, print_horizontal): Set nonascii_in_this_line to
false at the beginning of each line.
(indent): Use spaces for indentation when nonascii_in_this_line.

diff -c -3 -r1.447 ls.c
*** src/ls.c2 Jan 2007 06:29:12 -   1.447
--- src/ls.c18 Jan 2007 14:38:14 -
***
*** 851,856 
--- 851,859 
 for the separating white space.  */
  #define MIN_COLUMN_WIDTH  3
  
+ /* True if some non-ASCII character has been output on this line.  */
+ static bool nonascii_in_this_line;
+ 
  
  /* This zero-based index is used solely with the --dired option.
 When that option is in effect, this counter is incremented for each
***
*** 3704,3710 
  }
  
if (out != NULL)
! fwrite (buf, 1, len, out);
if (width != NULL)
  *width = displayed_width;
return len;
--- 3702,3722 
  }
  
if (out != NULL)
! {
!   /* Update nonascii_in_this_line indicator.  */
!   char const *p = buf;
!   char const *plimit = buf + len;
! 
!   for (; p < plimit; p++)
!   if (!isascii (to_uchar (*p)))
! {
!   nonascii_in_this_line = true;
!   break;
! }
! 
!   /* Actually output the quoted representation.  */
!   fwrite (buf, 1, len, out);
! }
if (width != NULL)
  *width = displayed_width;
return len;
***
*** 3957,3962 
--- 3969,3975 
size_t pos = 0;
  
/* Print the next row.  */
+   nonascii_in_this_line = false;
while (1)
{
  size_t name_length = length_of_file_name_and_frills (files + filesno);
***
*** 3984,3989 
--- 3997,4004 
size_t name_length = length_of_file_name_and_frills (files);
size_t max_name_length = line_fmt->col_arr[0];
  
+   nonascii_in_this_line = false;
+ 
/* Print first entry.  */
print_file_name_and_frills (files);
  
***
*** 3996,4001 
--- 4011,4017 
{
  putchar ('\n');
  pos = 0;
+ nonascii_in_this_line = false;
}
else
{
***
*** 4047,4060 
  }
  
  /* Assuming cursor is at position FROM, indent up to position TO.
!Use a TAB character instead of two or more spaces whenever possible.  */
  
  static void
  indent (size_t from, size_t to)
  {
while (from < to)
  {
!   if (tabsize != 0 && to / tabsize > (from + 1) / tabsize)
{
  putchar ('\t');
  from += tabsize - from % tabsize;
--- 4063,4085 
  }
  
  /* Assuming cursor is at position FROM, indent up to position TO.
!Use a TAB character instead of two or more spaces whenever possible.
!Depends on the TABSIZE option and on the current value of
!NONASCII_IN_THIS_LINE.  */
  
  static void
  indent (size_t from, size_t to)
  {
while (from  

Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Jim Meyering
Bruno Haible [EMAIL PROTECTED] wrote:
 Vincent Lefevre wrote:
...
  I see that the first call to wcwidth() gives: wcwidth(0x0301) = 1.
  U+0301 is COMBINING ACUTE ACCENT. So here is the problem: MacOS'
  wcwidth is buggy for combining characters like accents.

 OK. Can't autoconf detect that and use another implementation?

 Yes. We can do that in gnulib. I'll work on this issue in the next few weeks.
 Please remind us (on the bug-gnulib mailing list) in 1 or 2 months.

Thanks for volunteering to do that.

 And, as we have seen, the other issue is that Apple Terminal has problems
 estimating the width of tabs when there are non-ASCII characters. Since
 you can start an telnet/ssh session from MacOS X to any platform (Linux,
 Solaris, etc.), the fix needs to be platform independent. Here is such a fix:

 2007-01-18  Bruno Haible  [EMAIL PROTECTED]

   Avoid problems with tabs after non-ASCII characters in some terminals.
   * src/ls.c (nonascii_in_this_line): New variable.
   (quote_name): Update nonascii_in_this_line.
   (print_many_per_line, print_horizontal): Set nonascii_in_this_line to
   false at the beginning of each line.
   (indent): Use spaces for indentation when nonascii_in_this_line.

Thank you for working on this.
As I understand the goal, you'd like to make ls act differently
(outputting spaces, not TABs, for column alignment) on all systems
for each line containing a non-ASCII byte.  The proposed change in
behavior would serve solely to make it so columns line up better when
displaying on a buggy Apple Terminal.

That change would contradict the documentation of -T, but more
importantly, it would make the output significantly larger when there are
wide columns and many lines containing a non-ASCII byte, thus penalizing
all users in order to cater to a buggy terminal emulator.

I would rather simply have someone who cares about Apple Terminal
report the bug, and in the mean time, advise people to use -T0
(or set TABSIZE=0 in their environment) if they care about alignment
when using a buggy version of that particular terminal emulator.




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Bruno Haible
Jim Meyering wrote:
 As I understand the goal, you'd like to make ls act differently
 (outputting spaces, not TABs, for column alignment) on all systems
 for each line containing a non-ASCII byte.

Yes, this is what the proposed patch does.

 That change would contradict the documentation of -T

The --color option also has the effect of turning tabs into spaces; yet this
is undocumented. Actually the doc states

 `ls' uses tabs where possible in the output, for efficiency.  If
 COLS is zero, do not use tabs at all.

and the phrase "where possible" is vague enough. It is not possible to use
tabs with --color, and it is not possible to use tabs after non-ASCII
characters.

 but more 
 importantly, it would make the output significantly larger when there are
 wide columns and many lines containing a non-ASCII byte, thus penalizing
 all users in order to cater to a buggy terminal emulator.

I thought with xterm, as with most terminal emulators, the network transmit
time is negligible compared to the rendering time on the X side. Besides
that, your argument trades correctness of display against efficiency.

 I would rather simply have someone who cares about Apple Terminal
 report the bug, and in the mean time, advise people to use -T0
 (or set TABSIZE=0 in their environment) if they care about alignment
 when using a buggy version of that particular terminal emulator.

Vincent, do you have time to report that to the Apple people? No need to
mention 'ls' - a simple

  printf 'E\xcc\x81\t2nd column\nFoo\t2nd column\n'

should be all you need to demonstrate the bug. I'm not in such a good
position to report it, since I'm using an older version of MacOS X.

Bruno




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Jim Meyering
Bruno Haible [EMAIL PROTECTED] wrote:
 Jim Meyering wrote:
 As I understand the goal, you'd like to make ls act differently
 (outputting spaces, not TABs, for column alignment) on all systems
 for each line containing a non-ASCII byte.

 Yes, this is what the proposed patch does.

 That change would contradict the documentation of -T

 The --color option also has the effect of turning tabs into spaces; yet this
 is undocumented. Actually the doc states

  `ls' uses tabs where possible in the output, for efficiency.  If
  COLS is zero, do not use tabs at all.

 and the phrase "where possible" is vague enough. It is not possible to use
 tabs with --color, and it is not possible to use tabs after non-ASCII
 characters.

Um... it *is* possible to use TABs after non-ASCII bytes and get correct
alignment.  The only requirement is that you be using a reasonable
(non-buggy) terminal emulator.

 but more
 importantly, it would make the output significantly larger when there are
 wide columns and many lines containing a non-ASCII byte, thus penalizing
 all users in order to cater to a buggy terminal emulator.

 I thought with xterm, as with most terminal emulators, the network transmit
 time is negligible compared to the rendering time on the X side. Besides
 that, your argument trades correctness of display against efficiency.

Not at all.  I merely refuse to pessimize ls output for everyone,
solely to accommodate some currently buggy version of Apple Terminal.

 I would rather simply have someone who cares about Apple Terminal
 report the bug, and in the mean time, advise people to use -T0
 (or set TABSIZE=0 in their environment) if they care about alignment
 when using a buggy version of that particular terminal emulator.

Do you really think it would be better to make everyone pay (even a tiny bit)
when there is such an easy work-around?




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Bruno Haible
Jim Meyering wrote:
 Um... it *is* possible to use TABs after non-ASCII bytes and get correct
 alignment.  The only requirement is that you be using a reasonable
 (non-buggy) terminal emulator.

Yes, sure. I was only pointing out that the proposed change wouldn't need
a doc change, because the wording in the doc is already vague.

  in the mean time, advise people to use -T0
  (or set TABSIZE=0 in their environment) if they care about alignment
  when using a buggy version of that particular terminal emulator.
 
 Do you really think it would be better to make everyone pay (even a tiny bit)
 when there is such an easy work-around?

Given that
  - Apple Terminal is the default/normal terminal emulator on MacOS X,
  - networking/pipe speed are not critical nowadays (in the times of
internet radio and streaming video),
  - the bug was tricky enough to analyze, that an average user couldn't do
it by himself,
I would say yes in this case.

Bruno




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Jim Meyering
Bruno Haible [EMAIL PROTECTED] wrote:
  in the mean time, advise people to use -T0
  (or set TABSIZE=0 in their environment) if they care about alignment
  when using a buggy version of that particular terminal emulator.

 Do you really think it would be better to make everyone pay (even a tiny bit)
 when there is such an easy work-around?

 Given that
   - Apple Terminal is the default/normal terminal emulator on MacOS X,
   - networking/pipe speed are not critical nowadays (in the times of
 internet radio and streaming video),
   - the bug was tricky enough to analyze, that an average user couldn't do
 it by himself,
 I would say yes in this case.

We disagree.

IMHO, it would be unwise to make such a global sacrifice for
a single, buggy, closed-source terminal emulator.

However, if someone tells me which version of Apple Terminal
is affected, I'll mention the work-around in the coreutils README file.




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Bruno Haible
Paul Eggert wrote:
 Long ago I regularly used terminal emulators that mishandled tabs.
 Eventually they got fixed (or I stopped using them).

Long ago I used terminals where the tab stops were customizable, and the
previous user had set them to weird values. At that time, I stopped using
tabs. :-)

Bruno





Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Vincent Lefevre
On 2007-01-18 17:39:40 +0100, Bruno Haible wrote:
 The --color option also has the effect of turning tabs into spaces;
 yet this is undocumented. Actually the doc states
 
  `ls' uses tabs where possible in the output, for efficiency.  If
  COLS is zero, do not use tabs at all.
 
 and the phrase "where possible" is vague enough. It is not possible
 to use tabs with --color, and it is not possible to use tabs after
 non-ASCII characters.

BTW, it shouldn't use tabs when the output does not correspond to a
terminal. For instance, the user may want to send the file by mail
or may want to indent it. Incorrect results can be obtained if there
are tabs.

A solution could be to have tabsize set to 0 by default. Users who
need 8 (or some other value) because of a slow network (one without
compression, since a sequence of spaces should compress well) could
change its value.

 Vincent, do you have time to report that to the Apple people? No need to
 mention 'ls' - a simple
 
   printf 'E\xcc\x81\t2nd column\nFoo\t2nd column\n'
 
 should be all you need to demonstrate the bug. I'm not in such a good
 position to report it, since I'm using an older version of MacOS X.

Done. FYI, the ID is 4940781 (but since the bug reports are not public,
I doubt this ID is useful). However I have reported several bugs for
more than a year, and none of them are fixed.

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-18 Thread Vincent Lefevre
On 2007-01-19 01:23:44 +0100, Bruno Haible wrote:
 Apple Terminal version 1.4.6, part of MacOS X 10.3.9, is affected.

I forgot to say. This is still not fixed in Terminal 1.5 (133) from
Mac OS X 10.4.8.

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-17 Thread Bruno Haible
Eric Blake wrote:
 coreutils does not handle multi-byte locales well.

True,

 The problem is that no one has yet written a patch that makes it
 easy to handle multibyte locales without penalizing single-byte locales.

There are patches for multibyte locale support for many of the text
utilities, written in 2001. They are based on the mbchar and mbiter
modules that are now in gnulib. But regardless how they were written,
Jim preferred not to use them:

  - If the code used multibyte functions always, it was too much of a
slowdown compared to the older implementation that worked only for
unibyte locales. Everyone agreed on this.

  - If the code used an

      if (MB_CUR_MAX > 1)
        ... code which uses mb* functions ...
      else
        ... unibyte code ...


Jim objected that there was too much code duplication between the
multibyte and the unibyte branch.

  - If the code used macros that can expand to multibyte or unibyte
primitives, depending on the situation, one could put the code
that uses these macros into a separate file, say,
fold-subroutines.h, and in the main fold.c write

   #define DO_MULTIBYTE 1
   #include "fold-subroutines.h" /* defines fold_multibyte */

   #define DO_UNIBYTE 1
   #include "fold-subroutines.h" /* defines fold_unibyte */

   if (MB_CUR_MAX > 1)
     fold_multibyte (...);
   else
     fold_unibyte (...);

Here Jim said that it was too many macros for him.

There has been no progress since then, since no one sees how one can
get all 3 of Jim's requirements simultaneously:

  - Good speed for the unibyte case.
  - No code duplication.
  - No macros.

Bruno




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-17 Thread Bruno Haible
Vincent Lefevre wrote:
  Therefore: can you also show wrong behaviour when you set
  LC_ALL=en_US.UTF-8 ?
 
 Yes:
 
 prunille:~/blah export LC_ALL=en_US.UTF-8
 prunille:~/blah locale
 LANG=POSIX
 LC_COLLATE=en_US.UTF-8
 LC_CTYPE=en_US.UTF-8
 LC_MESSAGES=en_US.UTF-8
 LC_MONETARY=en_US.UTF-8
 LC_NUMERIC=en_US.UTF-8
 LC_TIME=en_US.UTF-8
 LC_ALL=en_US.UTF-8
 prunille:~/blah ls
 É   y123456789012345678901234567890
 x123456789012345678901234567890  z123456789012345678901234567890

On MacOS X 10.3.9 I can reproduce this. Let's look at the hexdump of
ls' output:

1) In an Apple Terminal

2) In an xterm, launched with LC_ALL=en_US.UTF-8 xterm

3) In an xterm running on Linux, with an ssh to MacOS X

In all three cases the output of ls is the same:
$ LC_ALL=en_US.UTF-8 ls -C | hd
00  45 CC 81 09 09 09 09 20 79 31 32 33 34 35 36 37  E.. y1234567
10  38 39 30 31 32 33 34 35 36 37 38 39 30 31 32 33  8901234567890123
20  34 35 36 37 38 39 30 0A 78 31 32 33 34 35 36 37  4567890.x1234567
30  38 39 30 31 32 33 34 35 36 37 38 39 30 31 32 33  8901234567890123
40  34 35 36 37 38 39 30 20 20 7A 31 32 33 34 35 36  4567890  z123456
50  37 38 39 30 31 32 33 34 35 36 37 38 39 30 31 32  7890123456789012
60  33 34 35 36 37 38 39 30 0A   34567890.

You see, it starts with E and the accent (on MacOS X, filenames are
represented in decomposed Unicode form), followed by 4 tabs and a space.
So the second column of filenames should start in screen column 33
(where the leftmost is screen column 0). But the output in the terminal
looks like this:

1) In an Apple Terminal
É   y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

2), 3)
É                                y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

So what you see is that Apple Terminal has problems knowing the width
of combining characters like accents when it expands tabs. If you
tell 'ls' to emit spaces instead of tabs, like this:
  ls -C -T0
or
  TABSIZE=0 ls -C
then the output looks the same in all kinds of terminals.

Conclusion: What you see is not an ls bug, but an Apple Terminal bug
with tabs.

But there is an ls bug:

$ ls -C -T0
É                               y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890
$ ls -C -T0 | hd
00  45 CC 81 20 20 20 20 20 20 20 20 20 20 20 20 20  E.. 
10  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  
20  20 20 79 31 32 33 34 35 36 37 38 39 30 31 32 33y1234567890123
30  34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39  4567890123456789
40  30 0A 78 31 32 33 34 35 36 37 38 39 30 31 32 33  0.x1234567890123
50  34 35 36 37 38 39 30 31 32 33 34 35 36 37 38 39  4567890123456789
60  30 20 20 7A 31 32 33 34 35 36 37 38 39 30 31 32  0  z123456789012
70  33 34 35 36 37 38 39 30 31 32 33 34 35 36 37 38  3456789012345678
80  39 30 0A 90.

What 'ls' here outputs is: an E, a combining accent and 31 spaces - text
that moves to column 32, not 33. When I set a breakpoint in wcwidth,
I see that the first call to wcwidth() gives: wcwidth(0x0301) = 1.
U+0301 is COMBINING ACUTE ACCENT. So here is the problem: MacOS'
wcwidth is buggy for combining characters like accents.

Bruno


(*) 'hd' is a shell script:
#!/bin/sh
hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"' "$@"





Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-17 Thread Vincent Lefevre
On 2007-01-18 03:14:37 +0100, Bruno Haible wrote:
 Conclusion: What you see is not an ls bug, but an Apple Terminal bug
 with tabs.

I don't use the Apple Terminal (and never use it). As I said in my
bug report, I'm using uxterm here. More precisely:

prunille:~ uxterm -version
XFree86 4.3.99.903(184)

With the same uxterm, after a ssh to a Linux machine:

vin:~tmp/blah LC_ALL=en_US.UTF-8 \ls -C | hd
00000000  45 cc 81 09 09 09 09 20  79 31 32 33 34 35 36 37  |E...... y1234567|
00000010  38 39 30 31 32 33 34 35  36 37 38 39 30 31 32 33  |8901234567890123|
00000020  34 35 36 37 38 39 30 0a  78 31 32 33 34 35 36 37  |4567890.x1234567|
00000030  38 39 30 31 32 33 34 35  36 37 38 39 30 31 32 33  |8901234567890123|
00000040  34 35 36 37 38 39 30 20  20 7a 31 32 33 34 35 36  |4567890  z123456|
00000050  37 38 39 30 31 32 33 34  35 36 37 38 39 30 31 32  |7890123456789012|
00000060  33 34 35 36 37 38 39 30  0a                       |34567890.|
00000069
vin:~tmp/blah LC_ALL=en_US.UTF-8 \ls -C
É                                y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

No problem.

Hmm... I forgot that ls was an alias (the same one on all my accounts).
So, back on Mac OS X:

prunille:~/blah \ls
É                                y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890
prunille:~/blah \ls --color=always
É                               y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

prunille:~/blah \ls -C | hexdump -C
00000000  45 cc 81 09 09 09 09 20  79 31 32 33 34 35 36 37  |E�..... y1234567|
00000010  38 39 30 31 32 33 34 35  36 37 38 39 30 31 32 33  |8901234567890123|
00000020  34 35 36 37 38 39 30 0a  78 31 32 33 34 35 36 37  |4567890.x1234567|
00000030  38 39 30 31 32 33 34 35  36 37 38 39 30 31 32 33  |8901234567890123|
00000040  34 35 36 37 38 39 30 20  20 7a 31 32 33 34 35 36  |4567890  z123456|
00000050  37 38 39 30 31 32 33 34  35 36 37 38 39 30 31 32  |7890123456789012|
00000060  33 34 35 36 37 38 39 30  0a                       |34567890.|
00000069

prunille:~/blah \ls -C --color=always | hexdump -C
00000000  1b 5b 30 30 6d 1b 5b 30  6d 45 cc 81 1b 5b 30 30  |.[00m.[0mE�..[00|
00000010  6d 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |m               |
00000020  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000030  1b 5b 30 6d 79 31 32 33  34 35 36 37 38 39 30 31  |.[0my12345678901|
00000040  32 33 34 35 36 37 38 39  30 31 32 33 34 35 36 37  |2345678901234567|
00000050  38 39 30 1b 5b 30 30 6d  0a 1b 5b 30 6d 78 31 32  |890.[00m..[0mx12|
00000060  33 34 35 36 37 38 39 30  31 32 33 34 35 36 37 38  |3456789012345678|
00000070  39 30 31 32 33 34 35 36  37 38 39 30 1b 5b 30 30  |901234567890.[00|
00000080  6d 20 20 1b 5b 30 6d 7a  31 32 33 34 35 36 37 38  |m  .[0mz12345678|
00000090  39 30 31 32 33 34 35 36  37 38 39 30 31 32 33 34  |9012345678901234|
000000a0  35 36 37 38 39 30 1b 5b  30 30 6d 0a 1b 5b 6d     |567890.[00m..[m|
000000af

 But there is an ls bug:
 
 $ ls -C -T0
 É   y123456789012345678901234567890
 x123456789012345678901234567890  z123456789012345678901234567890
 $ ls -C -T0 | hd
 00  45 CC 81 20 20 20 20 20 20 20 20 20 20 20 20 20  E.. 
 10  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  
 20  20 20 79 31 32 33 34 35 36 37 38 39 30 31 32 33y1234567890123
[...]

OK, so I think I was seeing this bug.

 What 'ls' here outputs is: an E, a combining accent and 31 spaces - text
 that moves to column 32, not 33. When I set a breakpoint in wcwidth,
 I see that the first call to wcwidth() gives: wcwidth(0x0301) = 1.
 U+0301 is COMBINING ACUTE ACCENT. So here is the problem: MacOS'
 wcwidth is buggy for combining characters like accents.

OK. Can't autoconf detect that and use another implementation?

 (*) 'hd' is a shell script:
 #!/bin/sh
 hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"' "$@"

It's a bit like (or identical to) hexdump -C, then.

Regards,

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-16 Thread Vincent Lefevre
On 2007-01-15 22:29:41 -0800, Paul Eggert wrote:
 Most likely this has something to do with how mbrtowc and/or
 wcwidth behaves on MacOS X.  Perhaps you can debug the quote_name
 function of 'ls' on the affected file name, and see why it's
 computing the width that it's computing?

First, do you know any freely available test suite for functions such
as mbrtowc and wcwidth? It would be easier to know where the problem
is.

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-16 Thread Andreas Schwab
Vincent Lefevre [EMAIL PROTECTED] writes:

 First, do you know any freely available test suite for functions such
 as mbrtowc and wcwidth? It would be easier to know where the problem
 is.

There are some tests in glibc.  For most of them it should be possible to
run them standalone, too.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.




Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-15 Thread Vincent Lefevre
Hi,

Under Mac OS X 10.4.8 with ls (GNU coreutils) 5.97 (installed via
MacPorts), in a 80-column terminal (uxterm), I get:

$ ls
É                               y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

instead of:

$ ls
É                                y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

Note:

$ locale
LANG=POSIX
LC_COLLATE=POSIX
LC_CTYPE=en_US.UTF-8
LC_MESSAGES=POSIX
LC_MONETARY=POSIX
LC_NUMERIC=POSIX
LC_TIME=POSIX
LC_ALL=POSIX/en_US.UTF-8/POSIX/POSIX/POSIX/POSIX

Regards,

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-15 Thread Eric Blake

According to Vincent Lefevre on 1/15/2007 8:05 PM:
 Hi,
 
 Under Mac OS X 10.4.8 with ls (GNU coreutils) 5.97 (installed via
 MacPorts), in a 80-column terminal (uxterm), I get:
 
 $ ls
 É   y123456789012345678901234567890
 x123456789012345678901234567890  z123456789012345678901234567890

This is yet another symptom of a much larger issue - namely, coreutils
does not handle multi-byte locales well.  The problem is that no one has
yet written a patch that makes it easy to handle multibyte locales without
penalizing single-byte locales.

- --
Don't work too hard, make some time for fun as well!

Eric Blake [EMAIL PROTECTED]




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-15 Thread Vincent Lefevre
On 2007-01-15 20:13:02 -0700, Eric Blake wrote:
 According to Vincent Lefevre on 1/15/2007 8:05 PM:
  Under Mac OS X 10.4.8 with ls (GNU coreutils) 5.97 (installed via
  MacPorts), in a 80-column terminal (uxterm), I get:
  
  $ ls
  É   y123456789012345678901234567890
  x123456789012345678901234567890  z123456789012345678901234567890
 
 This is yet another symptom of a much larger issue - namely,
 coreutils does not handle multi-byte locales well. The problem is
 that no one has yet written a patch that makes it easy to handle
 multibyte locales without penalizing single-byte locales.

But I don't have this problem under Linux (Debian). Note: with the
example above, one needs LC_COLLATE=en_US.UTF-8 so that the É comes
first.

$ ls
É                                y123456789012345678901234567890
x123456789012345678901234567890  z123456789012345678901234567890

In fact the problem seems to be due to the combining character under
Mac OS X. The filename É is encoded as 45 cc 81.

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)




Re: Alignment bug in ls with UTF-8 filenames under Mac OS X

2007-01-15 Thread Paul Eggert
Vincent Lefevre [EMAIL PROTECTED] writes:

 In fact the problem seems to be due to the combining character under
 Mac OS X. The filename É is encoded as 45 cc 81.

Most likely this has something to do with how mbrtowc and/or
wcwidth behaves on MacOS X.  Perhaps you can debug the quote_name
function of 'ls' on the affected file name, and see why it's
computing the width that it's computing?

