Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Baptiste Daroussin
On Wed, Nov 25, 2015 at 06:31:16PM +0300, Andrey Chernov wrote:
> On 25.11.2015 18:12, Andrey Chernov wrote:
> > On 25.11.2015 17:35, Baptiste Daroussin wrote:
> >>> BTW, array size looks suspicious:
> >>> static wchar_t wab_months[12][MAX_ABMON_WIDTH * 2 * MB_LEN_MAX];
> >>> what MB_LEN_MAX doing here? This constant is for multiple-bytes encoded,
> >>> not for wide chars.
> >> Bad copy/paste sorry it should be "MAX_ABMON_WIDTH * 2"
> > 
> > I don't check deep enough, it seems first array
> > MAX_ABMON_WIDTH * MB_LEN_MAX + 1
> > and second one
> > MAX_ABMON_WIDTH * 2 + 1
> > 
> 
> No. We can't assume anything here and should integrate limits from the
> locale for months fields instead. F.e. in abstract general case in wide
> array can be 100 zero-width characters + 5 of normal characters, so
> width-oriented sizes not prevents overflowing.
> 
> First array size should be from locale internals,

There is no max size for an abbreviated month in the locale internals
So maybe the best here is to consider your first proposal:
MAX_ABMON_WIDTH * MB_LEN_MAX + 1 and MAX_ABMON_WIDTH * 2 + 1
Check that it does not overflow and fallack if it overflows.

Having 100 zero-width characters in a locale have very little probability to
happen and if it does one day, it problably means something went wrong in the
code generation

Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Andrey Chernov
On 25.11.2015 18:12, Andrey Chernov wrote:
> On 25.11.2015 17:35, Baptiste Daroussin wrote:
>>> BTW, array size looks suspicious:
>>> static wchar_t wab_months[12][MAX_ABMON_WIDTH * 2 * MB_LEN_MAX];
>>> what MB_LEN_MAX doing here? This constant is for multiple-bytes encoded,
>>> not for wide chars.
>> Bad copy/paste sorry it should be "MAX_ABMON_WIDTH * 2"
> 
> I don't check deep enough, it seems first array
> MAX_ABMON_WIDTH * MB_LEN_MAX + 1
> and second one
> MAX_ABMON_WIDTH * 2 + 1
> 

No. We can't assume anything here and should integrate limits from the
locale for months fields instead. F.e. in abstract general case in wide
array can be 100 zero-width characters + 5 of normal characters, so
width-oriented sizes not prevents overflowing.

First array size should be from locale internals,
second one == first * sizeof(wchar_t)
it will be safe for not overflowing.

-- 
http://ache.vniz.net/



signature.asc
Description: OpenPGP digital signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Andrey Chernov
On 25.11.2015 17:35, Baptiste Daroussin wrote:
>> BTW, array size looks suspicious:
>> static wchar_t wab_months[12][MAX_ABMON_WIDTH * 2 * MB_LEN_MAX];
>> what MB_LEN_MAX doing here? This constant is for multiple-bytes encoded,
>> not for wide chars.
> Bad copy/paste sorry it should be "MAX_ABMON_WIDTH * 2"

I don't check deep enough, it seems first array
MAX_ABMON_WIDTH * MB_LEN_MAX + 1
and second one
MAX_ABMON_WIDTH * 2 + 1

-- 
http://ache.vniz.net/



signature.asc
Description: OpenPGP digital signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Baptiste Daroussin
On Wed, Nov 25, 2015 at 04:34:17PM +0300, Andrey Chernov wrote:
> On 25.11.2015 15:53, Baptiste Daroussin wrote:
> > What I did for now is set max_month_width to -1 and in ls_strftime I 
> > fallback on
> > the plain strftime meaning you keep localized information but the 
> > alignement is
> > broken as of now.
> 
> It will be enough.
> 
> >> 3) wcwidth/wcswidth may return -1 too, it needs to be checked too.
> > done and truncate the name of the month to the latest valid character
> 
> I think there is no point for sofisticated truncating, if locale is not
> valid. F.e. you may truncate to just 1 char. The thing above will be
> enough for all such cases.
> 
> > Review updated (if you prefer a diff by mail just tell me, given you do not 
> > have
> > a phabricator account.)
> 
> I don't have phabricator.
> 
> This one
> if ((n = max_month_width - wab_months_width[i]) > 0) {
> wcslcat(wab_months[i], L" "/* MAX_ABMON_WIDTH */,
> max_month_width + 1);
> }
> should be
> if ((n = max_month_width - wab_months_width[i]) > 0)
> wcslcat(wab_months[i], L" "/* MAX_ABMON_WIDTH */, n);
> 
> I.e. you append n spaces and n is the difference between max and current.

With wcsncat I would agree, but not with wcslcat.

wcslcat works like strlcat:
to quote the manpage:
"It will append at most dstsize - strlen(dst) - 1 characters."
So here with your version it will be n - wcslen(wab_months[i]) -1
which won't fit what we want to do.

btw that makes me figure out that what I want is wcsncat because we do care to
make sure we have appended the right number of spaces at the end of the
abbreviated month based on character width, not the on the len of the wide
string

so I changed it

if ((n = max_month_width - wab_months_width[i]) > 0)
wcsncat(wab_months[i], L" "/* MAX_ABMON_WIDTH */, n);

If you do not have further comments, I will commit that later today.

Then I will work on libarchive and pax (would you like to review the patches for
those?)

Best regards,
Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Baptiste Daroussin
On Wed, Nov 25, 2015 at 04:31:21AM +0300, Andrey Chernov wrote:
> On 25.11.2015 3:15, Baptiste Daroussin wrote:
> > On Sat, Nov 21, 2015 at 11:57:46PM +0300, Andrey Chernov wrote:
> >> On 21.11.2015 15:18, Ed Schouten wrote:
> >>> Hi Baptiste,
> >>>
> >>> I suppose you should use the wcswidth() function somewhere to compute
> >>> the visible width of the month name. Some characters may be
> >>> double-width, others may have no effective width at all.
> >>>
> >>
> >> I agree. Checking error return of wide chars functions with some
> >> fallback will be good too.
> > 
> > I have updated the code https://reviews.freebsd.org/D4239
> > 
> > Tested by modifying some locales to add double width and zero width unicode 
> > in
> > the locales
> > 
> > Also added the error checking for the return of wide chars functions. For 
> > now I
> > haven't added fallback, suggestions welcome if needed.
> 
> 1) For just 1 char in wcswidth(_months[i][j], 1); it is better to
> use another function wcwidth(wab_months[i][j]);

done
> 
> 2) By fallback I mean something which not stops ls working with
> incorrect for some reason locale, like setting max_width_month to
> MAX_ABMON_WIDTH on error return (from
> mbstowcs/wcwidth/wcswidth/wcswidth) and exit from
> populate_abbreviated_month().

Not that easy to provide a fallback (or better to say I can't find a proper way)
it if fails on mbstocws then ab_months is not populated so unusable.
What I did for now is set max_month_width to -1 and in ls_strftime I fallback on
the plain strftime meaning you keep localized information but the alignement is
broken as of now.

> 
> 3) wcwidth/wcswidth may return -1 too, it needs to be checked too.
done and truncate the name of the month to the latest valid character
> 
> 4) The whole processing looks overcomplicated and not effective. What
> about this instead?
> for (i = 0; i < 12; i++) {
> count wcswidth() of each month and store it in wab_months_width[].
> count max_width_month.
> }
> for (i = 0; i < 12; i++) {
> if ((n = max_width_month - wab_months_width[i]) > 0)
>   call wcscat(wab_months[i], L" ") n times.
> }

Done, along with your optimisation from your next mail
> 
> 5) If there is no %b is strftime() format, there is no sense to spend
> CPU cycles on from populate_abbreviated_month(), so it should be called
> only once inside ls_strftime() on first %b instead of calling it in
> printtime() for all cases.

done

Review updated (if you prefer a diff by mail just tell me, given you do not have
a phabricator account.)

Thanks for your patience! and reviews!

Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Andrey Chernov
On 25.11.2015 16:51, Baptiste Daroussin wrote:
> wcslcat works like strlcat:
> to quote the manpage:
> "It will append at most dstsize - strlen(dst) - 1 characters."
> So here with your version it will be n - wcslen(wab_months[i]) -1
> which won't fit what we want to do.
> 
> btw that makes me figure out that what I want is wcsncat because we do care to
> make sure we have appended the right number of spaces at the end of the
> abbreviated month based on character width, not the on the len of the wide
> string
> 
> so I changed it
> 
> if ((n = max_month_width - wab_months_width[i]) > 0)
> wcsncat(wab_months[i], L" "/* MAX_ABMON_WIDTH */, n);

Sorry, I initially mean wcsncat() functionality here, wcslcat() is
sneaked in due to wrong memorizing.

About width counting and fallback...
Without attempt to nicely truncate data damaged by unknown way the code
will be simpler. Instead all that loop adding width one by one wchar, just:

width = wcswidth(wab_months[i], MAX_ABMON_WIDTH);
if (width == -1) {
max_month_width = -1;
return;
}
wab_months_width[i] = width;

About
/* NULL terminate to simplify padding later */
wab_months[i][j] = L'\0';
You don't need it, mbstowcs() null-terminates result, if there is a room.
BTW, array size looks suspicious:
static wchar_t wab_months[12][MAX_ABMON_WIDTH * 2 * MB_LEN_MAX];
what MB_LEN_MAX doing here? This constant is for multiple-bytes encoded,
not for wide chars.

-- 
http://ache.vniz.net/



signature.asc
Description: OpenPGP digital signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Andrey Chernov
On 25.11.2015 15:53, Baptiste Daroussin wrote:
> What I did for now is set max_month_width to -1 and in ls_strftime I fallback 
> on
> the plain strftime meaning you keep localized information but the alignement 
> is
> broken as of now.

It will be enough.

>> 3) wcwidth/wcswidth may return -1 too, it needs to be checked too.
> done and truncate the name of the month to the latest valid character

I think there is no point for sofisticated truncating, if locale is not
valid. F.e. you may truncate to just 1 char. The thing above will be
enough for all such cases.

> Review updated (if you prefer a diff by mail just tell me, given you do not 
> have
> a phabricator account.)

I don't have phabricator.

This one
if ((n = max_month_width - wab_months_width[i]) > 0) {
wcslcat(wab_months[i], L" "/* MAX_ABMON_WIDTH */,
max_month_width + 1);
}
should be
if ((n = max_month_width - wab_months_width[i]) > 0)
wcslcat(wab_months[i], L" "/* MAX_ABMON_WIDTH */, n);

I.e. you append n spaces and n is the difference between max and current.

-- 
http://ache.vniz.net/



signature.asc
Description: OpenPGP digital signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-25 Thread Baptiste Daroussin
On Wed, Nov 25, 2015 at 05:06:17PM +0300, Andrey Chernov wrote:
> On 25.11.2015 16:51, Baptiste Daroussin wrote:
> > wcslcat works like strlcat:
> > to quote the manpage:
> > "It will append at most dstsize - strlen(dst) - 1 characters."
> > So here with your version it will be n - wcslen(wab_months[i]) -1
> > which won't fit what we want to do.
> > 
> > btw that makes me figure out that what I want is wcsncat because we do care 
> > to
> > make sure we have appended the right number of spaces at the end of the
> > abbreviated month based on character width, not the on the len of the wide
> > string
> > 
> > so I changed it
> > 
> > if ((n = max_month_width - wab_months_width[i]) > 0)
> > wcsncat(wab_months[i], L" "/* MAX_ABMON_WIDTH */, n);
> 
> Sorry, I initially mean wcsncat() functionality here, wcslcat() is
> sneaked in due to wrong memorizing.
> 
> About width counting and fallback...
> Without attempt to nicely truncate data damaged by unknown way the code
> will be simpler. Instead all that loop adding width one by one wchar, just:
> 
> width = wcswidth(wab_months[i], MAX_ABMON_WIDTH);
> if (width == -1) {
> max_month_width = -1;
> return;
> }
> wab_months_width[i] = width;
> 
> About
> /* NULL terminate to simplify padding later */
> wab_months[i][j] = L'\0';
> You don't need it, mbstowcs() null-terminates result, if there is a room.

right if should be only set on the month we do truncate to MAX_ABMON_WIDTH

> BTW, array size looks suspicious:
> static wchar_t wab_months[12][MAX_ABMON_WIDTH * 2 * MB_LEN_MAX];
> what MB_LEN_MAX doing here? This constant is for multiple-bytes encoded,
> not for wide chars.
Bad copy/paste sorry it should be "MAX_ABMON_WIDTH * 2"


I will fix
Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-24 Thread Baptiste Daroussin
On Wed, Nov 25, 2015 at 01:15:13AM +0100, Baptiste Daroussin wrote:
> On Sat, Nov 21, 2015 at 11:57:46PM +0300, Andrey Chernov wrote:
> > On 21.11.2015 15:18, Ed Schouten wrote:
> > > Hi Baptiste,
> > > 
> > > I suppose you should use the wcswidth() function somewhere to compute
> > > the visible width of the month name. Some characters may be
> > > double-width, others may have no effective width at all.
> > > 
> > 
> > I agree. Checking error return of wide chars functions with some
> > fallback will be good too.
> 
> I have updated the code https://reviews.freebsd.org/D4239
> 
> Tested by modifying some locales to add double width and zero width unicode in
> the locales
> 
> Also added the error checking for the return of wide chars functions. For now 
> I
> haven't added fallback, suggestions welcome if needed.
> 
> Best regards,
> Bapt

Actually I can make the fallback on the C locale in case of failure. Would that
work for you?

Best regards,
Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-24 Thread Baptiste Daroussin
On Sat, Nov 21, 2015 at 11:57:46PM +0300, Andrey Chernov wrote:
> On 21.11.2015 15:18, Ed Schouten wrote:
> > Hi Baptiste,
> > 
> > I suppose you should use the wcswidth() function somewhere to compute
> > the visible width of the month name. Some characters may be
> > double-width, others may have no effective width at all.
> > 
> 
> I agree. Checking error return of wide chars functions with some
> fallback will be good too.

I have updated the code https://reviews.freebsd.org/D4239

Tested by modifying some locales to add double width and zero width unicode in
the locales

Also added the error checking for the return of wide chars functions. For now I
haven't added fallback, suggestions welcome if needed.

Best regards,
Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-24 Thread Andrey Chernov
On 25.11.2015 3:15, Baptiste Daroussin wrote:
> On Sat, Nov 21, 2015 at 11:57:46PM +0300, Andrey Chernov wrote:
>> On 21.11.2015 15:18, Ed Schouten wrote:
>>> Hi Baptiste,
>>>
>>> I suppose you should use the wcswidth() function somewhere to compute
>>> the visible width of the month name. Some characters may be
>>> double-width, others may have no effective width at all.
>>>
>>
>> I agree. Checking error return of wide chars functions with some
>> fallback will be good too.
> 
> I have updated the code https://reviews.freebsd.org/D4239
> 
> Tested by modifying some locales to add double width and zero width unicode in
> the locales
> 
> Also added the error checking for the return of wide chars functions. For now 
> I
> haven't added fallback, suggestions welcome if needed.

1) For just 1 char in wcswidth(_months[i][j], 1); it is better to
use another function wcwidth(wab_months[i][j]);

2) By fallback I mean something which not stops ls working with
incorrect for some reason locale, like setting max_width_month to
MAX_ABMON_WIDTH on error return (from
mbstowcs/wcwidth/wcswidth/wcswidth) and exit from
populate_abbreviated_month().

3) wcwidth/wcswidth may return -1 too, it needs to be checked too.

4) The whole processing looks overcomplicated and not effective. What
about this instead?
for (i = 0; i < 12; i++) {
count wcswidth() of each month and store it in wab_months_width[].
count max_width_month.
}
for (i = 0; i < 12; i++) {
if ((n = max_width_month - wab_months_width[i]) > 0)
call wcscat(wab_months[i], L" ") n times.
}

5) If there is no %b is strftime() format, there is no sense to spend
CPU cycles on from populate_abbreviated_month(), so it should be called
only once inside ls_strftime() on first %b instead of calling it in
printtime() for all cases.

-- 
http://ache.vniz.net/



signature.asc
Description: OpenPGP digital signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-24 Thread Andrey Chernov
On 25.11.2015 4:31, Andrey Chernov wrote:
> 4) The whole processing looks overcomplicated and not effective. What
> about this instead?
> for (i = 0; i < 12; i++) {
> count wcswidth() of each month and store it in wab_months_width[].
> count max_width_month.
> }
> for (i = 0; i < 12; i++) {
> if ((n = max_width_month - wab_months_width[i]) > 0)
>   call wcscat(wab_months[i], L" ") n times.
> }

Last line can be optimized further:
wcslcat(wab_months[i], L" "/* MAX_ABMON_WIDTH */, n);

-- 
http://ache.vniz.net/



signature.asc
Description: OpenPGP digital signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-21 Thread Andrey Chernov
On 21.11.2015 15:18, Ed Schouten wrote:
> Hi Baptiste,
> 
> I suppose you should use the wcswidth() function somewhere to compute
> the visible width of the month name. Some characters may be
> double-width, others may have no effective width at all.
> 

I agree. Checking error return of wide chars functions with some
fallback will be good too.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-21 Thread Ed Schouten
Hi Baptiste,

I suppose you should use the wcswidth() function somewhere to compute
the visible width of the month name. Some characters may be
double-width, others may have no effective width at all.

-- 
Ed Schouten 
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-20 Thread Jilles Tjoelker
On Fri, Nov 20, 2015 at 12:02:13PM +0100, Baptiste Daroussin wrote:
> On Fri, Nov 20, 2015 at 11:42:53AM +0100, Baptiste Daroussin wrote:
> > On Fri, Nov 20, 2015 at 11:05:56AM +0300, Sergey V. Dyatko wrote:
> > > subj. http://i.imgur.com/F9QO29l.png
> > > it is on head@r290573:
> > > WTR:
> > > env LC_ALL=uk_UA.UTF-8 ls -la /usr/ports/databases/ or env 
> > > LC_ALL=ru_RU.UTF-8
> > > ls -la /usr/ports/databases/

> > > env LC_ALL=C ls -la /usr/ports/databases/ works fine
> > > also on old stable/10 (r286868)  as I can see 'month' field length 3 
> > > symbols 

> > Thanks for reporting, I can reproduce the issue with some other
> > locales. The thing is there seems to be no standard for abbreviated
> > length. Formerly we had a 3 character lenght for abbreviated month.

> > We now use CLDR which seems to follow the abbreviated rules from IBM:
> > "Each string must be of equal length and contain 5 characters or less"

> > There are 2 possible fixes: either always pad those in the locale
> > definition which seems wrong or modify ls so that it by itself pads
> > properly.

> > Neither posix nor ISO-14652 defines the length of the abbreviated form

> > padding in the locales themself would be wrong so I do propose to
> > pad in the ls command. And padding with 5 characters.

> For the record glibc/linux had the same problem:
> https://sourceware.org/bugzilla/show_bug.cgi?id=9859

> "fixed" in coreutils (gnu ls) the way I propose to do for us
> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=612b647dd16d5abc03b295abe42d8b4a0fe660f7

Coreutils fixed it slightly better: it calculates the maximum width of
the abbreviated month names and pads to that (with a maximum of 5). In
particular, this ensures that the output does not change for locales
that have 3-character abbreviations, such as the POSIX locale. I think
this is valuable.

They also keep the list of month names from this calculation and they
say this speeds up ls noticeably compared to having strftime() expand %b
for every file.
-- 
Jilles Tjoelker
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-20 Thread Baptiste Daroussin
On Fri, Nov 20, 2015 at 11:05:56AM +0300, Sergey V. Dyatko wrote:
> Hi,
> 
> subj. http://i.imgur.com/F9QO29l.png
> it is on head@r290573:
> WTR:
> env LC_ALL=uk_UA.UTF-8 ls -la /usr/ports/databases/ or env LC_ALL=ru_RU.UTF-8
> ls -la /usr/ports/databases/
> 
> env LC_ALL=C ls -la /usr/ports/databases/ works fine
> also on old stable/10 (r286868)  as I can see 'month' field length 3 symbols 
> 
Thanks for reporting, I can reproduce the issue with some other locales. The
thing is there seems to be no standard for abbreviated length. Formerly we had a
3 character lenght for abbreviated month.

We now use CLDR which seems to follow the abbreviated rules from IBM:
"Each string must be of equal length and contain 5 characters or less"

There are 2 possible fixes: either always pad those in the locale definition
which seems wrong or modify ls so that it by itself pads properly.

Neither posix nor ISO-14652 defines the length of the abbreviated form

padding in the locales themself would be wrong so I do propose to pad in the ls
command. And padding with 5 characters.

Best regards,
Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-20 Thread Baptiste Daroussin
On Fri, Nov 20, 2015 at 11:42:53AM +0100, Baptiste Daroussin wrote:
> On Fri, Nov 20, 2015 at 11:05:56AM +0300, Sergey V. Dyatko wrote:
> > Hi,
> > 
> > subj. http://i.imgur.com/F9QO29l.png
> > it is on head@r290573:
> > WTR:
> > env LC_ALL=uk_UA.UTF-8 ls -la /usr/ports/databases/ or env 
> > LC_ALL=ru_RU.UTF-8
> > ls -la /usr/ports/databases/
> > 
> > env LC_ALL=C ls -la /usr/ports/databases/ works fine
> > also on old stable/10 (r286868)  as I can see 'month' field length 3 
> > symbols 
> > 
> Thanks for reporting, I can reproduce the issue with some other locales. The
> thing is there seems to be no standard for abbreviated length. Formerly we 
> had a
> 3 character lenght for abbreviated month.
> 
> We now use CLDR which seems to follow the abbreviated rules from IBM:
> "Each string must be of equal length and contain 5 characters or less"
> 
> There are 2 possible fixes: either always pad those in the locale definition
> which seems wrong or modify ls so that it by itself pads properly.
> 
> Neither posix nor ISO-14652 defines the length of the abbreviated form
> 
> padding in the locales themself would be wrong so I do propose to pad in the 
> ls
> command. And padding with 5 characters.
> 
> Best regards,
> Bapt

For the record glibc/linux had the same problem:
https://sourceware.org/bugzilla/show_bug.cgi?id=9859

"fixed" in coreutils (gnu ls) the way I propose to do for us
http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=612b647dd16d5abc03b295abe42d8b4a0fe660f7

Best regards,
Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-20 Thread Baptiste Daroussin
On Fri, Nov 20, 2015 at 01:23:52PM +0100, Jilles Tjoelker wrote:
> On Fri, Nov 20, 2015 at 12:02:13PM +0100, Baptiste Daroussin wrote:
> > On Fri, Nov 20, 2015 at 11:42:53AM +0100, Baptiste Daroussin wrote:
> > > On Fri, Nov 20, 2015 at 11:05:56AM +0300, Sergey V. Dyatko wrote:
> > > > subj. http://i.imgur.com/F9QO29l.png
> > > > it is on head@r290573:
> > > > WTR:
> > > > env LC_ALL=uk_UA.UTF-8 ls -la /usr/ports/databases/ or env 
> > > > LC_ALL=ru_RU.UTF-8
> > > > ls -la /usr/ports/databases/
> 
> > > > env LC_ALL=C ls -la /usr/ports/databases/ works fine
> > > > also on old stable/10 (r286868)  as I can see 'month' field length 3 
> > > > symbols 
> 
> > > Thanks for reporting, I can reproduce the issue with some other
> > > locales. The thing is there seems to be no standard for abbreviated
> > > length. Formerly we had a 3 character lenght for abbreviated month.
> 
> > > We now use CLDR which seems to follow the abbreviated rules from IBM:
> > > "Each string must be of equal length and contain 5 characters or less"
> 
> > > There are 2 possible fixes: either always pad those in the locale
> > > definition which seems wrong or modify ls so that it by itself pads
> > > properly.
> 
> > > Neither posix nor ISO-14652 defines the length of the abbreviated form
> 
> > > padding in the locales themself would be wrong so I do propose to
> > > pad in the ls command. And padding with 5 characters.
> 
> > For the record glibc/linux had the same problem:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=9859
> 
> > "fixed" in coreutils (gnu ls) the way I propose to do for us
> > http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=612b647dd16d5abc03b295abe42d8b4a0fe660f7
> 
> Coreutils fixed it slightly better: it calculates the maximum width of
> the abbreviated month names and pads to that (with a maximum of 5). In
> particular, this ensures that the output does not change for locales
> that have 3-character abbreviations, such as the POSIX locale. I think
> this is valuable.
> 
> They also keep the list of month names from this calculation and they
> say this speeds up ls noticeably compared to having strftime() expand %b
> for every file.

Here is what I do propose (sorry for the ugly pad_to_col name, if one has better
please share :D

https://reviews.freebsd.org/D4239

Best regards,
Bapt


signature.asc
Description: PGP signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-20 Thread Andrey Chernov
On 21.11.2015 3:35, Baptiste Daroussin wrote:

> Here is what I do propose (sorry for the ugly pad_to_col name, if one has 
> better
> please share :D
> 
> https://reviews.freebsd.org/D4239

The whole function is ugly, not only its name. Please no hardcoded
constants assuming some internal encoding knowledge, they are wrong for
non-UTF-8 encodings in any case, use wide chars instead.

BTW, the same 3 chars restriction is in tar, cpio, pax, lots of ftp
clients, i.e. where 'ls' emulated.

-- 
http://ache.vniz.net/



signature.asc
Description: OpenPGP digital signature


Re: /bin/ls formatting broken for non-C(?) locales

2015-11-20 Thread Baptiste Daroussin
On Sat, Nov 21, 2015 at 03:45:29AM +0300, Andrey Chernov wrote:
> On 21.11.2015 3:35, Baptiste Daroussin wrote:
> 
> > Here is what I do propose (sorry for the ugly pad_to_col name, if one has 
> > better
> > please share :D
> > 
> > https://reviews.freebsd.org/D4239
> 
> The whole function is ugly, not only its name. Please no hardcoded
> constants assuming some internal encoding knowledge, they are wrong for
> non-UTF-8 encodings in any case, use wide chars instead.
> 
> BTW, the same 3 chars restriction is in tar, cpio, pax, lots of ftp
> clients, i.e. where 'ls' emulated.
> 
Updated to use wide char functions. tested with ru_RU.UTF-8 and ru_RU.KOI8-R,
fr_FR.UTF-8, ja_JP.eucJP and some chinese locales.

Concerning tar and cpio, I can probably push the same thing into bsdtar (which
will fix the same issue they also have on linux), ftp is safe, pax is not and
could probably be fixed as well.

FYI on linux only ls has been fixed. gnutar uses a different format for its
output.

Bapt


signature.asc
Description: PGP signature


/bin/ls formatting broken for non-C(?) locales

2015-11-20 Thread Sergey V. Dyatko
Hi,

subj. http://i.imgur.com/F9QO29l.png
it is on head@r290573:
WTR:
env LC_ALL=uk_UA.UTF-8 ls -la /usr/ports/databases/ or env LC_ALL=ru_RU.UTF-8
ls -la /usr/ports/databases/

env LC_ALL=C ls -la /usr/ports/databases/ works fine
also on old stable/10 (r286868)  as I can see 'month' field length 3 symbols 


--
wbr, tiger

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"