Re: horrible utf-8 performace in wc

2008-06-06 Thread Bruno Haible
Pádraig Brady wrote: There have been some interesting counting UTF-8 strings threads over at reddit lately, all referenced from this article: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html But before these techniques can be used in practice in packages such as coreutils,

Re: horrible utf-8 performace in wc

2008-06-06 Thread Eric Blake
Bruno Haible <bruno at clisp.org> writes: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html But before these techniques can be used in practice in packages such as coreutils, two problems would have to be solved satisfactorily: 1) George Pollard makes the assumption that

Re: horrible utf-8 performace in wc

2008-06-05 Thread Pádraig Brady
There have been some interesting counting UTF-8 strings threads over at reddit lately, all referenced from this article: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html Pádraig.
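For readers who have not followed the linked discussion, the basic idea behind counting UTF-8 code points without a full decoder can be sketched as below. This is a minimal illustration of one common approach (count every byte that is not a continuation byte of the form 10xxxxxx), not the article's optimized code and not what wc does. The function name utf8_count_code_points is hypothetical, and the sketch assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a buffer that is assumed to
   hold valid UTF-8.  Every code point starts with exactly one byte that
   is NOT of the form 10xxxxxx, so counting those bytes counts the code
   points.  Invalid input is not detected. */
size_t utf8_count_code_points(const char *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        /* 0xC0 masks the two top bits; 0x80 means "continuation byte". */
        if (((unsigned char) s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Because it never validates the input and ignores the locale entirely, a loop like this only applies when the data is known to be UTF-8.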

Re: horrible utf-8 performace in wc

2008-06-05 Thread Jim Meyering
Pádraig Brady wrote: There have been some interesting counting UTF-8 strings threads over at reddit lately, all referenced from this article: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html Thanks for the pointer! Interesting, indeed.

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
Pádraig Brady wrote: mbstowcs doesn't canonicalize equivalent multibyte sequences, and so therefore functions the same in this regard as our processing of each wide character separately. This could be considered a bug actually- i.e. should -m give the number of wide chars, or the number of

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
Is there a good library for combining-character canonicalization available? That seems like something that would be useful to have in a lot of text-processing tools. Also, for Unicode, something to shuffle between the normalization forms might be helpful for comparisons. Such functionality

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
@@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
 linepos += width;
 if (iswspace (wide_char))
   goto mb_word_separator;
+ else if (uc_combining_class (wide_char)

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bo Borgerson
Bruno Haible wrote: If you want wc to count characters after canonicalization, then you can invent a new wc command-line option for it. But I would find it more useful to have a filter program that reads from standard input and writes the canonicalized output to standard output; that would be

Re: horrible utf-8 performace in wc

2008-05-08 Thread Pádraig Brady
Bruno Haible wrote: As a consequence: - The number of characters is the same as the number of wide characters. - wc -m must output the number of characters. - In a Unicode locale, U+00E9 is one character, and U+0065 U+0301 is two characters, Fair enough. If you want wc to count
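The point about precomposed versus decomposed forms can be checked with a small sketch that asks the C library how many wide characters a multibyte string converts to. It assumes a UTF-8 locale is installed under one of the names tried below and that wchar_t can hold any Unicode code point (true on glibc). Both helper names are hypothetical, and this is an illustration rather than code from wc.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: try a few common UTF-8 locale names.  Which names
   exist depends on the host, so this list is an assumption.
   Returns 1 on success, 0 if none could be selected. */
int select_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: number of wide characters in the multibyte string
   s under the current locale, or (size_t) -1 if s is not valid there.
   With a null destination, mbstowcs only computes the length and performs
   no canonicalization. */
size_t count_wide_chars(const char *s)
{
    assert(s != NULL);
    return mbstowcs(NULL, s, 0);
}
```

After a successful select_utf8_locale(), count_wide_chars("\xC3\xA9") (U+00E9) gives 1 and count_wide_chars("e\xCC\x81") (U+0065 U+0301) gives 2, although both render as the same letter.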

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
$ time ./wc -m long_lines.txt 13357046 long_lines.txt real 0m1.860s It processes at the speed of 7 million characters per second. I would not call this a horrible performance. However wc calls mbrtowc() for each multibyte character. Yes. One could use mbstowcs (or mbsnrtowcs, but that

Re: horrible utf-8 performace in wc

2008-05-08 Thread Jim Meyering
Bruno Haible wrote: 2008-05-08 Bruno Haible Speed up wc -m and wc -w in multibyte case. * src/wc.c: Include mbchar.h. (wc): New variable in_shift. Use it to avoid calling mbrtowc for most ASCII characters. Thanks! I've applied
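The idea in the ChangeLog entry, skipping mbrtowc for plain ASCII bytes while no multibyte sequence is in progress, can be sketched roughly as follows. This is not the applied patch: the function name count_chars_fast_ascii is hypothetical, and the byte < 0x80 test is a simplifying assumption that holds only for ASCII-compatible encodings such as UTF-8, whereas the real change relies on mbchar.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical sketch: count characters in buf[0..len) under the current
   locale.  Returns (size_t) -1 on an invalid or truncated sequence. */
size_t count_chars_fast_ascii(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char c = (unsigned char) buf[i];

        /* Fast path: an ASCII byte seen in the initial state is a whole
           character, so no library call is needed. */
        if (c < 0x80 && mbsinit(&state)) {
            count++;
            i++;
            continue;
        }

        /* Slow path: let the C library decode one multibyte character. */
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf + i, len - i, &state);
        if (n == (size_t) -1 || n == (size_t) -2)
            return (size_t) -1;
        if (n == 0)
            n = 1;                     /* a NUL character occupies one byte */
        count++;
        i += n;
    }
    return count;
}
```

For mostly-ASCII text the loop rarely reaches mbrtowc, which is where the speedup described in the entry would come from. The mbsinit check plays the role the entry gives to in_shift.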

Re: horrible utf-8 performace in wc

2008-05-07 Thread Pádraig Brady
Jan Engelhardt wrote: https://bugzilla.novell.com/show_bug.cgi?id=381873 Forwarding this because it is a GNU issue, not specifically a Novell one. I reproduced this myself with the latest coreutils from git (BTW: You might want to repack that repo, counting objects during the clone was

Re: horrible utf-8 performace in wc

2008-05-07 Thread Bo Borgerson
Pádraig Brady wrote: canonically équivalent canonically équivalent Pádraig. P.S. I notice that gnome-terminal still doesn't handle combining characters correctly, and my mail client thunderbird is putting the accent on the q rather than the e, sigh. They both render correctly here

Re: horrible utf-8 performace in wc

2008-05-07 Thread Jan Engelhardt
On Wednesday 2008-05-07 13:11, Pádraig Brady wrote: Now that is a _lot_ of extra time. libiconv could probably be made more efficient. I've never actually looked at it. However wc calls mbrtowc() for each multibyte character. It would probably be a lot more efficient to use mbstowcs() to convert

Re: horrible utf-8 performace in wc

2008-05-07 Thread Jim Meyering
Pádraig Brady wrote: Jan Engelhardt wrote: https://bugzilla.novell.com/show_bug.cgi?id=381873 Forwarding this because it is a GNU issue, not specifically a Novell one. I reproduced this myself with the latest coreutils from git (BTW: You might want to repack that repo,

Re: horrible utf-8 performace in wc

2008-05-07 Thread Bo Borgerson
Jim Meyering wrote: Bo Borgerson wrote: I may be misinterpreting your patch, but it seems to me that decrementing count for zero-width characters could potentially lead to confusion. Not all zero-width characters are combining characters, right? It looks ok to me, since

Re: horrible utf-8 performace in wc

2008-05-07 Thread Pádraig Brady
Bo Borgerson wrote: Pádraig Brady wrote: canonically équivalent canonically équivalent Pádraig. P.S. I notice that gnome-terminal still doesn't handle combining characters correctly, and my mail client thunderbird is putting the accent on the q rather than the e, sigh. They both

Re: horrible utf-8 performace in wc

2008-05-07 Thread Pádraig Brady
Bo Borgerson wrote: Jim Meyering wrote: Bo Borgerson wrote: I may be misinterpreting your patch, but it seems to me that decrementing count for zero-width characters could potentially lead to confusion. Not all zero-width characters are combining characters, right? It

horrible utf-8 performace in wc

2008-05-06 Thread Jan Engelhardt
https://bugzilla.novell.com/show_bug.cgi?id=381873 Forwarding this because it is a GNU issue, not specifically a Novell one. I reproduced this myself with the latest coreutils from git (BTW: You might want to repack that repo, counting objects during the clone was rather slow in the initial