Pádraig Brady wrote:
There have been some interesting counting UTF-8 strings threads
over at reddit lately, all referenced from this article:
http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
Bruno Haible <bruno@clisp.org> wrote:
http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
But before these techniques can be used in practice in packages such as
coreutils, two problems would have to be solved satisfactorily:
1) George Pollard makes the assumption that
___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
Pádraig Brady wrote:
There have been some interesting counting UTF-8 strings threads
over at reddit lately, all referenced from this article:
http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
Thanks for the pointer!
Interesting, indeed.
Pádraig Brady wrote:
mbstowcs doesn't canonicalize equivalent multibyte sequences,
and therefore behaves the same in this regard as our
processing of each wide character separately.
This could be considered a bug, actually: should -m give
the number of wide chars, or the number of
Is there a good library for combining-character canonicalization
available? That seems like something that would be useful to have in a
lot of text-processing tools. Also, for Unicode, something to shuffle
between the normalization forms might be helpful for comparisons.
Such functionality
@@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
linepos += width;
if (iswspace (wide_char))
goto mb_word_separator;
+ else if (uc_combining_class (wide_char)
Bruno Haible wrote:
If you want wc to count characters after canonicalization, then you can
invent a new wc command-line option for it. But I would find it more useful
to have a filter program that reads from standard input and writes the
canonicalized output to standard output; that would be
Bruno Haible wrote:
As a consequence:
- The number of characters is the same as the number of wide characters.
- wc -m must output the number of characters.
- In a Unicode locale, U+00E9 is one character, and U+0065 U+0301 is
two characters,
Fair enough.
$ time ./wc -m long_lines.txt
13357046 long_lines.txt
real    0m1.860s
It processes about 7 million characters per second. I would not call
this horrible performance.
However wc calls mbrtowc() for each multibyte character.
Yes. One could use mbstowcs (or mbsnrtowcs, but that
Bruno Haible wrote:
2008-05-08  Bruno Haible  <bruno@clisp.org>
Speed up wc -m and wc -w in multibyte case.
* src/wc.c: Include mbchar.h.
(wc): New variable in_shift. Use it to avoid calling mbrtowc for most
ASCII characters.
Thanks!
I've applied
Jan Engelhardt wrote:
https://bugzilla.novell.com/show_bug.cgi?id=381873
Forwarding this because it is a GNU issue, not specifically a Novell one.
I reproduced this myself with the latest coreutils from git
(BTW: You might want to repack that repo, counting objects during the
clone was rather slow in the initial
Pádraig Brady wrote:
canonically équivalent
canonically équivalent
Pádraig.
p.s. I notice that gnome-terminal still doesn't handle
combining characters correctly, and my mail client thunderbird
is putting the accent on the q rather than the e, sigh.
They both render correctly here
On Wednesday 2008-05-07 13:11, Pádraig Brady wrote:
Now that is a _lot_ of extra time. libiconv could probably be
made more efficient. I've never actually looked at it.
However wc calls mbrtowc() for each multibyte character.
It would probably be a lot more efficient to use mbstowcs()
to convert
Jim Meyering wrote:
Bo Borgerson wrote:
I may be misinterpreting your patch, but it seems to me that
decrementing count for zero-width characters could potentially lead to
confusion. Not all zero-width characters are combining characters, right?
It looks ok to me, since