Hello Coreutils maintainers!
I've recently spent some time adding multibyte support to the coreutils
text processing tools (sort, uniq, join, tr, cut, paste, expand, unexpand,
fmt, fold, and pr) in this repository:
https://github.com/ericfischer/coreutils-utf8
I haven't tackled cut -bn yet,
Thanks for the feedback. My changes are now in the "multibyte" branch at
https://github.com/ericfischer/coreutils/tree/multibyte
branched from the savannah coreutils repository.
I've moved my lib changes (all multibyte or wide versions of existing
single-byte functions) into a shared file in t
You were right that I needed to pay attention to character widths. My
changes in
https://github.com/ericfischer/coreutils/tree/multibyte
will now handle character widths in all the places where POSIX counts
"column positions" instead of characters.
I have also introduced a "grapheme" abstracti
Thanks for the feedback.
To clear one thing up at the start: I am not Eric Blake, so the earlier cut
-d patch is not mine.
Thanks also for clarifying the license requirements. I will follow up with
Mapbox legal to find out how we can work with this.
Sebastian, I think you may have been testing a
I am now tracking which of Assaf's tests my implementation passes and fails
in https://github.com/ericfischer/coreutils/issues/2. The ones that fail
seem to be because:
* I have not implemented cut -n
* My tr will not remove bytes from the middle of characters
* Linux and MacOS disagree about whet
OK, that seems reasonable, since as far as I know, no one implements the
POSIX notation for constructing multibyte characters out of adjacent octal
escapes anyway, and the standard has already backed off from supporting
them in ranges. I'll have to change mine to leave characters decomposed
until a
Or actually I *won't* necessarily have to change my version of tr, because
the real point of this thread isn't to get my own changes accepted; it's to
get *some* reasonable multibyte implementation of the utilities, regardless
of whose it is, into the standard coreutils distribution.
Eric
only gives me a month to
resolve whatever else needs to be done, so there is some urgency about this
from my perspective. Thanks,
Eric Fischer
Thanks. The paperwork code is [gnu.org #1262124]. The deadline comes from
the copyright assignment document, which "applies to all past and future
works, made by April 30, 2018, of Developer…." I'm not sure quite how that
came to be the agreement between Mapbox and FSF, but that's what I've got
to
Thanks for the list of things to do. Most of them are done now:
* I have added documentation for what has changed in each program.
* The new files have copyright headers now.
* "make check" succeeds.
* "make syntax-check" succeeds, except for a complaint about strftime in
code that I ha
Thanks all. In this case the changes to each program are fairly monolithic
and are all for the single purpose of replacing byte-oriented processing
with character-oriented processing, so bisecting changes will probably not
be very useful. The squashed commits, one per program, are now in a new
bran
I also found wcwidth to be a bad performance bottleneck in my multibyte
branch of coreutils. To fix the problem in my branch, I added a cache of
the widths returned for characters in the range from 0 to UCHAR_MAX (which
perhaps should also be widened to include a few other common alphabets).
The ca
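A minimal sketch of the kind of cache described above, not the code from the branch: it memoizes wcwidth results for code points from 0 to UCHAR_MAX and falls through to wcwidth for everything else. The names `cached_wcwidth`, `width_cache`, and `WIDTH_UNSET` are hypothetical, and the sketch assumes the locale does not change after the first lookup.

```c
#define _XOPEN_SOURCE 700   /* needed for wcwidth on some systems */
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical names for illustration; not taken from the branch.  */
#define WIDTH_UNSET SCHAR_MIN   /* sentinel: wcwidth never returns this */

static signed char width_cache[UCHAR_MAX + 1];
static int width_cache_ready;

/* Return wcwidth (WC), remembering results for 0..UCHAR_MAX.
   Assumes the locale stays fixed once the cache is in use.  */
static int
cached_wcwidth (wchar_t wc)
{
  if (!width_cache_ready)
    {
      for (size_t i = 0; i <= UCHAR_MAX; i++)
        width_cache[i] = WIDTH_UNSET;
      width_cache_ready = 1;
    }

  /* Outside the cached range (including negative values when
     wchar_t is signed): ask wcwidth directly.  */
  if ((unsigned long) wc > UCHAR_MAX)
    return wcwidth (wc);

  if (width_cache[wc] == WIDTH_UNSET)
    width_cache[wc] = (signed char) wcwidth (wc);
  return width_cache[wc];
}
```

Widening the cache to other common alphabets would only mean enlarging the array and the range check; whether that pays off depends on the input.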
I should also add that the core reason that wc is slow and Python is fast
is not that UTF-8 decoding in wc is slow; it is that the Python code is
just counting characters, while wc is also maintaining a line width
for --max-line-length. It doesn't really need to do this, and probably
shouldn't do t
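To illustrate the distinction drawn above, here is a rough sketch, not the actual wc source, of a counting loop that only computes display widths when the caller asks for a maximum line length. The names `count_text`, `struct counts`, and `want_width` are hypothetical, and invalid or incomplete byte sequences are simply skipped one byte at a time, which is an assumption rather than wc's documented behavior.

```c
#define _XOPEN_SOURCE 700   /* needed for wcwidth on some systems */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type for illustration.  */
struct counts
{
  size_t chars;       /* characters decoded */
  size_t max_width;   /* widest line in columns; 0 if not requested */
};

/* Count characters in BUF[0..LEN) using the current locale's encoding.
   Width tracking, the expensive part, runs only if WANT_WIDTH.  */
static struct counts
count_text (const char *buf, size_t len, bool want_width)
{
  struct counts c = { 0, 0 };
  size_t line_width = 0;
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (len > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, buf, len, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or incomplete sequence: skip one byte, uncounted.  */
          memset (&state, 0, sizeof state);
          buf++;
          len--;
          continue;
        }
      if (n == 0)
        n = 1;   /* an embedded NUL still consumes one byte */

      c.chars++;
      if (want_width)
        {
          if (wc == L'\n')
            line_width = 0;
          else
            {
              int w = wcwidth (wc);
              if (w > 0)
                line_width += (size_t) w;
              if (line_width > c.max_width)
                c.max_width = line_width;
            }
        }
      buf += n;
      len -= n;
    }
  return c;
}
```

With `want_width` false, the loop never calls wcwidth at all, which is the case that corresponds to plain character counting.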
On Thu, May 17, 2018 at 6:04 PM, Kaz Kylheku (Coreutils) <
962-396-1...@kylheku.com> wrote:
> What are the requirements underpinning "wc -m", and how do these
> iswprint and iswspace functions fit into it?
>
…
> Nowhere does POSIX say that the display width of a character
> has to be obtained in "w
On Thu, May 17, 2018 at 5:54 PM, Kaz Kylheku (Coreutils) <
962-396-1...@kylheku.com> wrote:
> In what situation are there printable characters in the range [0,
> UCHAR_MAX) that
> have a width > 1?
I agree that it is unlikely, but POSIX doesn't specify anything about the
width of particular charac
Thank you!
On my MacOS X system, wc appears to be calling uc_width, so I think it is
the replacement and not the system wcwidth that is the slow path.
Eric
For whatever it's worth, the system wcwidth seems to be much faster on my
MacOS X system (10.11.6) than the replacement wcwidth. Using the same
benchmark as above, it takes about 0.9 seconds with the replacement wcwidth:
$ yes | head -n10 > mbc.txt
$ yes 123456789012345678
I fixed this in
https://github.com/ericfischer/coreutils/commit/093e08f91318889d7159fa8ce6afa74650b66ea3
but it and the rest of my multibyte fixes have been sitting unmerged for a year.
Eric
On Tue, Mar 26, 2019 at 7:23 AM Tim Rühsen wrote:
> Hi,
>
> was just trying to "grep saved *.log|cut -d‘
I will reopen the can of worms of again offering my own multibyte cut (and
other coreutils) if the maintainers ever decide they want it:
https://github.com/ericfischer/coreutils/blob/multibyte-squash/src/cut.c
I think the normalization ambiguity here is resolved by the POSIX
standard's distinct
Unfortunately, multibyte collation is simply unimplemented in MacOS X, so
there is no alternate locale definition that will fix it. As far as I can
tell this is documented only in the BUGS section of `man wcscoll`:
BUGS
The current implementation of wcscoll() only works in single-byte
LC
This and other utf-8 bugs are fixed in
https://github.com/ericfischer/coreutils/tree/multibyte-squash if anyone
ever wants to accept the patch.
Eric
On Fri, Nov 13, 2020 at 6:48 AM ✓ Paul Courbis de Bridiers de Villemor <
p...@courbis.fr> wrote:
> Hi
>
> I'm using expand to get formatted output.
A couple of years ago I went down this route of thinking I would add CSV
support to sort, and then let myself get distracted into trying to follow
https://paulfitz.github.io/2017/01/24/the-year-of-poop-on-the-desktop.html
instead. The problem with that is that coreutils doesn't work with
multibyte