Re: Support for CSV file format on sort

2021-01-30 Thread Eric Fischer
A couple of years ago I went down this route of thinking I would add CSV support to sort, and then let myself get distracted into trying to follow https://paulfitz.github.io/2017/01/24/the-year-of-poop-on-the-desktop.html instead. The problem with that is that coreutils doesn't work with multibyte

Re: Bug in expand ?

2020-11-13 Thread Eric Fischer
This and other UTF-8 bugs are fixed in https://github.com/ericfischer/coreutils/tree/multibyte-squash if anyone ever wants to accept the patch.

Eric

On Fri, Nov 13, 2020 at 6:48 AM ✓ Paul Courbis de Bridiers de Villemor <p...@courbis.fr> wrote:
> Hi
>
> I'm using expand to get formatted

Re: How to sort unicode properly?

2019-09-25 Thread Eric Fischer
Unfortunately, multibyte collation is simply unimplemented in MacOS X, so there is no alternate locale definition that will fix it. As far as I can tell this is documented only in the BUGS section of `man wcscoll`:

BUGS
The current implementation of wcscoll() only works in single-byte

Re: [Implemented] [coreutils] Partial UTF-8 support for "cut -c"

2019-08-12 Thread Eric Fischer
I will reopen the can of worms of again offering my own multibyte cut (and other coreutils) if the maintainers ever decide they want it: https://github.com/ericfischer/coreutils/blob/multibyte-squash/src/cut.c I think the normalization ambiguity here is resolved by the POSIX standard's

Re: cut -d fails when using a multi-byte delimiter

2019-03-26 Thread Eric Fischer
I fixed this in https://github.com/ericfischer/coreutils/commit/093e08f91318889d7159fa8ce6afa74650b66ea3 but it and the rest of my multibyte fixes have been sitting unmerged for a year.

Eric

On Tue, Mar 26, 2019 at 7:23 AM Tim Rühsen wrote:
> Hi,
>
> was just trying to "grep saved *.log|cut

Re: performance bug of `wc -m`

2018-05-18 Thread Eric Fischer
For whatever it's worth, the system wcwidth seems to be much faster on my MacOS X system (10.11.6) than the replacement wcwidth. Using the same benchmark as above, it takes about 0.9 seconds with the replacement wcwidth:

$ yes | head -n10 > mbc.txt
$ yes

Re: performance bug of `wc -m`

2018-05-18 Thread Eric Fischer
Thank you! On my MacOS X system, wc appears to be calling uc_width, so I think it is the replacement and not the system wcwidth that is the slow path. Eric

Re: performance bug of `wc -m`

2018-05-17 Thread Eric Fischer
On Thu, May 17, 2018 at 5:54 PM, Kaz Kylheku (Coreutils) <962-396-1...@kylheku.com> wrote:
> In what situation are there printable characters in the range [0, UCHAR_MAX) that
> have a width > 1?

I agree that it is unlikely, but POSIX doesn't specify anything about the width of particular

Re: performance bug of `wc -m`

2018-05-17 Thread Eric Fischer
On Thu, May 17, 2018 at 6:04 PM, Kaz Kylheku (Coreutils) <962-396-1...@kylheku.com> wrote:
> What are the requirements underpinning "wc -m", and how do these
> iswprint and iswspace functions fit into it?
> …
> Nowhere does POSIX say that the display width of a character
> has to be obtained in

Re: performance bug of `wc -m`

2018-05-16 Thread Eric Fischer
I should also add that the core reason that wc is slow and Python is fast is not that UTF-8 decoding in wc is slow, it is that the Python code is just counting characters, while wc is also maintaining a line width for --max-line-length. It doesn't really need to do this, and probably shouldn't do
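
To illustrate the distinction drawn above, here is a minimal sketch of what "just counting characters" can look like when no line width is tracked. It is not the code in wc or in the Python benchmark; it assumes the input is valid UTF-8, and the function name `count_utf8_chars` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: count characters in a buffer assumed to hold
   valid UTF-8.  Every character contributes exactly one byte that is
   not a continuation byte (10xxxxxx), so counting those bytes counts
   characters.  No display width is computed at all. */
size_t count_utf8_chars(const char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        /* Continuation bytes have their top two bits set to 10. */
        if (((unsigned char) buf[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

A loop like this never calls wcwidth, which is the work that tracking a maximum line length would add on top.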

Re: performance bug of `wc -m`

2018-05-16 Thread Eric Fischer
I also found wcwidth to be a bad performance bottleneck in my multibyte branch of coreutils. To fix the problem in my branch, I added a cache of the widths returned for characters in the range from 0 to UCHAR_MAX (which perhaps should also be widened to include a few other common alphabets). The
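
A cache of this kind might look roughly like the sketch below. This is not the code from the multibyte branch; the names `cached_wcwidth`, `width_cache`, and `width_known` are hypothetical, and the sketch assumes the locale does not change after the first lookup (otherwise the table would need to be reset).

```c
#define _XOPEN_SOURCE 700 /* exposes wcwidth() in <wchar.h> on glibc */
#include <limits.h>
#include <wchar.h>

/* Hypothetical lazy cache of wcwidth() results for code points
   0..UCHAR_MAX.  Assumes the locale stays fixed once lookups begin. */
static signed char width_cache[UCHAR_MAX + 1];
static unsigned char width_known[UCHAR_MAX + 1];

int cached_wcwidth(wchar_t wc)
{
    /* Casting to unsigned sends negative values (where wchar_t is
       signed) past UCHAR_MAX, so they take the uncached path. */
    unsigned long c = (unsigned long) wc;

    if (c > UCHAR_MAX)
        return wcwidth(wc);          /* outside the cached range */

    if (!width_known[c]) {
        /* wcwidth() results are small (-1 for non-printable, else a
           column count), so a signed char is assumed to hold them. */
        width_cache[c] = (signed char) wcwidth(wc);
        width_known[c] = 1;
    }
    return width_cache[c];
}
```

Widening the cached range to other common alphabets, as suggested above, would mean a larger table or a different lookup structure; the sketch only covers the single-byte range.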

bug#31033: [PATCH] Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand

2018-04-02 Thread Eric Fischer
or options properly. I propose the changes in https://github.com/ericfischer/coreutils/compare/multibyte-squash to convert sort, uniq, join, tr, cut, paste, expand, and unexpand to process characters instead of bytes, allowing them to work correctly on non-ASCII text, as specified by POSIX. Eric

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-03-28 Thread Eric Fischer
Thanks all. In this case the changes to each program are fairly monolithic and are all for the single purpose of replacing byte-oriented processing with character-oriented processing, so bisecting changes will probably not be very useful. The squashed commits, one per program, are now in a new

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-03-26 Thread Eric Fischer
Thanks for the list of things to do. Most of them are done now:
* I have added documentation for what has changed in each program.
* The new files have copyright headers now.
* "make check" succeeds.
* "make syntax-check" succeeds, except for a complaint about strftime in code that I

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-03-20 Thread Eric Fischer
Thanks. The paperwork code is [gnu.org #1262124]. The deadline comes from the copyright assignment document, which "applies to all past and future works, made by April 30, 2018, of Developer…." I'm not sure quite how that came to be the agreement between Mapbox and FSF, but that's what I've got to

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-03-17 Thread Eric Fischer
to resolve whatever else needs to be done, so there is some urgency about this from my perspective. Thanks, Eric Fischer

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-01-17 Thread Eric Fischer
Or actually I *won't* necessarily have to change my version of tr, because the real point of this thread isn't to get my own changes accepted, it's to get *some* reasonable multibyte implementation of the utilities, regardless of whose it is, into the standard coreutils distribution. Eric

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-01-17 Thread Eric Fischer
OK, that seems reasonable, since as far as I know, no one implements the POSIX notation for constructing multibyte characters out of adjacent octal escapes anyway, and the standard has already backed off from supporting them in ranges. I'll have to change mine to leave characters decomposed until

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-01-17 Thread Eric Fischer
I am now tracking which of Assaf's tests my implementation passes and fails in https://github.com/ericfischer/coreutils/issues/2. The ones that fail seem to be because:
* I have not implemented cut -n
* My tr will not remove bytes from the middle of characters
* Linux and MacOS disagree about

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-01-17 Thread Eric Fischer
Thanks for the feedback. To clear one thing up at the start: I am not Eric Blake, so the earlier cut -d patch is not mine. Thanks also for clarifying the license requirements. I will follow up with Mapbox legal to find out how we can work with this. Sebastian, I think you may have been testing

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2017-12-30 Thread Eric Fischer
Thanks for the feedback. My changes are now in the "multibyte" branch at https://github.com/ericfischer/coreutils/tree/multibyte branched from the savannah coreutils repository. I've moved my lib changes (all multibyte or wide versions of existing single-byte functions) into a shared file in

Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2017-12-29 Thread Eric Fischer
Hello Coreutils maintainers! I've recently spent some time adding multibyte support to the coreutils text processing tools (sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr) in this repository: https://github.com/ericfischer/coreutils-utf8 I haven't tackled cut -bn yet,