bug#34524: wc: word count incorrect when words separated only by no-break space

2019-03-09 Thread Pádraig Brady
On 09/03/19 05:52, Bruno Haible wrote:
> Hi Pádraig,
> In regard to options for enabling various behaviors for wc(1), I'm thinking we might keep the strict POSIX isspace() behavior with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace() by default
>
> Since you plan to

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-03-09 Thread Bruno Haible
Hi Pádraig,
> >> In regard to options for enabling various behaviors for wc(1),
> >> I'm thinking we might keep the strict POSIX isspace() behavior
> >> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
> >> by default
Since you plan to add a --words=... option in the future (as

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-25 Thread Pádraig Brady
On 24/02/19 19:55, Pádraig Brady wrote:
> On 24/02/19 17:07, Pádraig Brady wrote:
>> So non break space is generally considered a word delimiter,
>> though there are complications you detail from unicode.
>>
>> In regard to options for enabling various behaviors for wc(1),
>> I'm thinking we might

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-24 Thread Pádraig Brady
On 24/02/19 17:07, Pádraig Brady wrote:
> So non break space is generally considered a word delimiter,
> though there are complications you detail from unicode.
>
> In regard to options for enabling various behaviors for wc(1),
> I'm thinking we might keep the strict POSIX isspace() behavior
>

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-24 Thread Pádraig Brady
On 24/02/19 05:58, Bruno Haible wrote:
> [Ccing bug-libunistring, because this is about Unicode handling in GNU. The
> original thread is in .]
>
>>> The man page for wc states: "A word is a... sequence of characters
>>> delimited by white

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-24 Thread Paul Eggert
Bruno Haible wrote:
> I would find it best to introduce an option '--unicode' to 'wc', that would produce Unicode compliant results, at the cost of
> - not following POSIX to the letter,
It'd make sense to have an option. How about a more-general option --words, that would let the user define

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-24 Thread Bruno Haible
[Ccing bug-libunistring, because this is about Unicode handling in GNU. The original thread is in .]
> > The man page for wc states: "A word is a... sequence of characters
> > delimited by white space."
> >
> > But its concept of white space

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-23 Thread Pádraig Brady
On 18/02/19 00:12, vampyre...@gmail.com wrote:
> $ wc --version
> wc (GNU coreutils) 8.29
> Packaged by Gentoo (8.29-r1 (p1.0))
>
> The man page for wc states: "A word is a... sequence of characters delimited
> by white space."
>
> But its concept of white space only seems to include ASCII

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-22 Thread Bob Proulx
vampyre...@gmail.com wrote:
> The man page for wc states: "A word is a... sequence of characters delimited
> by white space."
>
> But its concept of white space only seems to include ASCII white
> space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.
Indeed this is because wc and

bug#34524: wc: word count incorrect when words separated only by no-break space

2019-02-18 Thread vampyrebat
$ wc --version
wc (GNU coreutils) 8.29
Packaged by Gentoo (8.29-r1 (p1.0))
The man page for wc states: "A word is a... sequence of characters delimited by white space."
But its concept of white space only seems to include ASCII white space. U+00A0 NO-BREAK SPACE, for instance, is not
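
For readers skimming the thread, the reported behavior can be illustrated with a small sketch. This is not the coreutils code: the function name count_words and both delimiter sets are hypothetical, chosen only to contrast an ASCII-only notion of white space with one that also treats U+00A0 NO-BREAK SPACE as a word delimiter. Which other code points, if any, should count as delimiters is exactly what the thread debates, so the second set here is an assumption.

```python
# Hypothetical sketch of the counting difference discussed in this thread.
# It does not reproduce wc's actual implementation or locale handling.

# ASCII white space, roughly what isspace() accepts in the C locale.
ASCII_SPACE = set(" \t\n\v\f\r")

# Illustrative assumption: only U+00A0 is added, since it is the one
# character named in the original report.
NO_BREAK_SPACE = {"\u00a0"}


def count_words(text, delimiters):
    """Count maximal runs of characters that are not in `delimiters`."""
    count = 0
    in_word = False
    for ch in text:
        if ch in delimiters:
            in_word = False
        elif not in_word:
            # First non-delimiter character after a delimiter starts a word.
            in_word = True
            count += 1
    return count


sample = "one\u00a0two\u00a0three"

strict = count_words(sample, ASCII_SPACE)                    # ASCII-only view
relaxed = count_words(sample, ASCII_SPACE | NO_BREAK_SPACE)  # NBSP delimits

print(strict, relaxed)  # prints "1 3"
```

With the ASCII-only set, the three words joined by no-break spaces are counted as one word, which matches the behavior the report describes; adding U+00A0 to the delimiters gives three.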