bug#20751: wc -m doesn't count UTF-8 characters properly

2015-06-07 Thread Valdis Vītoliņš
Thanks for the clarification! I tested it with a Bash script:
    chars=$(wc -m mylog | cut -d ' ' -f1)
    lines=$(wc -l mylog | cut -d ' ' -f1)
    let chars=$chars-$lines
    echo $chars
and got the same number as given by vim's :%s/.//gn (which is where my confusion came from). Hopefully this bug description
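The subtraction described in the message above can be sketched as a small shell function. This is only an illustration: the function name chars_without_newlines is hypothetical, and the sketch assumes that wc -m counts each newline as one character, so subtracting the line count leaves the characters that vim's per-line match would see. It reads the file on standard input so that wc prints only the number, which makes the cut step unnecessary.

```shell
# Hypothetical helper (the name is not from the thread): print the number
# of characters in a file, not counting its newline characters.
chars_without_newlines() {
    file=$1
    # Reading from stdin makes wc print only the count, without the file name.
    chars=$(wc -m < "$file")
    lines=$(wc -l < "$file")
    # Each line terminator was counted once by wc -m, so subtract them.
    echo $((chars - lines))
}
```

It could be called as chars_without_newlines mylog. The result of wc -m depends on the current locale, so for non-ASCII input a UTF-8 locale is assumed.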

bug#20751: wc -m doesn't count UTF-8 characters properly

2015-06-07 Thread Stephane Chazelas
2015-06-06 21:49:16 +0300, Valdis Vītoliņš: Note that UTF-8 characters can be counted by counting bytes with bit patterns 0xxx or 11xx: https://en.wikipedia.org/wiki/UTF-8#Description So the general logic should be that if: a) locale setting is utf-8 (e.g. LANG=xx_XX.UTF-8), or b)

bug#20751: wc -m doesn't count UTF-8 characters properly

2015-06-06 Thread Valdis Vītoliņš
Note that UTF-8 characters can be counted by counting bytes with bit patterns 0xxx or 11xx: https://en.wikipedia.org/wiki/UTF-8#Description So the general logic should be that if: a) locale setting is utf-8 (e.g. LANG=xx_XX.UTF-8), or b) first two bytes of file are 0xFE 0xFF
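The byte-pattern idea in the message above can be sketched in shell. Counting bytes that start with 0 or 11 is the same as counting every byte that is not a continuation byte (pattern 10xxxxxx, that is 0x80 to 0xBF). The function name count_utf8_chars is hypothetical, and the sketch assumes the input is valid UTF-8. It does no validation and is not a description of how wc itself works.

```shell
# Hypothetical helper (the name is not from the thread): count UTF-8
# code points in a file by discarding continuation bytes.
count_utf8_chars() {
    # LC_ALL=C makes tr operate on raw bytes. Octal 200-277 is 0x80-0xBF,
    # the range of UTF-8 continuation bytes (bit pattern 10xxxxxx).
    # The arithmetic expansion strips any padding that wc may print.
    echo $(( $(LC_ALL=C tr -d '\200-\277' < "$1" | wc -c) ))
}
```

Because the locale is forced to C, the result does not depend on the caller's LANG setting. Newlines are counted as characters, and the count is of code points, so a base letter followed by a combining mark counts as two.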

bug#20751: wc -m doesn't count UTF-8 characters properly

2015-06-06 Thread Pádraig Brady
tag 20751 notabug
close 20751
stop
On 06/06/15 19:49, Valdis Vītoliņš wrote: Version: wc (GNU coreutils) 8.21 When 'wc -m' is invoked, it should print the character count, but it counts UTF-8 encoded characters incorrectly. Attached files have 3, 4 and 6 bytes in them, but all have only two UTF-8

bug#20751: wc -m doesn't count UTF-8 characters properly

2015-06-06 Thread Glenn Morris
You mailed submit@debbugs without specifying a Package:, so your bug report ended up on the help-debbugs list. I have reassigned it to coreutils. (Please note there is no wc package.) (My mailer is messing up the UTF-8 characters in your report. Interested parties can see the original at