Hi David,
> > Sunglasses have a width of 1 here, that's why David and I don't see
> > the problem.
>
> I'm surprised that I didn't see the same behavior as Norm, because we
> use the same locale, en_US.utf8. Any idea why?
I'm en_GB.utf8, but I don't see it either. It comes down to the
wcwidth(3) answer for a codepoint, and as Unicode continues to add POO
WITH SUNGLASSES and friends, the answers change with the version of
one's system's database rather than with the locale's name.
$ test/getcwidth --ctype | grep 1f576
1f576 1 -pg--@--
So here it's `print', `graph', and `punct', with a width of 1. Norm's
gang have a width of -1 as they haven't the foggiest what it is.
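For anyone without test/getcwidth to hand, here is a rough Python
sketch of the same idea. It consults the Unicode tables bundled with
the interpreter, not the C library's locale data that wcwidth(3)
reads, so it only illustrates how the answer tracks the database
version. `is_known' is a made-up helper name, and "has a name" is my
stand-in for "the database knows the rune".

```python
import unicodedata

def is_known(codepoint):
    """True if this interpreter's Unicode tables have a name for the codepoint."""
    # Unassigned codepoints have no name, so the lookup falls back to None.
    # Assumption: "has a name" approximates "the database knows it";
    # control characters have no name either, so this is only a sketch.
    return unicodedata.name(chr(codepoint), None) is not None

if __name__ == "__main__":
    # The bundled database version decides which codepoints are known.
    print("Unicode database version:", unicodedata.unidata_version)
    for cp in (0x1f576, 0x1f57a):
        print(f"{cp:x} known: {is_known(cp)}")
```

An older interpreter, like an older libc, will simply report fewer
codepoints as known.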
http://unicode.org/cldr/utility/character.jsp?a=1f576=Show says its
East Asian width is `Neutral', which is treated as `Narrow', so
getcwidth reporting 1 matches.
Nearby is http://unicode.org/cldr/utility/character.jsp?a=1f57a=Show
that says it's `Wide', but my system here doesn't know anything about
that rune yet, thankfully.
$ test/getcwidth --ctype | grep 1f57a
1f57a -1
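The Neutral-is-Narrow rule can be sketched in Python, again using the
interpreter's own Unicode tables rather than the locale that
getcwidth reads. `estimated_width' is a hypothetical helper, and it
deliberately ignores zero-width and unknown runes, so it is not a
wcwidth(3) replacement.

```python
import unicodedata

def estimated_width(codepoint):
    """Estimate columns from East Asian Width: Wide/Fullwidth -> 2, else 1."""
    ch = chr(codepoint)
    # 'W' (Wide) and 'F' (Fullwidth) take two columns; Neutral, Narrow,
    # Halfwidth and Ambiguous are all treated as one column here.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

if __name__ == "__main__":
    for cp in (0x1f576, 0x1f57a):
        eaw = unicodedata.east_asian_width(chr(cp))
        print(f"{cp:x} {eaw} {estimated_width(cp)}")
```

With tables new enough to include U+1F57A, the two neighbours come
out with different widths, which is the kind of disagreement that
shows up between systems.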
One can poke about the local definitions.
$ test/getcwidth --ctype | awk '{print $2}' |
> sort -n | uniq -c
57249 -1
1723 0
29884 1
95464 2
$
$ test/getcwidth --ctype | awk '{print $3}' |
> LC_ALL=C sort | uniq -c
57183
14 -psb
15563 -pg--@--
10 -pg---dxN---
107528 -pgaN---
2167 -pga-l--N---
6 -pga-l-xN---
1772 -pgau---N---
6 -pgau--xN---
4 -pgaul--N---
60 c---
6 c-s-
1 c-sb
$
That says there are four runes that are both upper and lower!
$ printf '%b\n' $(test/getcwidth --ctype |
> awk '$3 ~ /ul/ {print "\\u" $1}')
Dž
Lj
Nj
Dz
$
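Those four look like Unicode's titlecase digraphs. A hedged
cross-check against Python's tables, not the locale's (`case_info' is
a hypothetical helper): each has general category Lt and maps to a
different rune under both upper() and lower(), which may be why the
locale files them under both classes.

```python
import unicodedata

def case_info(codepoint):
    """Return (general category, uppercase codepoint, lowercase codepoint)."""
    ch = chr(codepoint)
    # Each of these digraphs maps to a single rune in either direction.
    return (unicodedata.category(ch), ord(ch.upper()), ord(ch.lower()))

if __name__ == "__main__":
    for cp in (0x01c5, 0x01c8, 0x01cb, 0x01f2):
        category, upper, lower = case_info(cp)
        print(f"U+{cp:04X} {category} upper=U+{upper:04X} lower=U+{lower:04X}")
```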
And here's the first printable zero-width.
$ test/getcwidth --ctype | grep -m1 ' 0 .*p'
ad 0 -pg--@--
U+00AD is soft hyphen. Unicode is said to be an ISO 8859-1 superset,
and 0xAD was soft hyphen in that too, but visible, with a width of 1.
ISO used it at the end of the line to show a word had been broken, but
not by the author, allowing it to be stripped on re-formatting. Unicode
changed that. For them, it's a hint from the author to the renderer
that here's a potential point to break the word, thus, when rendered,
it's not visible and has zero width. Toc toc toc!
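A small Python sketch of the superset claim and of how Unicode
classifies the rune. It uses the interpreter's bundled tables, so it
says nothing about what a given locale or terminal does with it.

```python
import unicodedata

SHY = "\u00ad"

# Every ISO 8859-1 byte value decodes to the codepoint of the same number.
latin1 = bytes(range(256)).decode("latin-1")
same_numbers = all(ord(ch) == i for i, ch in enumerate(latin1))

if __name__ == "__main__":
    print(unicodedata.name(SHY))      # the rune's formal name
    print(unicodedata.category(SHY))  # general category (a format character)
    print(SHY.encode("latin-1"))      # one byte in ISO 8859-1
    print(SHY.encode("utf-8"))        # two bytes in UTF-8
    print(same_numbers)
```

The format-character category is what lets a width table call it
zero width.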
Some terminals get this wrong. libvte-based terminals here think it
has width.
$ s="$(printf '\uad')"
$ scan -format "_%4(lit foo)_\n_%4(lit £)_\n_%4(lit $s)_" .
_foo _
_£   _
__ [Rune after first _ isn't a space.]
$
Dickey's venerable xterm(1) does better.
$ s="$(printf '\uad')"
$ scan -format "_%4(lit foo)_\n_%4(lit £)_\n_%4(lit $s)_" .
_foo _
_£   _
__ [All four are spaces.]
$
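Why the two terminals can differ on identical output: a sketch in
Python of padding to a column count by estimated width. That scan's
%4(...) pads exactly this way is my assumption, and `estimated_width'
and `pad' are hypothetical helpers built on Python's tables, not on
wcwidth(3).

```python
import unicodedata

def estimated_width(text):
    """Columns a string is assumed to need: 0 for format/combining, 2 for wide."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # treated as zero width, like the soft hyphen
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

def pad(text, columns):
    """Right-pad with spaces until the estimated width reaches `columns`."""
    return text + " " * max(0, columns - estimated_width(text))

if __name__ == "__main__":
    for s in ("foo", "£", "\u00ad"):
        print(f"_{pad(s, 4)}_")
```

Under that assumption the soft hyphen is followed by four spaces. A
terminal that hides it shows four cells between the underscores,
while one that draws it in a cell of its own shows five.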
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
___
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers