I appreciate you have a much stronger opinion on this than I do :-)
> However, I doubt anyone who hasn't
> already been bitten by this issue would ever expect this:
Anyone using regex ranges in a non-POSIX locale should be aware of
this. It has been like this for quite a while now and is mostly
consistent across tools like bash and grep.
> The problem is that collation is meant for collation; it's unsuitable for
> range matching.
That is your opinion, but the POSIX standard would have to change for
it to become the official view [1]:
"The LC_COLLATE category provides a collation sequence definition for
[...] regular expression matching"
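To illustrate what collation-based range matching means in practice,
here is a minimal Python sketch. The collation sequence in it
(aAbB...zZ) is a toy order assumed purely for illustration, not the
actual table of any real locale, and the function names are made up
for this example.

```python
import string

# Hypothetical collation sequence: letters interleaved as aAbBcC...zZ.
# This is an assumption for illustration, not a real locale's table.
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)
RANK = {ch: i for i, ch in enumerate(COLLATION)}


def in_range_codepoint(ch, start, end):
    """Range test by code point, as in the C/POSIX locale."""
    return ord(start) <= ord(ch) <= ord(end)


def in_range_collation(ch, start, end):
    """Range test by position in the (toy) collation sequence."""
    if ch not in RANK:
        return False
    return RANK[start] <= RANK[ch] <= RANK[end]


if __name__ == "__main__":
    sample = "aBZz"
    print([c for c in sample if in_range_codepoint(c, "a", "z")])
    print([c for c in sample if in_range_collation(c, "a", "z")])
```

With this toy order, [a-z] by code point matches only 'a' and 'z' from
the sample, while the collation-based test also matches 'B' (but not
'Z', which sorts after 'z').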
> Other examples where you want ranges by codepoint order, just off the top of
> my head: '[ぁ-ヾヲ-゚]' (incomplete) to match Japanese hiragana and katakana;
I think these are good examples for *not* using codepoint ranges --
in these cases you'd rather use Unicode character classes like
\p{InHiragana} etc. These are supported by perl regexes at least; I'm
not sure of the current status in grep.
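As a rough sketch of matching by character property rather than by
range: Python's standard `re` module does not support \p{...} classes,
so the example below approximates the idea by looking at each
character's Unicode name. That is an assumption-laden stand-in, not
equivalent to perl's \p{InHiragana} (which is block-based), and the
helper names are hypothetical.

```python
import unicodedata


def is_kana(ch: str) -> bool:
    """True if the character's Unicode name mentions hiragana or katakana.

    Name-based approximation only; it can over-match symbols whose
    names merely contain those words.
    """
    name = unicodedata.name(ch, "")
    return "HIRAGANA" in name or "KATAKANA" in name


def all_kana(text: str) -> bool:
    """True if text is non-empty and every character passes is_kana()."""
    return bool(text) and all(is_kana(ch) for ch in text)


if __name__ == "__main__":
    for word in ("ひらがな", "カタカナ", "kana", "漢字"):
        print(word, all_kana(word))
```

The point is that the test does not depend on where the characters
happen to sit in either codepoint or collation order.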
Either way, this is certainly a fundamental issue that is not
appropriate for an Ubuntu-specific change. I'd therefore recommend
raising your concerns e.g. on the grep or libc mailing lists.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/ -- 7.3.2 LC_COLLATE
https://bugs.launchpad.net/bugs/754272
Title:
Range matching incorrect in UTF-8