On 2014-09-01 01:31:53 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >If there are many invalid UTF8 bytes, this would be slow, IMHO
>
> That's OK. We don't need grep -P to be fast on invalid input.

I can see the slowdown being too significant in practical cases.

> >But is the copy of the buffer really needed? Couldn't the invalid
> >UTF8 sequences just be replaced by null bytes?
>
> I'd rather not, because that changes the semantics of matching. The null
> byte is valid input data that might get matched.

It appears that the current behavior in UTF-8 is incorrect, even
without -P. For instance:

  $ printf 'tr\xe8s\n' > text
  $ grep 'tr.s' text
  $ LC_ALL=C grep 'tr.s' text
  tr<E8>s

There's no reason for '.' to match a byte that doesn't belong to the
charset in the C locale, yet fail to match it in a UTF-8 locale.

The pattern tr.s is used here to match the French word "très" in
files that could be encoded in either ISO-8859-1 or UTF-8. In the
past, before using UTF-8 locales, I was doing something like:

  grep -E 'tr..?s' text

to match both encodings, and this worked (I could get false positives,
but in practice one is often not interested in every real grep match
anyway, so even when the encoding was known, one was already getting
false positives). It's annoying that now, in a UTF-8 locale, one can
no longer match ISO-8859-1 text, and doing a pre-conversion would take
too much time.

Concerning binary files, I've never wanted to explicitly distinguish
between null bytes and invalid UTF-8 sequences: IMHO, both are just
garbage. Obviously, this makes no difference for patterns like
'some_word' or 'foo[0-9]*bar', but when I use a pattern like 'foo.bar'
or 'foo.*bar', I can see two valid reasons to handle these sequences
in a similar way with '.':

1. One may want to match "valid" (often in the sense "printable", in
   the specified encoding) but unknown characters.

2. One may also want to match garbage (including null bytes, and also
   bytes that do not have any meaning in the charset), with the
   drawback that if the garbage contains a newline character, this
   won't work.

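To make the byte-level reasoning above concrete, here is a minimal sketch in Python. It is not grep: Python's re module applied to bytes objects is used only as a stand-in for matching in the C locale, where '.' matches any single byte other than newline. The helper names matches and is_valid_utf8 are made up for this illustration.

```python
import re

# "très" as raw bytes in the two encodings discussed above.
latin1 = "très\n".encode("iso-8859-1")  # b'tr\xe8s\n'
utf8 = "très\n".encode("utf-8")         # b'tr\xc3\xa8s\n'

# Byte-oriented patterns: '.' matches any one byte except newline.
# This only approximates grep in the C locale (assumption of this sketch).
one_byte = re.compile(rb"tr.s")
one_or_two = re.compile(rb"tr..?s")


def matches(pattern, data):
    """Return True if the compiled bytes pattern is found anywhere in data."""
    return pattern.search(data) is not None


def is_valid_utf8(data):
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(matches(one_byte, latin1))       # True: 0xE8 is one byte
print(matches(one_byte, utf8))         # False: "è" is two bytes in UTF-8
print(matches(one_or_two, latin1))     # True
print(matches(one_or_two, utf8))       # True
print(matches(one_or_two, b"trees\n")) # True: a false positive
print(is_valid_utf8(latin1))           # False: a lone 0xE8 is invalid UTF-8
```

The last line shows why the ISO-8859-1 file counts as invalid input in a UTF-8 locale, while the 'tr..?s' lines show how the two-encoding trick works at the byte level, false positives included.
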
-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)