Bug#758105: bug#18266: handling bytes not part of the charset, and other garbage

2014-09-16 Thread Paul Eggert
Paul Eggert wrote:
> Attached are some proposed patches which should improve the performance of grep -P when applied to binary files, among other things. I have some other ideas for boosting performance further but thought I'd publish these first.
I pushed those patches, along with the attached

2014-09-14 Thread Paul Eggert
Attached are some proposed patches which should improve the performance of grep -P when applied to binary files, among other things. I have some other ideas for boosting performance further but thought I'd publish these first. Please give them a try if you have the time. I doubt whether

2014-09-12 Thread Vincent Lefevre
On 2014-09-11 20:26:12 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
>> ypig% LC_ALL=C locale charmap
>> ANSI_X3.4-1968
> That may be what the 'locale' command says, but bytes with the top bit on are considered to be valid single-byte characters. There are no encoding errors. So, in that

2014-09-12 Thread Paul Eggert
Vincent Lefevre wrote:
> Glibc regards it as ASCII:
You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work the way that I thought, though. Plus, in GNU regular expressions the pattern . works the way that I thought with LC_ALL=C; my guess (without investigating this) is that

2014-09-12 Thread Vincent Lefevre
On 2014-09-12 09:16:45 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
>> I just mean that grep . is a method given by some people, that was working before UTF-8.
> And it still works, if by . one means match one character.
No, by working, I mean that grep . was matching any non-empty line. A

2014-09-12 Thread Paul Eggert
On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
> an option to control what happens on encoding errors would be better and sufficient.
It might suffice for your use cases, but it's more complicated and less flexible than being able to match bytes within the regular expression. (Plus, someone

2014-09-12 Thread Jim Meyering
On Fri, Sep 12, 2014 at 2:39 PM, Paul Eggert egg...@cs.ucla.edu wrote:
> On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
>> an option to control what happens on encoding errors would be better and sufficient.
> It might suffice for your use cases, but it's more complicated and less flexible than

2014-09-12 Thread Vincent Lefevre
On 2014-09-12 14:39:35 -0700, Paul Eggert wrote:
> On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
>> an option to control what happens on encoding errors would be better and sufficient.
> It might suffice for your use cases, but it's more complicated and less flexible than being able to match

2014-09-12 Thread Paul Eggert
Vincent Lefevre wrote:
> I wonder whether anyone is interested in matching individual bytes in a file regarded as UTF-8 encoded. This seems weird.
It's not weird at all. For example, suppose we invent the notation [[:error:]] to match encoding errors. Then the pattern '[[:error:]]' would

2014-09-12 Thread Vincent Lefevre
On 2014-09-12 17:57:39 -0700, Paul Eggert wrote:
> Currently, for example, the tz package http://www.iana.org/time-zones has a Make rule 'check_character_set' that verifies that the source files are all properly encoded. It executes this shell command:
> ! grep -nv '^.*$' file names
> This
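
For readers unfamiliar with the trick in the quoted tz rule: in a UTF-8 locale, . does not match a byte that is an encoding error, so '^.*$' fails on any line containing one, and -v prints exactly those lines. A rough Python sketch of the same check follows. It is not part of the thread; the function name lines_with_encoding_errors and the sample data are made up for illustration, UTF-8 is assumed as the target encoding, and Python's strict decoder may not agree with a given libc on every edge case.

```python
from typing import List, Tuple


def lines_with_encoding_errors(data: bytes, encoding: str = "utf-8") -> List[Tuple[int, bytes]]:
    """Return (line_number, raw_line) for every line that does not decode cleanly.

    Hypothetical helper: loosely mirrors `grep -nv '^.*$'` run in a locale
    whose charset is `encoding`.
    """
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            # Strict decoding raises on any byte sequence that is not valid.
            line.decode(encoding, errors="strict")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Illustrative input only: the third line contains a stray 0xFF byte.
sample = b"Europe/Paris\ncaf\xc3\xa9 ok\nbroken \xff byte\n"
for number, line in lines_with_encoding_errors(sample):
    print(f"{number}:{line!r}")
```

Like the Make rule, a caller would treat an empty result as success and any reported line as a failure.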

2014-09-12 Thread Paul Eggert
Vincent Lefevre wrote:
> But both of these solutions have the drawback of working only in UTF-8 locales.
Not at all; '[[:error:]]' would match a single-byte encoding error in the current locale. The tz database is interested in UTF-8 so it sets the LC_ALL environment variable to a UTF-8

2014-09-12 Thread Paul Eggert
Come to think of it, grep -P misbehaves badly in multibyte locales that are not UTF-8. It should report an error and exit rather than output gibberish. I installed the attached patch to catch that.
From cac91e3e233b769d60d7b5d6bc0e8afc67c0c713 Mon Sep 17 00:00:00 2001
From: Paul Eggert

2014-09-11 Thread Paul Eggert
Vincent Lefevre wrote:
> the C locale corresponds to ANSI_X3.4-1968,
No it doesn't, at least not on any current platform I'm aware of. And POSIX does not require that. POSIX even allows the C locale to be multibyte, e.g., UTF-8.
> I would say that this should be the same for invalid byte

2014-09-11 Thread Vincent Lefevre
On 2014-09-11 18:16:29 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
>> the C locale corresponds to ANSI_X3.4-1968,
> No it doesn't, at least not on any current platform I'm aware of.
It does on Debian:
ypig% LC_ALL=C locale charmap
ANSI_X3.4-1968
>> I would say that this should be the same

2014-09-11 Thread Paul Eggert
Vincent Lefevre wrote:
> ypig% LC_ALL=C locale charmap
> ANSI_X3.4-1968
That may be what the 'locale' command says, but bytes with the top bit on are considered to be valid single-byte characters. There are no encoding errors. So, in that sense it's not strict ASCII.
> the current behavior