Paul Eggert wrote:
Attached are some proposed patches which should improve the performance
of grep -P when applied to binary files, among other things. I have
some other ideas for boosting performance further but thought I'd
publish these first.
I pushed those patches, along with the attached
Attached are some proposed patches which should improve the performance
of grep -P when applied to binary files, among other things. I have
some other ideas for boosting performance further but thought I'd
publish these first. Please give them a try if you have the time. I
doubt whether
On 2014-09-11 20:26:12 -0700, Paul Eggert wrote:
Vincent Lefevre wrote:
ypig% LC_ALL=C locale charmap
ANSI_X3.4-1968
That may be what the 'locale' command says, but bytes with the top bit on
are considered to be valid single-byte characters. There are no encoding
errors. So, in that
Vincent Lefevre wrote:
Glibc regards it as ASCII:
You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work
the way that I thought, though. Plus, in GNU regular expressions the
pattern . works the way that I thought with LC_ALL=C; my guess
(without investigating this) is that
On 2014-09-12 09:16:45 -0700, Paul Eggert wrote:
Vincent Lefevre wrote:
I just mean that grep . is a method given by some people, that
was working before UTF-8.
And it still works, if by . one means match one character.
No, by working, I mean that grep . was matching any non-empty
line. A
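The two readings of "works" in this exchange can be made concrete with a small Python sketch. It assumes UTF-8 as the encoding and models `.` as "one validly encoded character"; the function names are hypothetical and this is not how grep itself is implemented.

```python
def is_nonempty(line: bytes) -> bool:
    """The older expectation: any line with at least one byte counts."""
    return len(line) > 0


def has_valid_utf8_char(line: bytes) -> bool:
    """The character-based reading: the line must contain at least one
    validly encoded UTF-8 character (assumed model of '.')."""
    # errors="ignore" silently drops bytes that cannot be decoded,
    # so only valid characters remain in the result.
    return len(line.decode("utf-8", errors="ignore")) > 0


if __name__ == "__main__":
    for sample in (b"", b"abc", b"a\xff", b"\xff\xfe"):
        print(sample, is_nonempty(sample), has_valid_utf8_char(sample))
```

Under this model the two tests disagree only for lines made up entirely of invalid bytes, such as `b"\xff\xfe"`.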
On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
an option to control what happens on encoding errors would be better
and sufficient.
It might suffice for your use cases, but it's more complicated and less
flexible than being able to match bytes within the regular expression.
(Plus, someone
On Fri, Sep 12, 2014 at 2:39 PM, Paul Eggert egg...@cs.ucla.edu wrote:
On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
an option to control what happens on encoding errors would be better and
sufficient.
It might suffice for your use cases, but it's more complicated and less
flexible than
On 2014-09-12 14:39:35 -0700, Paul Eggert wrote:
On 09/12/2014 02:29 PM, Vincent Lefevre wrote:
an option to control what happens on encoding errors would be
better and sufficient.
It might suffice for your use cases, but it's more complicated and less
flexible than being able to match
Vincent Lefevre wrote:
I wonder whether anyone is interested in matching individual bytes
in a file regarded as UTF-8 encoded. This seems weird.
It's not weird at all. For example, suppose we invent the notation
[[:error:]] to match encoding errors. Then the pattern '[[:error:]]'
would
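As a rough illustration of what matching encoding errors byte by byte could mean, here is a Python sketch. `[[:error:]]` is only a proposed notation in this thread; the helper name below is hypothetical, and UTF-8 is assumed as the encoding.

```python
def encoding_error_bytes(data: bytes) -> list[int]:
    """Return the values of the bytes in `data` that are not part of any
    valid UTF-8 sequence, in order of appearance."""
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DCNN, so invalid bytes can be found one at a time.
    text = data.decode("utf-8", errors="surrogateescape")
    return [ord(ch) - 0xDC00 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]


if __name__ == "__main__":
    # 0xC3 0xA9 is a valid two-byte sequence; 0xFF and the stray 0x80 are not.
    print(encoding_error_bytes(b"a\xffb\xc3\xa9\x80"))
```

Each reported byte corresponds to one position where a per-byte error matcher would succeed.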
On 2014-09-12 17:57:39 -0700, Paul Eggert wrote:
Currently, for example, the tz package http://www.iana.org/time-zones has
a Make rule 'check_character_set' that verifies that the source files are
all properly encoded. It executes this shell command:
! grep -nv '^.*$' file names
This
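For readers unfamiliar with the idiom: in a UTF-8 locale, `grep -nv '^.*$'` appears to rely on `.` not matching encoding errors, so the only lines it prints are those that are not valid UTF-8. A minimal Python sketch of the same check follows, assuming UTF-8; the function name is hypothetical and the sketch is not part of the tz package.

```python
def find_invalid_utf8_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Return (line number, raw line) for every line that fails to decode
    as UTF-8. Line numbers start at 1, as with grep -n."""
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((lineno, line))
    return bad


if __name__ == "__main__":
    sample = b"ok\n\xff bad\ncaf\xc3\xa9\n"
    for lineno, line in find_invalid_utf8_lines(sample):
        print(f"{lineno}:{line!r}")
```

An empty result plays the role of the `!` in the shell command: the check passes only when no offending line is found.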
Vincent Lefevre wrote:
But both of these solutions have the drawback of working only in
UTF-8 locales.
Not at all; '[[:error:]]' would match a single-byte encoding error in
the current locale. The tz database is interested in UTF-8 so it sets
the LC_ALL environment variable to a UTF-8
Come to think of it, grep -P misbehaves badly in multibyte locales that
are not UTF-8. It should report an error and exit rather than output
gibberish. I installed the attached patch to catch that.
From cac91e3e233b769d60d7b5d6bc0e8afc67c0c713 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Vincent Lefevre wrote:
the C locale corresponds to ANSI_X3.4-1968,
No it doesn't, at least not on any current platform I'm aware of. And
POSIX does not require that. POSIX even allows the C locale to be
multibyte, e.g., UTF-8.
I would say that this should be the same for invalid
byte
On 2014-09-11 18:16:29 -0700, Paul Eggert wrote:
Vincent Lefevre wrote:
the C locale corresponds to ANSI_X3.4-1968,
No it doesn't, at least not on any current platform I'm aware of.
It does on Debian:
ypig% LC_ALL=C locale charmap
ANSI_X3.4-1968
I would say that this should be the same
Vincent Lefevre wrote:
ypig% LC_ALL=C locale charmap
ANSI_X3.4-1968
That may be what the 'locale' command says, but bytes with the top bit
on are considered to be valid single-byte characters. There are no
encoding errors. So, in that sense it's not strict ASCII.
the current behavior
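The difference between strict ASCII and "every byte is a valid single-byte character" can be sketched in Python. Latin-1 is used here only as a stand-in for a decoder that accepts all 256 byte values; that choice is an assumption for illustration and says nothing about what glibc does internally.

```python
def is_strict_ascii(data: bytes) -> bool:
    """True only if no byte has its top bit set (0x80..0xFF)."""
    return all(b < 0x80 for b in data)


def count_single_byte_chars(data: bytes) -> int:
    """Treat every byte as one character, top bit set or not."""
    # Latin-1 maps each of the 256 byte values to exactly one character,
    # so decoding can never raise an encoding error.
    return len(data.decode("latin-1"))


if __name__ == "__main__":
    sample = b"caf\xe9"
    print(is_strict_ascii(sample))          # byte 0xE9 is outside ASCII
    print(count_single_byte_chars(sample))  # four bytes, four characters
```

In the strict-ASCII view the byte 0xE9 is an encoding error; in the single-byte view it is simply one more character.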