bug#30326: grep not searching through a text file (thinking it binary)

2018-04-20 Thread Paul Eggert

On 02/05/2018 03:38 PM, Paul Eggert wrote:
I was referring to text containing encoding errors without containing 
NULs, which is what this bug report originally was about. Sorry I 
didn't make that clear.


Following up on this (with some delay...), I installed the attached 
patch to try to cover this point more clearly in the grep manual.


From 9904a2bcb099048e5a17bdd6edf6595764911741 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Fri, 20 Apr 2018 15:19:09 -0700
Subject: [PATCH] doc: mention encoding errors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This attempts to document the encoding-error problem more
precisely (Bug#30326).
* doc/grep.in.1, doc/grep.texi: Mention that the behavior of
patterns like ‘.’ is not specified on encoding errors.
---
 doc/grep.in.1 |  6 ++++--
 doc/grep.texi | 40 +++++++++++++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/doc/grep.in.1 b/doc/grep.in.1
index 9393b37..ae14e54 100644
--- a/doc/grep.in.1
+++ b/doc/grep.in.1
@@ -744,6 +744,7 @@ may be quoted by preceding it with a backslash.
 The period
 .B .\&
 matches any single character.
+It is unspecified whether it matches an encoding error.
 .SS "Character Classes and Bracket Expressions"
 A
 .I "bracket expression"
@@ -752,12 +753,13 @@ is a list of characters enclosed by
 and
 .BR ] .
 It matches any single
-character in that list; if the first character of the list
+character in that list.
+If the first character of the list
 is the caret
 .B ^
 then it matches any character
 .I not
-in the list.
+in the list; it is unspecified whether it matches an encoding error.
 For example, the regular expression
 .B [0123456789]
 matches any single digit.
diff --git a/doc/grep.texi b/doc/grep.texi
index 922d96e..58caa62 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1016,6 +1016,8 @@ interpreted.
 @vindex LC_ALL @r{environment variable}
 @vindex LC_CTYPE @r{environment variable}
 @vindex LANG @r{environment variable}
+@cindex encoding error
+@cindex null character
 These variables specify the locale for the @env{LC_CTYPE} category,
 which determines the type of characters,
 e.g., which characters are whitespace.
@@ -1023,6 +1025,18 @@ This category also determines the character encoding, that is, whether
 text is encoded in UTF-8, ASCII, or some other encoding.  In the
 @samp{C} or @samp{POSIX} locale, all characters are encoded as a
 single byte and every byte is a valid character.
+In more-complex encodings such as UTF-8, a sequence of multiple bytes
+may be needed to represent a character, and some bytes may be encoding
+errors that do not contribute to the representation of any character.
+POSIX does not specify the behavior of @command{grep} when patterns or
+input data contain encoding errors or null characters, so portable
+scripts should avoid such usage.  As an extension to POSIX, GNU
+@command{grep} treats null characters like any other character.
+However, unless the @option{-a} (@option{--binary-files=text}) option
+is used, the presence of null characters in input or of encoding
+errors in output causes GNU @command{grep} to treat the file as binary
+and suppress details about matches.  @xref{File and Directory
+Selection}.
 
 @item LANGUAGE
 @itemx LC_ALL
@@ -1187,16 +1201,16 @@ are regular expressions that match themselves.
 Any meta-character
 with special meaning may be quoted by preceding it with a backslash.
 
-A regular expression may be followed by one of several
-repetition operators:
-
-@table @samp
-
-@item .
 @opindex .
 @cindex dot
 @cindex period
 The period @samp{.} matches any single character.
+It is unspecified whether @samp{.} matches an encoding error.
+
+A regular expression may be followed by one of several
+repetition operators:
+
+@table @samp
 
 @item ?
 @opindex ?
@@ -1267,11 +1281,15 @@ An unmatched @samp{)} matches just itself.
 @cindex character class
 A @dfn{bracket expression} is a list of characters enclosed by @samp{[} and
 @samp{]}.
-It matches any single character in that list;
-if the first character of the list is the caret @samp{^},
-then it matches any character @strong{not} in the list.
+It matches any single character in that list.
+If the first character of the list is the caret @samp{^},
+then it matches any character @strong{not} in the list,
+and it is unspecified whether it matches an encoding error.
 For example, the regular expression
-@samp{[0123456789]} matches any single digit.
+@samp{[0123456789]} matches any single digit,
+whereas @samp{[^()]} matches any single character that is not
+an opening or closing parenthesis, and might or might not match an
+encoding error.
 
 @cindex range expression
 Within a bracket expression, a @dfn{range expression} consists of two
@@ -1856,7 +1874,7 @@ On some operating systems that support files with holes---large
 regions of zeros that are not physically present on secondary
 storage---@command{grep} can skip over the holes eff

bug#30326: grep not searching through a text file (thinking it binary)

2018-02-05 Thread Paul Jackson

Paul Eggert wrote:
>>  I was referring to text containing encoding errors without
>>  containing NULs
Ah - that makes sense.

The following experiment leads me to conclude that grep entirely
suppresses emitting any portion of a match that would contain an encoding
error, rather than emitting some substring of the match that can be
correctly encoded.

That is, it seems that if grep is asked to emit what it thinks
would be a match with an encoding error, grep suppresses that output line
entirely, and continues looking for matches that it can emit
without encoding errors. Then at the end, if it saw a match that would have
emitted an encoding error, it issues the "Binary file ... matches" message just
before exiting (or ending processing of that particular file).

I demonstrated this by replacing the ELF executable of my previous
example with the output of the following C program, which emits every possible
pair of bytes, excluding NUL (0) and 255 bytes:

    #include <stdio.h>

    int main(void)
    {
        int i, j;

        /* Every pair (i, j) with both bytes in 1..254: no NUL, no 255. */
        for (i = 1; i < 255; i++) {
            for (j = 1; j < 255; j++)
                printf("%c%c", i, j);
        }
        puts("");
        return 0;
    }

So I tested on a file (/tmp/pjcc) containing (1) a bunch of
ASCII C code, (2) output from the above program, and (3) another copy of the
same ASCII C code.

Then, with the following settings:

    LC_COLLATE=C
    LANGUAGE=en_US.UTF-8
    LC_ALL=en_US.UTF-8
    LANG=en_US.UTF-8

I ran the command:

    grep "'N'" /tmp/pjcc

I got the following output:

     case 'N':
     case 'N':
    Binary file /tmp/pjcc matches

The "case 'N':" string appears once in the C code used in the file,
but there are two copies of that C code in the file, so grep prints
that line twice.

I also double-checked that my file /tmp/pjcc did not contain any
NUL bytes.

The three-character sequence 'N' also appears in the middle section of all
non-NUL, non-255 pairs of bytes, as well as in the ASCII C code, and it was (I
presume) the match on that section of the file that
caused grep to issue the "Binary file /tmp/pjcc matches" complaint at the
end of its processing of that file.

If, on the other hand, I ran the command:

    grep "'N':" /tmp/pjcc

then I got the output:

     case 'N':
     case 'N':

without any "Binary file /tmp/pjcc matches" complaint.

The four character sequence *'N':*  appears (twice) in the C code,
but zero times in the middle section of all non-nul, non-255
pairs of bytes.
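
The claims about the generated middle section can be checked directly. The following is a minimal Python sketch, not part of the original experiment, that rebuilds the same byte pairs in memory and tests them. The names `pairs` and `is_valid_utf8` are illustrative, and Python's UTF-8 codec stands in for the locale's encoding.

```python
# Rebuild the output of the C program above: every pair (i, j) with
# both bytes in 1..254, followed by the newline that puts("") adds.
pairs = bytes(b for i in range(1, 255) for j in range(1, 255) for b in (i, j)) + b"\n"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(b"\x00" in pairs)      # False: no NUL bytes
print(is_valid_utf8(pairs))  # False: the data contains encoding errors
print(b"'N'" in pairs)       # True: the three-character sequence occurs
print(b"'N':" in pairs)      # False: the four-character sequence does not
```

The three-character sequence arises where the pair (', N) is followed by the pair (', O); no two adjacent pairs can produce the four-character sequence.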
From this I conclude that if grep, in its default mode, is asked to emit
a matching pattern that would contain encoding errors, it does not trim the
output to what would encode correctly and continue onward, but rather emits
nothing for that match, continues onward looking for more matches that it
can emit correctly, and then prints the "Binary file ... matches" message
just before it exits or goes to the next file.
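
The behavior inferred in this message can be written down as a toy model. This is only a sketch of the conclusion drawn from the experiment, not GNU grep's actual algorithm, and real grep may behave differently. The function name `model_grep` is hypothetical, Python's `re` module stands in for grep's matcher, and the UTF-8 codec stands in for the locale.

```python
import re


def model_grep(pattern: bytes, data: bytes, name: str, encoding: str = "utf-8") -> list:
    """Toy model: print matching lines that decode cleanly, suppress the rest.

    If any matching line had to be suppressed because it contains an
    encoding error, a "Binary file ... matches" line is appended at the end.
    """
    regex = re.compile(pattern)
    output = []
    suppressed = False
    for raw_line in data.split(b"\n"):
        if not regex.search(raw_line):
            continue
        try:
            output.append(raw_line.decode(encoding))
        except UnicodeDecodeError:
            suppressed = True  # a match exists, but it cannot be shown as text
    if suppressed:
        output.append(f"Binary file {name} matches")
    return output


code = b"    case 'N':\n"
junk = b"'M'N'O\x80\x81\n"  # contains 'N' plus bytes that are invalid UTF-8
sample = code + junk + code

print(model_grep(rb"'N'", sample, "/tmp/pjcc"))
print(model_grep(rb"'N':", sample, "/tmp/pjcc"))
```

With this sample, the first call yields the two clean lines followed by the "Binary file" message, and the second call yields only the two clean lines, which mirrors the two runs reported above.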

If I were designing grep from scratch, and had infinite resources, I
might prefer to have grep emit some substring of each match that it can
encode correctly, rather than emit nothing in case of an encoding error.

However, I can't imagine that this is worth the effort, and
(being a stick-in-the-mud old fart) I usually recommend against incompatible
changes unless strongly necessary.

So ... whatever ... nevermind ... as they say.

--
Paul Jackson
p...@usa.net



bug#30326: grep not searching through a text file (thinking it binary)

2018-02-05 Thread Paul Eggert

On 02/05/2018 01:27 PM, Paul Jackson wrote:


I created a large file ("/tmp/pjbb")  by concatenating:
1) a big plain ASCII file of C source code,
2) a small ELF executable, and
3) another big plain ASCII file of C source code.

Then I grep'd in this big file for the string "p...@usa.net", which
appeared twice in the first file of C source code, and once
again in the second file of C source code.


That example contains NULs, which have indicated binary data for ages. I 
was referring to text containing encoding errors without containing 
NULs, which is what this bug report originally was about. Sorry I didn't 
make that clear.







bug#30326: grep not searching through a text file (thinking it binary)

2018-02-05 Thread Paul Jackson

Paul Eggert wrote, in response to my suggestion to filter grep output,
not input, for "binary junk":
>> We've done that already, if memory serves.

I don't think so :).

The installed grep on the system I'm typing on right now is "grep (GNU
grep) 3.0". I've not checked closely, but I believe that should be a fairly
recent grep.

I created a large file ("/tmp/pjbb")  by concatenating:
1) a big plain ASCII file of C source code,
2) a small ELF executable, and
3) another big plain ASCII file of C source code.

Then I grep'd in this big file for the string "p...@usa.net", which
appeared twice in  the first file of C source code,  and once
again in the second file of C source code.

Here's what I see:


$ grep --version | head -1
grep (GNU grep) 3.0

$ grep p...@usa.net /tmp/pjbb
* p...@usa.net
* p...@usa.net
Binary file /tmp/pjbb matches

$ grep -a p...@usa.net /tmp/pjbb
* p...@usa.net
* p...@usa.net
* p...@usa.net


By default, grep sees the first two "p...@usa.net",
then abandons the search before seeing the third
such, when it first encounters the ELF binary.

Using "grep -a" to ask grep to persist, it sees all
three "p...@usa.net" strings.

===

My ancient home-brew hack that provides ASCII trimmed
output when scanning binary files for ASCII strings, contains
custom code to buffer the already scanned input, in order
that it can then scan backwards, once it finds a match.

The usual line oriented buffering doesn't work so well when
the input file might have no, or at least infrequent, line breaks.

--
Paul Jackson
    p...@usa.net


bug#30326: grep not searching through a text file (thinking it binary)

2018-02-05 Thread Paul Eggert

On 02/05/2018 08:05 AM, Paul Jackson wrote:

If one goal of the  current grep behavior is to avoid putting out
"junk" unexpectedly, then instead of rejecting input files that
have any such "junk", rather happily grep on any dang file, by
default, but then filter the output to suppress the "junk".


We've done that already, if memory serves.






bug#30326: grep not searching through a text file (thinking it binary)

2018-02-05 Thread Paul Jackson

A couple of possible "solutions" to this quandary:

===

If one goal of the  current grep behavior is to avoid putting out
"junk" unexpectedly, then instead of rejecting input files that
have any such "junk", rather happily grep on any dang file, by
default, but then filter the output to suppress the "junk".

For many years now, I've been using my own private mutant
semi-brain damaged grep-variant that I use for searching for
text within mostly binary files that does this ... it will look for
any specified sequence of non-nul bytes  within any bucket of
bits, and when found, work forward and backward until it hits
either a newline or a non-ASCII character, and then limit its
output to what is between those beginning and ending points. 
No non-ASCII junk will be output (except in so far as that was
part of the requested search string.)   My private mutant only
does fixed strings (grep -F equivalent), but I imagine that the
same trimming of output could be done on a real grep as well. 
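
The trimming approach described above can be sketched in a few lines of Python. This is not the private tool itself, whose source is not shown in the thread. The names `is_clean` and `trimmed_matches` are hypothetical, and treating "ASCII" as "printable ASCII or tab" is an assumption that is stricter than the description.

```python
def is_clean(byte: int) -> bool:
    """Assumption: a byte may be output if it is printable ASCII or a tab."""
    return byte == 0x09 or 0x20 <= byte <= 0x7E


def trimmed_matches(needle: bytes, data: bytes) -> list:
    """Find each fixed-string match and widen it to the surrounding clean run.

    The context grows backward and forward until it reaches a newline or
    any other byte that is_clean() rejects, so no junk is returned except
    whatever is part of the search string itself.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    results = []
    pos = data.find(needle)
    while pos != -1:
        start = pos
        while start > 0 and is_clean(data[start - 1]):
            start -= 1
        end = pos + len(needle)
        while end < len(data) and is_clean(data[end]):
            end += 1
        results.append(data[start:end])
        pos = data.find(needle, end)  # resume after the reported context
    return results


blob = b"\x7fELF\x02\x00 contact p@x now\xff\x00\nsecond p@x line\n"
print(trimmed_matches(b"p@x", blob))
```

Because the search resumes after the reported context, several occurrences inside one clean run are reported once, much as grep prints a matching line once.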

Since "grep" is commonly used in shell scripts, I name my mutant
by some other name, and let "grep" continue to be whatever is
the current convention.

In short, if the goal is to not output "junk", then perhaps that is
what the current "grep" should do, rather than rejecting
from even considering everything in a file after it encounters
any "junk" character (even if it has already successfully found
and emitted some matches earlier in the file.)

===

Second possibility: keep one's own private copy of whatever
grep last performed as desired, in a "bin" that's on one's path
ahead of whatever "standard" and "current" grep is installed.

For many years now, I've continued to use the "ed" command
that was current back then (with a couple of my own hacks),
in preference to the current evolving ed.  Since "ed" is seldom
used within shell scripts, and when so used, is never that I've
noticed used in a way that depends on which version of "ed"
is used, I don't need to rename my preferred, archaic, "ed".

But, perhaps L. A. Walsh might choose to do with "grep" as I have
done with "ed" ... put an old version ahead of the current version
on $PATH.

(wave to "law" ... hope you're doing well.)

-- 
Paul Jackson
p...@usa.net





bug#30326: grep not searching through a text file (thinking it binary)

2018-02-04 Thread Paul Eggert

L A Walsh wrote:


I didn't care


Some users do care: they don't want grep to output binary junk that may mess up 
their screen.



Problem is on a mailbox, different emails can have different encodings.


There's no general solution to that problem. No matter what grep does, it will 
mishandle some cases. At best the user will get an approximation to what is 
really wanted. And there will be some cases where grep's default behavior (no 
matter what the default is) will do the "wrong" thing.






bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread L A Walsh



Paul Eggert wrote:

 On 02/02/2018 03:30 PM, L A Walsh wrote:
> most computer files (vs. user-files) are still single-byte.

 That's because so many of them are ASCII. But ASCII files are not the
 issue here. grep's behavior hasn't changed when operating on ASCII files
 in typical locales. The issue is text using a non-ASCII encoding that is
 not compatible with your locale; e.g., if your text file uses ISO 8859-1
 but your locale specifies UTF-8.


   I've had my locale as UTF-8 since around 2000.  My music collection
needed French, English, Middle Eastern, and now Japanese chars -- so I set
things to UTF-8.  I didn't need perfection.  For the email, I needed to know what
files the text was in so I could look at those mbox's with a mail-reader
or with a text editor.  I needed grep to work as a 1st level search tool.
It's failed on that score.

Still if it just searched for the bytes that I put in the search string, I'm
not sure how it would "go wrong".




 In my experience, UTF-8 has long been winning this battle, in the sense
 that UTF-8 is by far the dominant encoding for the non-ASCII files I
 regularly use. So I use a UTF-8 locale, and suggest this as a good
 default for most users nowadays.

 It's not possible to get direct statistics about encoding for all user
 files. However, we can see what's being published on the web. Currently
 UTF-8 is being used by about 90% of public websites whose character
 encoding can be determined, according to the latest W3Techs survey. ISO
 8859-1 is in second place, at about 4%. See:

 https://w3techs.com/technologies/overview/character_encoding/all


Whereas this one was:
Domain: Non-ISO extended-ASCII text, with very long lines

So theoretically, it would never match any locale.

Problem is on a mailbox, different emails can have different encodings.

But I didn't care -- I typed in an ASCII string -- so let it search in
octets w/no encoding.

It's also such that in a mailbox it's very likely there are going to
be lines (maybe "very long lines"), but the text I was searching for
was <80 chars.

I'm really surprised it was decided to break compat -- as I've been
doing searches like this for over 2 decades - not often, mind you, but
it's one of the big advantages for me of keeping mailboxes for my IMAP
server in mbox format.  Maildir format or others would kill search ability
with slow file-IO.  ;^/








bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread Paul Eggert

On 02/02/2018 03:30 PM, L A Walsh wrote:
most computer files (vs. user-files) are still single-byte. 


That's because so many of them are ASCII. But ASCII files are not the 
issue here. grep's behavior hasn't changed when operating on ASCII files 
in typical locales. The issue is text using a non-ASCII encoding that is 
not compatible with your locale; e.g., if your text file uses ISO 8859-1 
but your locale specifies UTF-8.


In my experience, UTF-8 has long been winning this battle, in the sense 
that UTF-8 is by far the dominant encoding for the non-ASCII files I 
regularly use. So I use a UTF-8 locale, and suggest this as a good 
default for most users nowadays.


It's not possible to get direct statistics about encoding for all user 
files. However, we can see what's being published on the web. Currently 
UTF-8 is being used by about 90% of public websites whose character 
encoding can be determined, according to the latest W3Techs survey. ISO 
8859-1 is in second place, at about 4%. See:


https://w3techs.com/technologies/overview/character_encoding/all






bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread L A Walsh



Paul Eggert wrote:

On 02/02/2018 03:16 PM, L A Walsh wrote:
  

It also used to be the default.



Single-byte locales also used to be the default. Times have changed, and 
things have gotten more complicated. We don't change default behavior 
for no reason, but we also don't keep the default the same even when the 
world has changed and another default behavior would typically be better.
  

But most computer files (vs. user-files) are still single-byte.

Even UTF-8 is mostly single byte, though treating it as a text-stream
works most of the time -- unless the user specifies a character above
0x7e.
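
The point about byte lengths can be illustrated with a small Python sketch (the helper name `utf8_length` is illustrative). It shows that ASCII characters occupy one byte in UTF-8 while others occupy several, and that the same non-ASCII character has different bytes in ISO 8859-1 and UTF-8, which is where a byte-for-byte search can miss.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# '~' (U+007E), e-acute (U+00E9), and a CJK character (U+65E5).
for ch in ("~", "\u00e9", "\u65e5"):
    print(repr(ch), utf8_length(ch), ch.encode("utf-8"))

# The same character is a single byte in ISO 8859-1 but two bytes in UTF-8.
latin1_bytes = "\u00e9".encode("iso-8859-1")
utf8_bytes = "\u00e9".encode("utf-8")
print(latin1_bytes, utf8_bytes)
```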

I more often use a search in a GUI for user-based files.







bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread Paul Eggert

On 02/02/2018 03:16 PM, L A Walsh wrote:

It also used to be the default.


Single-byte locales also used to be the default. Times have changed, and 
things have gotten more complicated. We don't change default behavior 
for no reason, but we also don't keep the default the same even when the 
world has changed and another default behavior would typically be better.







bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread L A Walsh



Paul Eggert wrote:

On 02/02/2018 12:09 PM, L A Walsh wrote:
  

Grep was able to find text strings in mboxes without a POSIX
definition telling it that it was "broken".



It's not a question of POSIX telling us what to do. It's a question of 
what is a good thing for GNU grep to do, and making sure that this 
behavior conforms to POSIX (at least if POSIXLY_CORRECT is set).
  

In this case it is not.
When grep encounters binary data, there are different "good" things to 
do depending on the application, so grep has options. The behavior 
you're asking for is available as an option. 

It also used to be the default.  I still don't want it to search through a
core or executable if they happened to be in the same directory.  But
email is organized in lines -- and I don't think I've ever had it
spew binary out to my screen (for an email search).

(i.e., I want it to work as it used to work, pre-POSIX, but still
filtering out binary files.)

In this case "file" is able to determine that it is a text
file.  Grep used to get it right after the option was added to
skip binary files, but before it had to be well-formed posix text.

FWIW, grep does handle at least 1 "binary case" -- when last line
doesn't have a linefeed -- something that some would like to believe
indicates binary -- but grep still handles that as text resulting in
some differing output when piped through "wc".


As I understand it, the 
main point of your bug report is that you want the option to be the 
default behavior. However, that would adversely affect some other common 
uses of grep and it's not clear that it's a good idea.


  






bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread Paul Eggert

On 02/02/2018 12:09 PM, L A Walsh wrote:

Grep was able to find text strings in mboxes without a POSIX
definition telling it that it was "broken".


It's not a question of POSIX telling us what to do. It's a question of 
what is a good thing for GNU grep to do, and making sure that this 
behavior conforms to POSIX (at least if POSIXLY_CORRECT is set).


When grep encounters binary data, there are different "good" things to 
do depending on the application, so grep has options. The behavior 
you're asking for is available as an option. As I understand it, the 
main point of your bug report is that you want the option to be the 
default behavior. However, that would adversely affect some other common 
uses of grep and it's not clear that it's a good idea.







bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread L A Walsh

Grep was around long before POSIX, as were most of the unix
utils.

Grep was able to find text strings in mboxes without a POSIX
definition telling it that it was "broken". 


I don't want it displaying random binary that throws my
terminal into weird modes, which is why I skip binary
files. To have grep searching through some mailboxes
while skipping others, randomly based on what email
happens to be in the box at the time, is hardly a useful
utility.

I did not ask for POSIXLY_CORRECT -- if you need to have it be
POSIXLY Correct, then use the existing var, but grep is now
broken -- since POSIX doesn't define "text" files "out in the real
world", but only for files that adhere to the POSIX standard.

People don't write emails that adhere to the POSIX standard.

Also, FWIW, grep's manpage doesn't say it is limited to posix-only
files.  Its summary says:
  grep, egrep, fgrep - print lines matching a pattern

which it does not do.  It doesn't say "print lines matching
a pattern only from POSIX text files".



Eric Blake wrote:

tag 30326 notabug
thanks

On 02/02/2018 01:30 PM, L. A. Walsh wrote:
  

I've used grep to search through my mbox-format emails for decades, but
I've run into a case where it seems to ignore a text mailbox
because, I guess, it thinks it is "binary"



Yes, that's correct.

  

If I used "-Par" it finds it.



Yes, that's also correct.

  

It seems that grep believes the file to be binary and ignores it, though
"file" calls it "text".



The file is conditionally text.  The POSIX definition of a text file is
one whose lines consist of valid characters in the current locale - but
note this definition is locale-dependent!  So a file that is text under
one locale may be binary under another.  When you are grepping a file
encoded correctly for the current locale, you get the output you want;
when you are grepping a file that contains encoding errors for the
current locale, POSIX says behavior is undefined, so GNU grep warns you
that the file is binary (in the current locale); and your use of -a
tells grep to process it anyways.  As 'file' reported that your file was
using non-ISO extended-ASCII, it probably means the file was encoded for
an 8-bit single-byte locale; and my guess is that you were running grep
under a UTF-8 locale, and generally, UTF-8 treats 8-bit single-byte
inputs as encoding errors.  Hence the warning that your file is binary,
under the current locale.

You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a
valid character, and thus where you will never encounter encoding errors
(you may encounter OTHER things that make your file binary, such as
embedded NULs, but that's a different matter).

This behavior is documented and intentional, so I'm closing this as not
a bug in the tracker.  However, feel free to add further comments or
questions to the thread.

And perhaps we could tweak the grep diagnostics to clarify whether a
file is binary because NUL bytes were encountered, vs. a file is binary
because encoding errors were encountered.

  






bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread Eric Blake
tag 30326 notabug
thanks

On 02/02/2018 01:30 PM, L. A. Walsh wrote:
> I've used grep to search through my mbox-format emails for decades, but
> I've run into a case where it seems to ignore a text mailbox
> because, I guess, it thinks it is "binary"

Yes, that's correct.

> If I used "-Par" it finds it.

Yes, that's also correct.

> 
> It seems that grep believes the file to be binary and ignores it, though
> "file" calls it "text".

The file is conditionally text.  The POSIX definition of a text file is
one whose lines consist of valid characters in the current locale - but
note this definition is locale-dependent!  So a file that is text under
one locale may be binary under another.  When you are grepping a file
encoded correctly for the current locale, you get the output you want;
when you are grepping a file that contains encoding errors for the
current locale, POSIX says behavior is undefined, so GNU grep warns you
that the file is binary (in the current locale); and your use of -a
tells grep to process it anyways.  As 'file' reported that your file was
using non-ISO extended-ASCII, it probably means the file was encoded for
an 8-bit single-byte locale; and my guess is that you were running grep
under a UTF-8 locale, and generally, UTF-8 treats 8-bit single-byte
inputs as encoding errors.  Hence the warning that your file is binary,
under the current locale.
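
The locale dependence described above can be seen with a short Python sketch. It uses Python codecs as a stand-in for the locale's encoding and is not how grep performs its check; the name `is_text_in` is illustrative, and NUL bytes are ignored here.

```python
def is_text_in(data: bytes, encoding: str) -> bool:
    """Return True if every byte sequence in data is valid in the encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# "cafe" with an e-acute, encoded for an 8-bit single-byte locale.
sample = "caf\u00e9\n".encode("iso-8859-1")

print(is_text_in(sample, "iso-8859-1"))  # True: text in that encoding
print(is_text_in(sample, "utf-8"))       # False: 0xE9 is an encoding error
```

The same bytes are therefore text under one encoding and contain an encoding error under the other.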

You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a
valid character, and thus where you will never encounter encoding errors
(you may encounter OTHER things that make your file binary, such as
embedded NULs, but that's a different matter).
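
As a rough analogy for a locale in which every byte is a valid character, Python's `re` module can match on `bytes`, where each byte is treated as one unit and no decoding takes place. This is only a sketch of the idea, not a model of `LC_ALL=C grep`; the function name `grep_bytes` and the sample mailbox contents are made up for illustration.

```python
import re


def grep_bytes(pattern: bytes, data: bytes) -> list:
    """Return the lines of data that match pattern, working purely on bytes."""
    regex = re.compile(pattern)
    return [line for line in data.split(b"\n") if regex.search(line)]


# One line with a stray ISO 8859-1 byte, one pure-ASCII line.
mailbox = b"Subject: caf\xe9\n=E2=80=A2=09Game: NCSOFT\n"

print(grep_bytes(rb"Game:\s+NCSOFT", mailbox))
print(grep_bytes(rb"caf.", mailbox))  # '.' matches the single byte 0xE9
```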

This behavior is documented and intentional, so I'm closing this as not
a bug in the tracker.  However, feel free to add further comments or
questions to the thread.

And perhaps we could tweak the grep diagnostics to clarify whether a
file is binary because NUL bytes were encountered, vs. a file is binary
because encoding errors were encountered.
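
The distinction suggested above could look something like the following Python sketch. It is purely hypothetical: the thread only proposes such a diagnostic, the function name `why_binary` and the message wording are invented here, and the UTF-8 codec stands in for the current locale.

```python
def why_binary(data: bytes, encoding: str = "utf-8"):
    """Return a reason the data would count as binary, or None if it is text."""
    if b"\x00" in data:
        return "contains NUL bytes"
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return f"encoding error at byte offset {err.start}"
    return None


print(why_binary(b"abc\x00def\n"))
print(why_binary(b"caf\xe9\n"))
print(why_binary(b"plain text\n"))
```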

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





bug#30326: grep not searching through a text file (thinking it binary)

2018-02-02 Thread L. A. Walsh

I've used grep to search through my mbox-format emails for decades, but
I've run into a case where it seems to ignore a text mailbox
because, I guess, it thinks it is "binary" (I think ignoring binary
is a default in my aliases file).

I used:


 grep -Pr 'Game:\s+NCSOFT' *


and it ignored a mailbox named 'Domain': that contained the
string:
"=E2=80=A2=09Game: NCSOFT"


 file Domain

Domain: Non-ISO extended-ASCII text, with very long lines
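
The matched line quoted above is quoted-printable text and is itself pure ASCII, so any bytes that made the mailbox look binary presumably sit elsewhere in the file. A small Python sketch using the standard `quopri` module shows what the line decodes to, assuming the payload is UTF-8 (the variable names are illustrative).

```python
import quopri
import re

raw_line = b"=E2=80=A2=09Game: NCSOFT"

# Undo the quoted-printable escapes, then interpret the bytes as UTF-8.
decoded = quopri.decodestring(raw_line).decode("utf-8")

print(raw_line.isascii())                              # the raw line is plain ASCII
print(repr(decoded))                                   # bullet, tab, then the text
print(bool(re.search(r"Game:\s+NCSOFT", decoded)))     # the pattern matches
```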


If I used "-Par" it finds it.

It seems that grep believes the file to be binary and ignores it, though
"file" calls it "text".

Any ideas?

grep -V
grep (GNU grep) 2.21.31-adf9

Maybe grep is being a bit overzealous in calling files 'binary'?