I would really like file(1) to be able to use the upstream magic files,
but they now rely heavily on regex.
Even adding only the c-lang file makes it dramatically slower,
especially on large files:
$ time file magdir/* /etc/* /bin/* >/dev/null
0m01.05s real 0m00.87s user 0m00.16s system
$ time ./file magdir/* /etc/* /bin/* >/dev/null
0m04.89s real 0m04.63s user 0m00.25s system
$ time file ./post-magic >/dev/null
0m00.54s real 0m00.45s user 0m00.10s system
$ time ./file ./post-magic >/dev/null
0m13.31s real 0m13.20s user 0m00.08s system
If I link with PCRE instead (I am not suggesting we do this!):
$ time ./file magdir/* /etc/* /bin/* >/dev/null
0m00.96s real 0m00.78s user 0m00.17s system
$ time ./file ./post-magic >/dev/null
0m00.25s real 0m00.17s user 0m00.07s system
So the best way to improve our file(1) would probably be to make our
libc regex engine faster...
On Tue, Jan 15, 2019 at 07:27:15AM +0000, Nicholas Marriott wrote:
> Hi
>
> I think I would avoid adding more of these at the moment, especially
> ones that aren't very specific (why is "package" Go and not Java?) and
> for languages that haven't been around very long, unless it is solving a
> specific problem.
>
> Original file has moved these into the magic files and made them more
> sophisticated (Magdir/c-lang), but I doubt our regex code is fast enough
> to get away with this. It is mostly stuff like ^ and leading spaces or
> #s though - perhaps we could make the C searching code better though, I
> just copied what our old file version did. Not sure it is worth it.
>
>
> On Tue, Jan 15, 2019 at 02:00:11AM -0500, Ted Unangst wrote:
> > Matteo Niccoli wrote:
> > > Didn't find any other examples. At the moment rust code is recognized
> > > as ASCII C program text.
> >
> > src/usr.bin/file/text.c has an array of special matches for text.
> >
> > It has various omissions, though.
> >
> > <!doctype html> is matched as SGML.
> > import means Java, but not python or go.
> >
> > etc. I suppose it doesn't hurt to add a few more entries, but every entry
> > slows down file. So we shouldn't go too wild.
> >
> > Anyway, this adds support for go by matching "package". It also removes two
> > entries that result in false positives if they match too soon.
> >
> > Index: text.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/file/text.c,v
> > retrieving revision 1.3
> > diff -u -p -r1.3 text.c
> > --- text.c 18 Apr 2017 14:16:48 -0000 1.3
> > +++ text.c 15 Jan 2019 06:58:36 -0000
> > @@ -31,14 +31,13 @@ static const char *text_words[][3] = {
> > { "import", "Java program", "text/x-java" },
> > { "\"libhdr\"", "BCPL program", "text/x-bcpl" },
> > { "\"LIBHDR\"", "BCPL program", "text/x-bcpl" },
> > - { "//", "C++ program", "text/x-c++" },
> > { "virtual", "C++ program", "text/x-c++" },
> > { "class", "C++ program", "text/x-c++" },
> > { "public:", "C++ program", "text/x-c++" },
> > { "private:", "C++ program", "text/x-c++" },
> > - { "/*", "C program", "text/x-c" },
> > { "#include", "C program", "text/x-c" },
> > { "char", "C program", "text/x-c" },
> > + { "package", "Go program", "text/x-go" },
> > { "The", "English", "text/plain" },
> > { "the", "English", "text/plain" },
> > { "double", "C program", "text/x-c" },
> >
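
The specificity concerns in this thread can be illustrated with a minimal sketch of a first-match-wins keyword lookup over whitespace-separated tokens. This assumes, without checking against text.c, that the scan walks tokens in order and reports the first table hit. The three-entry table is an excerpt for illustration and the name classify_text is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Excerpt of a keyword table: { word, description }. */
static const char *words[][2] = {
	{ "import", "Java program" },
	{ "#include", "C program" },
	{ "package", "Go program" },
	{ NULL, NULL }
};

/*
 * Return the description for the first whitespace-delimited token
 * in buf that exactly equals a table word, or NULL if none does.
 */
static const char *
classify_text(const char *buf)
{
	const char *cp = buf, *next;
	size_t i, len;

	while (*cp != '\0') {
		/* Skip leading whitespace, then find the token end. */
		while (*cp != '\0' && isspace((unsigned char)*cp))
			cp++;
		next = cp;
		while (*next != '\0' && !isspace((unsigned char)*next))
			next++;
		len = (size_t)(next - cp);

		/* Every token is compared against every table entry. */
		for (i = 0; words[i][0] != NULL; i++) {
			if (strlen(words[i][0]) == len &&
			    memcmp(cp, words[i][0], len) == 0)
				return words[i][1];
		}
		cp = next;
	}
	return NULL;
}
```

With this table, "import os" (Python) is reported as a Java program, and a Java file starting with "package com.example;" is reported as a Go program. The inner loop also shows why each added entry costs time: it runs once per token in the input.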