I would really like it if file(1) could use the upstream magic files,
but they now use a lot of regex.

Even if I just add the c-lang file, the difference is dramatic,
especially on large files:

$ time file magdir/* /etc/* /bin/* >/dev/null
    0m01.05s real     0m00.87s user     0m00.16s system
$ time ./file magdir/* /etc/* /bin/* >/dev/null
    0m04.89s real     0m04.63s user     0m00.25s system

$ time file ./post-magic >/dev/null
    0m00.54s real     0m00.45s user     0m00.10s system
$ time ./file ./post-magic >/dev/null
    0m13.31s real     0m13.20s user     0m00.08s system

If I link with PCRE instead (I am not suggesting we do this!):

$ time ./file magdir/* /etc/* /bin/* >/dev/null
    0m00.96s real     0m00.78s user     0m00.17s system
$ time ./file ./post-magic >/dev/null
    0m00.25s real     0m00.17s user     0m00.07s system

So the best way to improve our file(1) would probably be to make our
libc regex engine faster...


On Tue, Jan 15, 2019 at 07:27:15AM +0000, Nicholas Marriott wrote:
> Hi
> 
> I think I would avoid adding more of these at the moment, especially
> ones that aren't very specific (why is "package" Go and not Java?) and
> for languages that haven't been around very long, unless it is solving a
> specific problem.
> 
> The original file has moved these into the magic files and made them more
> sophisticated (Magdir/c-lang), but I doubt our regex code is fast enough
> to get away with this. It is mostly stuff like ^ and leading spaces or
> #s, though, so perhaps we could make the C searching code better; I
> just copied what our old file version did. Not sure it is worth it.
> 
> 
> On Tue, Jan 15, 2019 at 02:00:11AM -0500, Ted Unangst wrote:
> > Matteo Niccoli wrote:
> > > Didn't find any other examples. At the moment Rust code is recognized
> > > as ASCII C program text.
> > 
> > src/usr.bin/file/text.c has an array of special matches for text.
> > 
> > It has various omissions, though.
> > 
> > <!doctype html> is matched as SGML.
> > import means Java, but not Python or Go.
> > 
> > etc. I suppose it doesn't hurt to add a few more entries, but every entry
> > slows down file. So we shouldn't go too wild.
> > 
> > Anyway, this adds support for Go by matching "package". It also removes two
> > entries that result in false positives if they match too soon.
> > 
> > Index: text.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/file/text.c,v
> > retrieving revision 1.3
> > diff -u -p -r1.3 text.c
> > --- text.c  18 Apr 2017 14:16:48 -0000      1.3
> > +++ text.c  15 Jan 2019 06:58:36 -0000
> > @@ -31,14 +31,13 @@ static const char *text_words[][3] = {
> >     { "import", "Java program", "text/x-java" },
> >     { "\"libhdr\"", "BCPL program", "text/x-bcpl" },
> >     { "\"LIBHDR\"", "BCPL program", "text/x-bcpl" },
> > -   { "//", "C++ program", "text/x-c++" },
> >     { "virtual", "C++ program", "text/x-c++" },
> >     { "class", "C++ program", "text/x-c++" },
> >     { "public:", "C++ program", "text/x-c++" },
> >     { "private:", "C++ program", "text/x-c++" },
> > -   { "/*", "C program", "text/x-c" },
> >     { "#include", "C program", "text/x-c" },
> >     { "char", "C program", "text/x-c" },
> > +   { "package", "Go program", "text/x-go" },
> >     { "The", "English", "text/plain" },
> >     { "the", "English", "text/plain" },
> >     { "double", "C program", "text/x-c" },
> > 
