I think what you're seeing is that Tika selects a parser based not only on
an exact mime match but also on the super type.  So, if there's no parser
that claims to parse json, Tika sees that json is a subtype of javascript, which
is a subtype of text, so json will be parsed by the parser that handles
text.  Theoretically, we could improve the wiki page to include super
types, but, again, you'd have to rely on us maintaining that list with each
release, and you'd have to update your application.  I would not want to be
responsible for maintaining an application with an "include" listing or
"exclude" listing for which file types to send to Tika to parse.  We've
just added ~20 new mime types over the last few weeks, for example.
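
To make the fallback concrete, here is a toy model of the lookup.  It is
not Tika code: the class, the two maps and the parser names are all
hypothetical, and the hierarchy is just the json -> javascript -> text
chain described above.  It only shows why a type that no parser claims
can still end up with a parser.

```java
import java.util.Map;
import java.util.Optional;

// Toy model only: these names are hypothetical and are not Tika classes.
final class SupertypeFallback {

    // Child type -> declared supertype, following the chain described above.
    static final Map<String, String> SUPERTYPE = Map.of(
            "application/json", "application/javascript",
            "application/javascript", "text/plain");

    // Types that some parser explicitly claims (made-up parser names).
    static final Map<String, String> PARSER_FOR = Map.of(
            "text/plain", "text-parser",
            "application/pdf", "pdf-parser");

    // Walk up the supertype chain until a type with a registered parser is found.
    static Optional<String> resolve(String mimeType) {
        String current = mimeType;
        while (current != null) {
            String parser = PARSER_FOR.get(current);
            if (parser != null) {
                return Optional.of(parser);
            }
            current = SUPERTYPE.get(current); // null once the chain ends
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("application/json"));
        System.out.println(resolve("application/x-unknown"));
    }
}
```

In this model json resolves to the text parser even though nothing lists
json explicitly, which is why an allow list built only from exact types
will not match what actually gets parsed.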

If you know that certain parsers behave badly, it is probably a good idea
to turn them off.  The underlying MP4 parser that we used to depend on used
to be, um, flaky.  So, I recommended that users who didn't care about
that format turn off that parser entirely.  We've since updated the
underlying parser for that.

Mime detection is tricky.  If you have tika-parsers-standard on your
classpath, then container detection will happen, and depending on how you
load the file, Tika might load the OLE2 package (for example) into memory.
If you are able to accept coarser grained mime detection (one that can't
tell the difference between OLE2 files such as .doc and .ppt), you can use
only tika-core, and that will not do container detection.

In short, I'd encourage isolating Tika either via tika-server or tika-pipes
so that it can fail and not bring down your app.  If you know certain
parsers behave badly, turn them off entirely.
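
As a rough sketch of the tika-server route: the client below sends the
bytes to a separate tika-server process over HTTP and treats any failure
or timeout as "no text" rather than letting it take the application down.
The class and method names are hypothetical, and it assumes a tika-server
listening on localhost at its default port 9998; adjust the URL and
timeout for your setup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical client; only the JDK's java.net.http API is used.
final class IsolatedTikaClient {

    // Assumption: tika-server is running locally on its default port.
    static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    // Build a PUT request that asks the server for plain-text output.
    static HttpRequest buildRequest(byte[] content, Duration timeout) {
        return HttpRequest.newBuilder(TIKA_ENDPOINT)
                .timeout(timeout)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
                .build();
    }

    // Returns the extracted text, or empty if the server fails or times out.
    static Optional<String> extractText(HttpClient client, byte[] content, Duration timeout) {
        try {
            HttpResponse<String> response = client.send(
                    buildRequest(content, timeout), HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200
                    ? Optional.of(response.body())
                    : Optional.empty();
        } catch (IOException e) {
            // Covers connection failures and request timeouts.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}
```

The point of the design is that a parser that hangs or runs out of memory
does so in the server's JVM; the caller only ever sees an empty result.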



On Tue, Jun 20, 2023 at 7:25 AM Neha Kamat via user <[email protected]>
wrote:

> Hi team,
>
>
>
> I am currently working on an application wherein I would like to whitelist
> the filetypes supported by TIKA And discard rest of the files to avoid
> unknown behaviour/memory leaks. I am currently referring to
> https://cwiki.apache.org/confluence/display/TIKA/File+Types+and+Dependencies.
> But, when I used json, log files, I see that the content is getting
> extracted even when it is not listed under the confluence. Is file
> extension list mentioned under this confluence for standard package
> complete or it is partial?
>
>
>
> Also, I came across a function which list down supported MIME types for a
> particular parser. How would this approach behave if I submit
> untrusted/unsupported file type to TIKA for parser and supported MIME types
> detection? Would it try to load file contents in memory? Would there be a
> chance of memory leak when we try to just detect MIME type of a file using
> TIKA detect method?
>
>
>
> Thanks,
>
> Neha
>
