I think what you're seeing is that Tika selects a parser based not only on an exact mime match but also on supertype. So, if no parser claims that it parses json, Tika sees that json is a subtype of javascript, which is a subtype of text, so json will be parsed by the parser that handles text. Theoretically, we could improve the wiki page to include supertypes, but, again, you'd have to rely on us maintaining that list with each release, and you'd have to update your application each time. I would not want to be responsible for maintaining an application with an "include" or "exclude" listing of which file types to send to Tika to parse. We've added ~20 new mime types over the last few weeks, for example.
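To make the fallback concrete, here is a minimal toy sketch in plain Java. It is not Tika's code or API: the class name, the two maps, and the parser names are all made up for illustration, and the mime type strings are just my shorthand for the json -> javascript -> text chain described above. It only shows the lookup idea: try the exact type, then walk up the supertypes until some parser claims one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy model of supertype fallback. NOT Tika's implementation or API;
// all names below are hypothetical and exist only for illustration.
class SupertypeFallbackSketch {

    // Child type -> parent type, mirroring the json -> javascript -> text chain.
    static final Map<String, String> SUPERTYPES = new LinkedHashMap<>();

    // Types that some parser explicitly claims (parser names are made up).
    static final Map<String, String> PARSERS = new LinkedHashMap<>();

    static {
        SUPERTYPES.put("application/json", "application/javascript");
        SUPERTYPES.put("application/javascript", "text/plain");
        PARSERS.put("text/plain", "PlainTextParser");
        PARSERS.put("application/pdf", "PdfParser");
    }

    // Try the exact type first, then each supertype in turn.
    static Optional<String> findParser(String mimeType) {
        String current = mimeType;
        while (current != null) {
            String parser = PARSERS.get(current);
            if (parser != null) {
                return Optional.of(parser);
            }
            current = SUPERTYPES.get(current); // null once the chain ends
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // No parser claims json here, so the text parser picks it up.
        System.out.println(findParser("application/json"));      // Optional[PlainTextParser]
        // A type with no claimed supertype gets no parser in this toy model.
        System.out.println(findParser("application/x-unknown")); // Optional.empty
    }
}
```

This is why a type that is missing from the wiki page can still produce extracted content: the page lists what parsers claim directly, not everything reachable through supertypes.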
If you know that certain parsers behave badly, it is probably a good idea to turn them off. The underlying MP4 parser that we used to depend on used to be, um, flaky. So, I recommended to some users who didn't care about that format that they turn off that parser entirely. We've since updated the underlying parser for that.

Mime detection is tricky. If you have tika-parsers-standard on your path, then container detection will happen, and depending on how you load the file, Tika might load the OLE2 package (for example) into memory. If you are able to accept coarser grained mime detection (it can't tell the difference between OLE2 files such as .doc and .ppt, for example), you can use only tika-core, and that will not do container detection.

In short, I'd encourage isolating Tika either via tika-server or tika-pipes so that it can fail and not bring down your app. And if you know certain parsers behave badly, turn them off entirely.

On Tue, Jun 20, 2023 at 7:25 AM Neha Kamat via user <[email protected]> wrote:

> Hi team,
>
> I am currently working on an application wherein I would like to whitelist
> the filetypes supported by TIKA And discard rest of the files to avoid
> unknown behaviour/memory leaks. I am currently referring to
> https://cwiki.apache.org/confluence/display/TIKA/File+Types+and+Dependencies.
> But, when I used json, log files, I see that the content is getting
> extracted even when it is not listed under the confluence. Is file
> extension list mentioned under this confluence for standard package
> complete or it is partial?
>
> Also, I came across a function which list down supported MIME types for a
> particular parser. How would this approach behave if I submit
> untrusted/unsupported file type to TIKA for parser and supported MIME types
> detection? Would it try to load file contents in memory? Would there be a
> chance of memory leak when we try to just detect MIME type of a file using
> TIKA detect method?
>
> Thanks,
> Neha
