Re: How to implement an InputStream that dynamically guesses the extension of a file that is streamed using Apache Tika?

Nick Burch Sun, 14 Jan 2018 22:22:38 -0800

On Fri, 12 Jan 2018, Martin Todorov wrote:

We're working on implementing a new artifact repository manager. Most of
the files in the repositories will be binaries (usually archives such as
jar, war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily, or limited
to just these).

Unfortunately, you've picked some of the worst cases there... WAR and JAR
are both based on top of ZIP (as is DOCX, XLSX etc). There is no mime
magic difference between any of those formats. They tend to be quite
large too! Your only two ways to tell them apart are:

 * Give Tika the filename supplied by the user, hope the user didn't
   change the extension, and have Tika specialise the ZIP type into
   jar / war / xlsx based on the filename
 * Give Tika the whole File, Tika opens the zip (or ole2 etc), checks
   what entries there are, checks what's in the odd entry, and reports
   the exact type based on what was found
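
To illustrate why magic bytes alone cannot separate these formats, here is a
minimal JDK-only sketch (no Tika involved). It builds two tiny in-memory
archives, one JAR-like and one WAR-like, and shows that both start with the
same ZIP signature. The class ZipMagicSketch and its helpers looksLikeZip,
specialiseByName and zipWithEntry are hypothetical names made up for this
example, and the extension table is a simplified assumption, not Tika's
actual mapping.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipMagicSketch {

    /** True if the data starts with the ZIP local file header signature "PK\003\004". */
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
    }

    /**
     * Hypothetical helper: narrows something already known to be a ZIP
     * to a subtype purely by trusting the file name.
     */
    static String specialiseByName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".war")) return "war";
        if (lower.endsWith(".jar")) return "jar";
        if (lower.endsWith(".xlsx")) return "xlsx";
        if (lower.endsWith(".docx")) return "docx";
        return "zip";
    }

    /** Builds a tiny in-memory ZIP archive containing a single entry. */
    static byte[] zipWithEntry(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("demo".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] jarLike = zipWithEntry("META-INF/MANIFEST.MF");
        byte[] warLike = zipWithEntry("WEB-INF/web.xml");
        // Both archives carry the same signature, so magic cannot tell them apart.
        System.out.println(looksLikeZip(jarLike) + " " + looksLikeZip(warLike)); // true true
        // Only the (untrusted) file name distinguishes them at this point.
        System.out.println(specialiseByName("app.war")); // war
    }
}
```

The name-based step is only as reliable as the extension the user supplied,
which is the trade-off described in the first option above.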

Could you point us to some working examples of such partial feeding? If
anyone is willing to give this a go, we'd be quite happy as we could use
all the help! :)

I've put a little bit on the wiki at
https://wiki.apache.org/tika/TikaWithoutFiles

Basically though, Tika will make a stab with about the first 1kb of the
stream, and will only use the first 64kb of the stream for doing
mime-magic based detection. Just save / buffer / mark+reset enough for
that, pass it to Tika's DefaultDetector along with the filename (if you
trust it), and Tika will do what it can.
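
The mark+reset part can be sketched with the JDK alone. The example below is
only an illustration of peeking at the first 64kb and rewinding, not Tika
code: the class PrefixPeekSketch, the method peek and the constant
PEEK_LIMIT are hypothetical names, and the hand-off to a detector is left
out because it depends on the Tika version in use.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class PrefixPeekSketch {

    /** 64 KB, the upper bound mentioned above for magic-based detection. */
    static final int PEEK_LIMIT = 64 * 1024;

    /**
     * Reads up to `limit` bytes from the start of the stream, then rewinds
     * it so the caller can still consume the stream from the beginning.
     * The stream must support mark/reset (BufferedInputStream does).
     */
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit);
            byte[] buffer = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = in.read(buffer, total, limit - total);
                if (n < 0) {
                    break; // stream ended before the limit was reached
                }
                total += n;
            }
            in.reset(); // rewind to the marked position
            return Arrays.copyOf(buffer, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000];
        data[0] = 42;
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] head = peek(in, PEEK_LIMIT);
        System.out.println(head.length); // 65536
        System.out.println(in.read());   // 42, the stream starts over after the peek
    }
}
```

Wrapping the incoming stream in a BufferedInputStream is one way to get
mark/reset support when the underlying stream lacks it; the buffer grows up
to the mark limit, so memory use stays bounded by the 64kb prefix.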

However, for container formats like Zip-based formats (eg WAR, JAR, DOCX),
you really need to give Tika the whole File for accurate detection
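
As a rough idea of what "looking inside the container" involves, the JDK-only
sketch below walks every entry of a ZIP and guesses the subtype from
well-known entry names. It is not Tika's detection logic: the class
ZipContainerSketch, the methods classify and zipOf, and the specific rules
(WEB-INF/ means WAR, word/document.xml means DOCX, and so on) are
simplifying assumptions chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipContainerSketch {

    /** Guesses a ZIP subtype from entry names; needs the complete archive. */
    static String classify(byte[] zipBytes) {
        boolean sawManifest = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("WEB-INF/")) return "war";
                if (name.equals("word/document.xml")) return "docx";
                if (name.equals("xl/workbook.xml")) return "xlsx";
                if (name.equals("META-INF/MANIFEST.MF")) sawManifest = true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A manifest without any more specific marker is treated as a plain JAR.
        return sawManifest ? "jar" : "zip";
    }

    /** Builds an in-memory ZIP with one small entry per given name. */
    static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("demo".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(classify(zipOf("META-INF/MANIFEST.MF", "WEB-INF/web.xml")));       // war
        System.out.println(classify(zipOf("META-INF/MANIFEST.MF", "com/example/App.class"))); // jar
        System.out.println(classify(zipOf("notes.txt")));                                     // zip
    }
}
```

Note that the deciding entry can sit anywhere in the archive (here the WAR
marker comes after the manifest), which is why a fixed-size prefix of the
stream is not enough for this kind of check.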


Nick
