On Fri, 12 Jan 2018, Martin Todorov wrote:
We're working on implementing a new artifact repository manager. Most of
the files in the repositories will be binaries (usually archives such as
jar, war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily, or limited
to just these).

Unfortunately, you've picked some of the worst cases there... WAR and JAR are both built on top of ZIP (as are DOCX, XLSX, etc.). There is no MIME magic difference between any of those formats, and they tend to be quite large too! Your only two ways to tell them apart are:
 * Give Tika the filename supplied by the user, hope the user didn't
   change the extension, and have Tika specialise the ZIP type into
   jar / war / xlsx based on the filename
 * Give Tika the whole File, Tika opens the zip (or ole2 etc), checks
   what entries there are, checks what's in the odd entry, and reports
   the exact type based on what was found
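
To make the two options concrete, here is a rough sketch using only the JDK. It is not Tika's implementation: the class name, method names, returned labels and the entry-name heuristics (WEB-INF/ for WAR, META-INF/MANIFEST.MF for JAR, [Content_Types].xml plus word/ or xl/ for DOCX / XLSX) are simplified assumptions for illustration, and Tika's own container detection is more thorough.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika.
class ZipKindSketch {

    /** Option 1: trust the user-supplied filename and specialise by extension. */
    static String guessFromName(String filename) {
        String lower = filename.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".war")) return "war";
        if (lower.endsWith(".jar")) return "jar";
        if (lower.endsWith(".docx")) return "docx";
        if (lower.endsWith(".xlsx")) return "xlsx";
        return "zip"; // no better guess than the generic container
    }

    /**
     * Option 2: walk every entry of the archive and decide from what is found.
     * The caller is assumed to have already established that the stream is a ZIP.
     */
    static String guessFromEntries(InputStream in) throws IOException {
        int entries = 0;
        boolean hasManifest = false;
        boolean hasWebInf = false;
        boolean hasContentTypes = false;
        boolean hasWordDir = false;
        boolean hasXlDir = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                entries++;
                String name = e.getName();
                if (name.equals("META-INF/MANIFEST.MF")) hasManifest = true;
                if (name.startsWith("WEB-INF/")) hasWebInf = true;
                if (name.equals("[Content_Types].xml")) hasContentTypes = true;
                if (name.startsWith("word/")) hasWordDir = true;
                if (name.startsWith("xl/")) hasXlDir = true;
            }
        }
        if (entries == 0) return "unknown"; // nothing readable as ZIP entries
        if (hasContentTypes && hasWordDir) return "docx";
        if (hasContentTypes && hasXlDir) return "xlsx";
        if (hasWebInf) return "war";
        if (hasManifest) return "jar";
        return "zip";
    }
}
```

Note that the second method has to read the archive to the end, which is exactly why it is expensive for large artifacts, while the first one is free but only as reliable as the filename.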


Could you point us to some working examples of such partial feeding? If anyone is willing to give this a go, we'd be quite happy as we could use all the help! :)

I've put a little bit on the wiki at https://wiki.apache.org/tika/TikaWithoutFiles

Basically though, Tika will make a stab with about the first 1 KB of the stream, and will only use the first 64 KB of the stream for MIME-magic-based detection. Just save / buffer / mark+reset enough for that, pass it to Tika's DefaultDetector along with the filename (if you trust it), and Tika will do what it can.
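
As a rough illustration of the mark+reset approach, the sketch below peeks at up to the first 64 KB of a stream, rewinds it, and does a trivial magic check on the peeked bytes. It uses only the JDK so that it stands alone; the class and method names are made up for this example, and the ZIP signature test is only a stand-in for the magic matching that Tika's detector would do on the same buffered bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
class StreamPeekSketch {

    /** Upper bound on what is buffered, matching the 64 KB figure above. */
    static final int PEEK_LIMIT = 64 * 1024;

    /** Reads up to PEEK_LIMIT bytes, then rewinds the stream to where it started. */
    static byte[] peek(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(PEEK_LIMIT);
        byte[] buf = new byte[PEEK_LIMIT];
        int total = 0;
        try {
            int n;
            // A single read() may return fewer bytes than requested, so loop.
            while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
                total += n;
            }
        } finally {
            in.reset(); // the caller can now hand the full stream on, e.g. to storage
        }
        return Arrays.copyOf(buf, total);
    }

    /** True if the bytes start with the ZIP local file header signature "PK\003\004". */
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
    }
}
```

A caller would typically wrap the upload in a BufferedInputStream, call peek once, run detection on the returned bytes, and then continue reading the same stream from the beginning.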

However, for container formats such as the ZIP-based ones (e.g. WAR, JAR, DOCX), you really need to give Tika the whole File for accurate detection.

Nick
