Re: How to implement an InputStream that dynamically guesses the extension of a file that is streamed using Apache Tika?

Martin Todorov Thu, 11 Jan 2018 17:13:19 -0800

Hi,

Thanks for getting back to me!


We're working on implementing a new artifact repository manager. Most of
the files in the repositories will be binaries (usually archives such as
jar, war, ear, zip, tar, tar.bz2 and tar.gz, though not necessarily limited
to just these). Unfortunately, there is no guessing what artifacts people
will come up with and want to deploy. Hence our thought was to consider
Tika for this job.

Could you point us to some working examples of such partial feeding? If
anyone is willing to give this a go, we'd be quite happy as we could use
all the help! :)

Kind regards,

Martin




On Thu, Jan 11, 2018 at 8:20 PM, Allison, Timothy B. <[email protected]>
wrote:

> Hi Martin,
>
> I’m sorry for my delay.  As a first pass at an answer… We have roughly
> four mechanisms for file id:
>
>
>
>    1. mime patterns (magic mime)
>    2. package detection
>    3. parse-time sub-type detection
>    4. file name extension (completely useless for your purposes)
>
>
>
>    1. You should be able to use the mime patterns in a buffered single
>    read.  Buffer the first 1024 bytes or so and run our mime detection.
>    2. We are currently opening the zip/package file and looking for
>    particular files within it (e.g. for docx, xlsx, etc.), which
>    requires the whole file and cannot be done by our current methods in
>    a streaming fashion.  I don’t see a way around parsing the
>    package/container file.
>    3. IIRC, some of our parsers update the mime based on knowledge of
>    that particular format’s subtypes/actually parsing the file (doc, ppt
>    and …?) …so these would be a non-starter.
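
To make point 1 concrete, here is a minimal sketch of detection on a
buffered prefix. It is a hand-rolled stand-in that uses only the JDK, not
Tika's own detector. The class and method names (MagicSniffer,
guessExtension) are made up for illustration, and the signature table only
covers four well-known formats.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: matches a few well-known magic
// numbers against the first bytes of a file.
final class MagicSniffer {

    private static final byte[] GZIP = {0x1f, (byte) 0x8b};
    private static final byte[] BZIP2 = "BZh".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] USTAR = "ustar".getBytes(StandardCharsets.US_ASCII);

    private MagicSniffer() {
    }

    /** Returns a guessed extension for the prefix, or null if nothing matches. */
    static String guessExtension(byte[] prefix) {
        if (matchesAt(prefix, 0, GZIP)) {
            return "gz";
        }
        if (matchesAt(prefix, 0, BZIP2)) {
            return "bz2";
        }
        if (matchesAt(prefix, 0, ZIP)) {
            return "zip";
        }
        // The ustar marker sits at offset 257 of the first tar header block,
        // so the prefix has to be at least 262 bytes long for this check.
        if (matchesAt(prefix, 257, USTAR)) {
            return "tar";
        }
        return null;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that jar, war and ear files all come back as "zip" here, and a tar.gz
comes back as "gz". Telling those apart means looking inside the container,
which is the limitation described in point 2.
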
>
>
>
> Regrettably, AFAIK, at least from a Tika perspective, there is no silver
> bullet.
>
>
>
> Instead of having to spool the complete file to memory (or disk) and then
> run detection (or having Tika do that) for every file, I wonder if you
> could run 1) (mime magic detection) on the stream, and, if that returns
> something obvious, go with that, otherwise spool to disk and then run
> regular Tika on that subset of files.
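
One way to get at the first bytes without a second read is to capture them
while the stream is being consumed anyway. The following is a rough sketch
under that assumption. PrefixCapturingInputStream is a hypothetical name, it
uses only the JDK, and the prefix size is whatever the caller picks (1024
was the figure mentioned above).

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical wrapper: copies the first prefixSize bytes that pass through
// read() into a side buffer, so detection can run on them later.
final class PrefixCapturingInputStream extends FilterInputStream {

    private final byte[] prefix;
    private int captured;

    PrefixCapturingInputStream(InputStream in, int prefixSize) {
        super(in);
        this.prefix = new byte[prefixSize];
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0 && captured < prefix.length) {
            prefix[captured++] = (byte) b;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && captured < prefix.length) {
            int toCopy = Math.min(n, prefix.length - captured);
            System.arraycopy(buf, off, prefix, captured, toCopy);
            captured += toCopy;
        }
        return n;
    }

    // Rewinding would capture the same bytes twice, so advertise no mark
    // support. Bytes passed over with skip() are not captured either.
    @Override
    public boolean markSupported() {
        return false;
    }

    /** True once the side buffer holds prefixSize bytes. */
    boolean isPrefixComplete() {
        return captured == prefix.length;
    }

    /** Returns a copy of the bytes captured so far. */
    byte[] capturedPrefix() {
        return Arrays.copyOf(prefix, captured);
    }
}
```

The caller would read the stream to its destination as usual, hand
capturedPrefix() to the magic detection, and only fall back to full
detection on the stored file when the result is inconclusive. As far as I
know the Tika facade class has detect overloads that accept a byte prefix,
but please check the Javadoc rather than take that from this sketch.
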
>
>
>
> Nick Burch will probably have better insight on this than my ramblings
> above.
>
>
>
> *From:* Martin Todorov [mailto:[email protected]]
> *Sent:* Thursday, January 4, 2018 8:48 PM
> *To:* [email protected]
> *Subject:* How to implement an InputStream that dynamically guesses the
> extension of a file that is streamed using Apache Tika?
>
>
>
>
>
> Hi,
>
>
>
> I have asked this on Stackoverflow
> <https://stackoverflow.com/questions/48102004/how-to-implement-an-inputstream-that-dynamically-guesses-the-extension-of-a-file>
>  and
> was pointed here, with the hope that more people would be able to help.
>
>
>
> We have a custom implementation of an InputStream that can currently
> update multiple MessageDigest-s while reading the data. This allows for
> a single reading and processing of the data and avoids having to re-read
> files in order to calculate their checksums. This is quite efficient and
> saves time (and is implemented in here
> <https://github.com/strongbox/strongbox/blob/9dcb13255512cd396e63f712bb5ce82bb632726c/strongbox-storage/strongbox-storage-core/src/main/java/org/carlspring/strongbox/io/ArtifactInputStream.java>
> ).
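
For readers who do not want to follow the link, the general idea of a
stream that feeds several digests in one pass can be sketched as below.
This is not the linked ArtifactInputStream, only a simplified illustration
using the JDK. The names MultiDigestInputStream and hexDigest are made up.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: every byte read through this stream is also
// fed to each configured MessageDigest, so checksums need no second read.
final class MultiDigestInputStream extends FilterInputStream {

    private final Map<String, MessageDigest> digests = new LinkedHashMap<>();

    MultiDigestInputStream(InputStream in, String... algorithms)
            throws NoSuchAlgorithmException {
        super(in);
        for (String algorithm : algorithms) {
            digests.put(algorithm, MessageDigest.getInstance(algorithm));
        }
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            for (MessageDigest digest : digests.values()) {
                digest.update((byte) b);
            }
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            for (MessageDigest digest : digests.values()) {
                digest.update(buf, off, n);
            }
        }
        return n;
    }

    // Rewinding would feed the same bytes to the digests twice.
    @Override
    public boolean markSupported() {
        return false;
    }

    /** Finishes the named digest (which also resets it) and returns it as hex. */
    String hexDigest(String algorithm) {
        byte[] raw = digests.get(algorithm).digest();
        StringBuilder hex = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Call hexDigest only after the stream has been read to the end, since
finishing a digest resets it.
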
>
>
>
> As a follow-up step, we'd like to use Apache Tika to guess the file
> extension from the stream, which is sent over HTTP. I know some of you will
> suggest simply setting the Content-Type header and requiring that it's set,
> but, unfortunately, for various reasons, we cannot rely on this, or enforce
> it. Hence, I'm looking for a way to guess the extension based on the
> InputStream, while it's being sent.
>
>
>
> We also need to be able to guess complex extension types (such as tar.gz,
> tar.bz2 and other similar ones that aren't easy to guess by just doing a
> substring from the last index of the dot until the end of the string).
>
>
>
> What is the most efficient way to do this? We cannot afford to read
> whole files into memory, as the application will have to be able to handle a
> large number of concurrent requests. Could somebody please provide an
> example of how this could be done?
>
>
>
> We have an open issue <https://github.com/strongbox/strongbox/issues/370> and
> a pull request here
> <https://github.com/strongbox/strongbox/pull/468/files#diff-8024b836036b6f5fb567a3ce48c2a4d6R221>,
> if anyone would like to have a closer look and help out.
>
>
>
> Looking forward to your suggestions and replies!
>
> Kind regards,
>
>
>
> Martin Todorov
>
>
>
>
>
>
>
