Re: How to implement an InputStream that dynamically guesses the extension of a file that is streamed using Apache Tika?

Martin Todorov Thu, 11 Jan 2018 17:35:11 -0800

Hi Maxim,

I would prefer to be able to do it in-memory with some buffered parts of
the file, while reading the stream, if possible...
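
To illustrate the buffered approach, here is a rough sketch using only the
JDK. It is an assumption-laden illustration, not working repository code:
the PrefixSniffer class and its method names are made up, and the
hand-rolled magic-byte check only stands in for a real detector. If I read
the Tika javadoc correctly, the Tika facade's detect(byte[] prefix) could
take the place of guess() here.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Tika or of our code base. */
final class PrefixSniffer {

    /** Number of leading bytes to buffer; 1024 is the figure suggested in this thread. */
    static final int PREFIX_LENGTH = 1024;

    private PrefixSniffer() {}

    /**
     * Reads up to PREFIX_LENGTH bytes, then rewinds the stream so the caller
     * can still consume it from the first byte (e.g. to update digests).
     */
    static byte[] peek(BufferedInputStream in) throws IOException {
        in.mark(PREFIX_LENGTH);
        byte[] buffer = new byte[PREFIX_LENGTH];
        int total = 0;
        int n;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (total < buffer.length
                && (n = in.read(buffer, total, buffer.length - total)) != -1) {
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buffer, total);
    }

    /** Stand-in for a real detector: checks a few well-known magic numbers. */
    static String guess(byte[] prefix) {
        if (startsWith(prefix, 0x1f, 0x8b)) {
            return "application/gzip";
        }
        if (startsWith(prefix, 'B', 'Z', 'h')) {
            return "application/x-bzip2";
        }
        if (startsWith(prefix, 'P', 'K', 0x03, 0x04)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xff) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The idea would be to call peek() once on the wrapped request stream, pass
the result to the detector, and then carry on reading the same stream as
usual. Note the limits: jar, war and ear would all come back as plain zip,
and a gzip prefix alone does not tell tar.gz apart from any other gzip
file, so those cases would still need something more than this.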


Kind regards,

Martin




On Fri, Jan 12, 2018 at 1:26 AM, Maxim Solodovnik <[email protected]>
wrote:

> Hello Martin,
>
> Using Tika, you can guess the MIME type (we are using this code [1] for
> this). I'm creating temp files because the InputStream is repositioned
> during type guessing; maybe there are better solutions...
>
> On Fri, Jan 12, 2018 at 8:12 AM, Martin Todorov <[email protected]>
> wrote:
> >
> >
> > Hi,
> >
> > Thanks for getting back to me!
> >
> > We're working on implementing a new artifact repository manager. Most
> > of the files in the repositories will be binaries (usually archives
> > such as jar, war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily
> > limited to just these). Unfortunately, there is no guessing what
> > artifacts people will come up with and want to deploy. Hence our
> > thought was to consider Tika for this job.
> >
> > Could you point us to some working examples of such partial feeding?
> > If anyone is willing to give this a go, we'd be quite happy as we
> > could use all the help! :)
> >
> > Kind regards,
> >
> > Martin
> >
> >
> >
> >
> > On Thu, Jan 11, 2018 at 8:20 PM, Allison, Timothy B. <[email protected]>
> > wrote:
> >>
> >> Hi Martin,
> >>
> >> I’m sorry for my delay.  As a first pass at an answer…We have roughly
> >> three mechanisms for file id:
> >>
> >>
> >>
> >> 1) mime patterns (magic mime)
> >> 2) package detection
> >> 3) parse-time sub-type detection
> >> 4) file name extension (completely useless for your purposes)
> >>
> >>
> >>
> >> You should be able to use the mime patterns in a buffered single
> >> read. Buffer the first 1024 bytes or so and run our mime detection.
> >>
> >> We are currently opening the zip/package file and looking for
> >> particular files within the zip/package files (e.g. docx, xlsx, etc.),
> >> which requires the whole file and cannot be done by our current
> >> methods in a streaming fashion. I don’t see a way around parsing the
> >> package/container file.
> >>
> >> IIRC, some of our parsers update the mime based on knowledge of that
> >> particular format’s subtypes/actually parsing the file (doc, ppt and
> >> …?), so these would be a non-starter.
> >>
> >>
> >>
> >> Regrettably, AFAIK, at least from a Tika perspective, there is no silver
> >> bullet.
> >>
> >>
> >>
> >> Instead of having to spool the complete file to memory (or disk) and
> >> then run detection (or having Tika do that) for every file, I wonder
> >> if you could run 1) (mime magic detection) on the stream, and, if
> >> that returns something obvious, go with that, otherwise spool to disk
> >> and then run regular Tika on that subset of files.
> >>
> >>
> >>
> >> Nick Burch will probably have better insight on this than my ramblings
> >> above.
> >>
> >>
> >>
> >> From: Martin Todorov [mailto:[email protected]]
> >> Sent: Thursday, January 4, 2018 8:48 PM
> >> To: [email protected]
> >> Subject: How to implement an InputStream that dynamically guesses the
> >> extension of a file that is streamed using Apache Tika?
> >>
> >>
> >>
> >>
> >>
> >> Hi,
> >>
> >>
> >>
> >> I have asked this on Stack Overflow and was pointed here, with the hope
> >> that more people would be able to help.
> >>
> >>
> >>
> >> We have a custom implementation of an InputStream that can currently
> >> update multiple MessageDigest-s while reading the data. This allows
> >> for a single reading and processing of the data and avoids having to
> >> re-read files in order to calculate their checksums. This is quite
> >> efficient and saves time (and is implemented in here).
> >>
> >>
> >>
> >> As a follow-up step, we'd like to use Apache Tika to guess the file
> >> extension from the stream, which is sent over HTTP. I know some of
> >> you will suggest simply setting the Content-Type header and requiring
> >> that it's set, but, unfortunately, for various reasons, we cannot rely
> >> on this, or enforce it. Hence, I'm looking for a way to guess the
> >> extension based on the InputStream, while it's being sent.
> >>
> >>
> >>
> >> We also need to be able to guess complex extension types (such as
> >> tar.gz, tar.bz2 and other similar ones that aren't easy to guess by
> >> just doing a substring from the last index of the dot until the end
> >> of the string).
> >>
> >>
> >>
> >> What is the most efficient way to do this? We cannot afford to read
> >> whole files into memory, as the application will have to be able to
> >> handle a large number of concurrent requests. Could somebody please
> >> provide an example of how this could be done?
> >>
> >>
> >>
> >> We have an open issue and a pull request here, if anyone would like to
> >> have a closer look and help out.
> >>
> >>
> >>
> >> Looking forward to your suggestions and replies!
> >>
> >> Kind regards,
> >>
> >>
> >>
> >> Martin Todorov
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>
>
>
> --
> WBR
> Maxim aka solomax
>
