Hi Maxim,

I would prefer to be able to do it in-memory with some buffered parts of the file, while reading the stream, if possible...

Kind regards,

Martin

On Fri, Jan 12, 2018 at 1:26 AM, Maxim Solodovnik <[email protected]> wrote:
> Hello Martin,
>
> using Tika you can guess the mime type (we are using this code [1] for this).
> I'm creating temp files because the InputStream is re-positioned during
> type guessing; maybe there are better solutions ...
>
> On Fri, Jan 12, 2018 at 8:12 AM, Martin Todorov <[email protected]> wrote:
> >
> > Hi,
> >
> > Thanks for getting back to me!
> >
> > We're working on implementing a new artifact repository manager. Most of the
> > files in the repositories will be binaries (usually archives such as jar,
> > war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily, or limited to just
> > these). Unfortunately, there is no guessing what artifacts people will come
> > up with and want to deploy. Hence our thought was to consider Tika for this
> > job.
> >
> > Could you point us to some working examples of such partial feeding? If
> > anyone is willing to give this a go, we'd be quite happy as we could use all
> > the help! :)
> >
> > Kind regards,
> >
> > Martin
> >
> > On Thu, Jan 11, 2018 at 8:20 PM, Allison, Timothy B. <[email protected]> wrote:
> >>
> >> Hi Martin,
> >>
> >> I’m sorry for my delay. As a first pass at an answer… We have roughly
> >> three mechanisms for file id:
> >>
> >> 1) mime patterns (magic mime)
> >> 2) package detection
> >> 3) parse-time sub-type detection
> >> 4) file name extension (completely useless for your purposes)
> >>
> >> You should be able to use the mime patterns in a buffered single read.
> >> Buffer the first 1024 bytes or so and run our mime detection.
> >>
> >> We are currently opening the zip/package file and looking for particular
> >> files within the zip/package files, e.g. docx, xlsx… etc., which requires the
> >> whole file and cannot be done by our current methods in a streaming fashion.
> >> I don’t see a way around parsing the package/container file.
> >>
> >> IIRC, some of our parsers update the mime based on knowledge of that
> >> particular format’s subtypes/actually parsing the file (doc, ppt and …?) …so
> >> these would be a non-starter.
> >>
> >> Regrettably, AFAIK, at least from a Tika perspective, there is no silver
> >> bullet.
> >>
> >> Instead of having to spool the complete file to memory (or disk) and then
> >> run detection (or having Tika do that) for every file, I wonder if you could
> >> run 1) (mime magic detection) on the stream, and, if that returns something
> >> obvious, go with that, otherwise spool to disk and then run regular Tika on
> >> that subset of files.
> >>
> >> Nick Burch will probably have better insight on this than my ramblings
> >> above.
> >>
> >> From: Martin Todorov [mailto:[email protected]]
> >> Sent: Thursday, January 4, 2018 8:48 PM
> >> To: [email protected]
> >> Subject: How to implement an InputStream that dynamically guesses the
> >> extension of a file that is streamed using Apache Tika?
> >>
> >> Hi,
> >>
> >> I have asked this on Stack Overflow and was pointed here, with the hope
> >> that more people would be able to help.
> >>
> >> We have a custom implementation of an InputStream that can currently
> >> update multiple MessageDigest-s while reading the data. This allows for
> >> a single reading and processing of the data and avoids having to re-read
> >> files in order to calculate their checksums. This is quite efficient and
> >> saves time (and is implemented in here).
> >>
> >> As a follow-up step, we'd like to use Apache Tika to guess the file
> >> extension from the stream, which is sent over HTTP. I know some of you will
> >> suggest simply setting the Content-Type header and requiring that it's set,
> >> but, unfortunately, for various reasons, we cannot rely on this, or enforce
> >> it. Hence, I'm looking for a way to guess the extension based on the
> >> InputStream, while it's being sent.
> >>
> >> We also need to be able to guess complex extension types (such as tar.gz,
> >> tar.bz2 and other similar ones that aren't easy to guess by just doing a
> >> substring from the last index of the dot until the end of the string).
> >>
> >> What is the most efficient way to do this? We cannot afford to read the
> >> whole files in memory, as the application will have to be able to handle a
> >> large number of concurrent requests. Could somebody please provide an
> >> example of how this could be done?
> >>
> >> We have an open issue and a pull request here, if anyone would like to
> >> have a closer look and help out.
> >>
> >> Looking forward to your suggestions and replies!
> >>
> >> Kind regards,
> >>
> >> Martin Todorov
> >>
>
> --
> WBR
> Maxim aka solomax
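
For anyone reading this thread later, here is a rough sketch of the "buffer the first 1024 bytes or so while reading" idea discussed above, combined with the checksum-while-reading stream. It is not code from the project or from Tika: `HeadCapturingDigestStream` and `MagicSniffer` are hypothetical names, and the three hard-coded signatures (gzip, bzip2, zip) only stand in for real mime-magic detection so that the example runs with the JDK alone. In practice the captured head would presumably be handed to Tika's mime-magic detection instead (I believe the `Tika` facade accepts a `byte[]` prefix, but treat that as an assumption to verify).

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical helper: updates a digest and keeps a copy of the first bytes read. */
class HeadCapturingDigestStream extends FilterInputStream {
    private final MessageDigest digest;
    private final byte[] head;
    private int headLength;

    HeadCapturingDigestStream(InputStream in, int headSize, String algorithm)
            throws NoSuchAlgorithmException {
        super(in);
        this.digest = MessageDigest.getInstance(algorithm);
        this.head = new byte[headSize];
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            observe(new byte[] {(byte) b}, 0, 1);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            observe(buf, off, n);
        }
        return n;
    }

    // mark/reset would let bytes be observed twice, so this sketch disables them.
    @Override
    public boolean markSupported() {
        return false;
    }

    private void observe(byte[] buf, int off, int n) {
        digest.update(buf, off, n);
        int copy = Math.min(n, head.length - headLength);
        if (copy > 0) {
            System.arraycopy(buf, off, head, headLength, copy);
            headLength += copy;
        }
    }

    /** Bytes captured so far (at most headSize). */
    byte[] head() {
        return Arrays.copyOf(head, headLength);
    }

    /** Final checksum; call once, after the stream has been fully read. */
    byte[] checksum() {
        return digest.digest();
    }
}

/** Hypothetical stand-in for mime-magic detection; knows only three signatures. */
class MagicSniffer {
    static String guess(byte[] head) {
        if (startsWith(head, 0x1f, 0x8b)) return "application/gzip";
        if (startsWith(head, 'B', 'Z', 'h')) return "application/x-bzip2";
        if (startsWith(head, 'P', 'K', 3, 4)) return "application/zip";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xff) != magic[i]) return false;
        }
        return true;
    }
}
```

The caller reads the stream once (for example while writing the upload to storage), then asks for `head()` and `checksum()` at the end. Two caveats: the outer magic bytes cannot tell a tar.gz from a plain .gz, since that needs a look at the decompressed head, and zip-based formats (jar, war, docx) still need the package-level detection described above, so an inconclusive result would fall back to spooling to disk. Bytes passed over with `skip()` are not observed by this sketch.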
