Hello Martin,

using Tika you can guess the MIME type (we are using this code [1] for this). I'm creating temp files because the InputStream is re-positioned during type guessing; maybe there are better solutions ...
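To illustrate the "buffer the first 1024 bytes or so" idea from Tim's quoted reply below, here is a minimal sketch in plain Java. It is not our code [1] and it does not call Tika: the class MagicSniffer, its PEEK_LIMIT constant and the few hard-coded signatures are hypothetical stand-ins for real magic detection. It marks a buffered stream, reads the head and resets, so the stream can still be consumed from the first byte afterwards without a temp file. As far as I know, Tika's own detect(InputStream) does the same mark/reset when the stream supports it, so wrapping the upload in a BufferedInputStream may be enough there too.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical, hand-rolled stand-in for magic-byte detection; NOT Tika's API. */
final class MagicSniffer {

    /** How many leading bytes to peek at ("1024 bytes or so" in the thread). */
    static final int PEEK_LIMIT = 1024;

    /**
     * Reads up to PEEK_LIMIT bytes, rewinds the stream, and classifies the head.
     * The stream must support mark/reset (e.g. a BufferedInputStream).
     */
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[PEEK_LIMIT];
        int n = 0;
        in.mark(PEEK_LIMIT);
        try {
            int r;
            // A single read() may return fewer bytes than requested, so loop.
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        } finally {
            in.reset(); // the caller still sees the stream from its first byte
        }
        return classify(head, n);
    }

    /** Checks a few well-known signatures; the returned names are illustrative. */
    static String classify(byte[] b, int n) {
        if (matchesAt(b, n, 0, 0x1f, 0x8b)) return "application/gzip";
        if (matchesAt(b, n, 0, 'B', 'Z', 'h')) return "application/x-bzip2";
        if (matchesAt(b, n, 0, 'P', 'K', 0x03, 0x04)) return "application/zip";
        // POSIX tar stores "ustar" at offset 257 of the first header block.
        if (matchesAt(b, n, 257, 'u', 's', 't', 'a', 'r')) return "application/x-tar";
        return "application/octet-stream"; // unknown: fall back to full detection
    }

    private static boolean matchesAt(byte[] b, int n, int offset, int... magic) {
        if (n < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((b[offset + i] & 0xff) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeGzip = {(byte) 0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(fakeGzip));
        System.out.println(sniff(in));
        System.out.println(in.read()); // first byte is still available after sniffing
    }
}
```

Note that a gzip or bzip2 result only identifies the outer layer; telling tar.gz from a plain .gz would need the head to be decompressed first. Zip-based formats (jar, war, ear, docx) all share the same PK signature, which is why the package detection Tim mentions needs the whole file.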
On Fri, Jan 12, 2018 at 8:12 AM, Martin Todorov <[email protected]> wrote:
>
> Hi,
>
> Thanks for getting back to me!
>
> We're working on implementing a new artifact repository manager. Most of the
> files in the repositories will be binaries (usually archives such as jar,
> war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily, or limited to just
> these). Unfortunately, there is no guessing what artifacts people will come
> up with and want to deploy. Hence our thought was to consider Tika for this
> job.
>
> Could you point us to some working examples of such partial feeding? If
> anyone is willing to give this a go, we'd be quite happy as we could use all
> the help! :)
>
> Kind regards,
>
> Martin
>
> On Thu, Jan 11, 2018 at 8:20 PM, Allison, Timothy B. <[email protected]>
> wrote:
>>
>> Hi Martin,
>>
>> I’m sorry for my delay. As a first pass at an answer…We have roughly
>> three mechanisms for file id:
>>
>> 1) mime patterns (magic mime)
>> 2) package detection
>> 3) parse-time sub-type detection
>> 4) file name extension (completely useless for your purposes)
>>
>> You should be able to use the mime patterns in a buffered single read.
>> Buffer the first 1024 bytes or so and run our mime detection.
>>
>> We are currently opening the zip/package file and looking for particular
>> files within the zip/package files e.g. docx, xlsx…etc, which requires the
>> whole file and cannot be done by our current methods in a streaming fashion.
>> I don’t see a way around parsing the package/container file.
>>
>> IIRC, some of our parsers update the mime based on knowledge of that
>> particular format’s subtypes/actually parsing the file (doc, ppt and …?) …so
>> these would be a non-starter.
>>
>> Regrettably, AFAIK, at least from a Tika perspective, there is no silver
>> bullet.
>>
>> Instead of having to spool the complete file to memory (or disk) and then
>> run detection (or having Tika do that) for every file, I wonder if you could
>> run 1) (mime magic detection) on the stream, and, if that returns something
>> obvious, go with that, otherwise spool to disk and then run regular Tika on
>> that subset of files.
>>
>> Nick Burch will probably have better insight on this than my ramblings
>> above.
>>
>> From: Martin Todorov [mailto:[email protected]]
>> Sent: Thursday, January 4, 2018 8:48 PM
>> To: [email protected]
>> Subject: How to implement an InputStream that dynamically guesses the
>> extension of a file that is streamed using Apache Tika?
>>
>> Hi,
>>
>> I have asked this on Stackoverflow and was pointed here, with the hope
>> that more people would be able to help.
>>
>> We have a custom implementation of an InputStream that can currently
>> update multiple MessageDigest-s while reading the data. This allows for
>> a single reading and processing of the data and avoids having to re-read
>> files in order to calculate their checksums. This is quite efficient and
>> saves time (and is implemented in here).
>>
>> As a follow-up step, we'd like to use Apache Tika to guess the file
>> extension from the stream, which is sent over HTTP. I know some of you will
>> suggest simply setting the Content-Type header and requiring that it's set,
>> but, unfortunately, for various reasons, we cannot rely on this, or enforce
>> it. Hence, I'm looking for a way to guess the extension based on the
>> InputStream, while it's being sent.
>>
>> We also need to be able to guess complex extension types (such as tar.gz,
>> tar.bz2 and other similar ones that aren't easy to guess by just doing a
>> substring from the last index of the dot until the end of the string).
>>
>> What is the most-efficient way to do this? We cannot afford to read the
>> whole files in memory, as the application will have to be able to handle a
>> large number of concurrent requests. Could somebody please provide an
>> example of how this could be done?
>>
>> We have an open issue and a pull request here, if anyone would like to
>> have a closer look and help out.
>>
>> Looking forward to your suggestions and replies!
>>
>> Kind regards,
>>
>> Martin Todorov

--
WBR
Maxim aka solomax
