Hello Martin,

Using Tika you can guess the mime type (we are using this code [1] for this).
I'm creating temp files because the InputStream is re-positioned during
type guessing; maybe there are better solutions ...
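
One possible alternative to temp files, at least for the magic-byte case,
is to peek at the head of the stream with mark/reset. The sketch below is
only an illustration under my own assumptions: the class HeaderSniffer,
its methods and the three hard-coded signatures are hypothetical stand-ins,
not Tika code. In real use Tika's own detection would take the place of
guess() (as far as I remember, the Tika facade has a detect(byte[] prefix)
overload for exactly this kind of buffered prefix). The 1024-byte limit is
the figure mentioned further down in this thread.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: peeks at the first bytes of a stream without consuming them.
final class HeaderSniffer {

    // How many leading bytes to inspect (assumption: 1024, as suggested in the thread).
    static final int PEEK_LIMIT = 1024;

    /**
     * Reads up to PEEK_LIMIT bytes, rewinds the stream and returns a guessed
     * mime type, or null if no known signature matched.
     */
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int len = 0;
        try {
            // read() may deliver fewer bytes than requested, so loop until full or EOF.
            int n;
            while (len < head.length && (n = in.read(head, len, head.length - len)) != -1) {
                len += n;
            }
        } finally {
            in.reset(); // the caller sees the stream from its original position
        }
        return guess(head, len);
    }

    // Stand-in for a real detector: checks three well-known magic numbers only.
    static String guess(byte[] b, int len) {
        if (startsWith(b, len, 0x1f, 0x8b)) return "application/gzip";
        if (startsWith(b, len, 'B', 'Z', 'h')) return "application/x-bzip2";
        if (startsWith(b, len, 'P', 'K', 3, 4)) return "application/zip";
        return null;
    }

    // A zip header covers jar, war, ear, docx and so on, so it is not a final answer;
    // such files (and unknown ones) would still have to be spooled and inspected fully.
    static boolean isConclusive(String type) {
        return type != null && !type.equals("application/zip");
    }

    private static boolean startsWith(byte[] b, int len, int... magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((b[i] & 0xff) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeGzip = {(byte) 0x1f, (byte) 0x8b, 8, 0, 0, 0};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(fakeGzip));
        String type = sniff(in);
        System.out.println(type + ", conclusive=" + isConclusive(type));
        // The first byte is still available after sniffing.
        System.out.println("next byte: 0x" + Integer.toHexString(in.read()));
    }
}
```

Only streams for which isConclusive() returns false would then need the
temp-file route. Note that a gzip signature alone does not tell tar.gz
apart from a plain .gz; that would need a look at the decompressed data,
which this sketch does not attempt.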

On Fri, Jan 12, 2018 at 8:12 AM, Martin Todorov <[email protected]> wrote:
>
>
> Hi,
>
> Thanks for getting back to me!
>
> We're working on implementing a new artifact repository manager. Most of the
> files in the repositories will be binaries (usually archives such as jar,
> war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily, or limited to just
> these). Unfortunately, there is no guessing what artifacts people will come
> up with and want to deploy. Hence our thought was to consider Tika for this
> job.
>
> Could you point us to some working examples of such partial feeding? If
> anyone is willing to give this a go, we'd be quite happy as we could use all
> the help! :)
>
> Kind regards,
>
> Martin
>
>
>
>
> On Thu, Jan 11, 2018 at 8:20 PM, Allison, Timothy B. <[email protected]>
> wrote:
>>
>> Hi Martin,
>>
>> I’m sorry for my delay. As a first pass at an answer… we have roughly
>> three mechanisms for file id:
>>
>>
>>
>> 1) mime patterns (magic mime)
>> 2) package detection
>> 3) parse-time sub-type detection
>> 4) file name extension (completely useless for your purposes)
>>
>>
>>
>> 1) You should be able to use the mime patterns in a buffered single read.
>> Buffer the first 1024 bytes or so and run our mime detection.
>>
>> 2) We are currently opening the zip/package file and looking for particular
>> files within the zip/package files, e.g. docx, xlsx, etc., which requires the
>> whole file and cannot be done by our current methods in a streaming fashion.
>> I don’t see a way around parsing the package/container file.
>>
>> 3) IIRC, some of our parsers update the mime based on knowledge of that
>> particular format’s subtypes/actually parsing the file (doc, ppt and …?) …so
>> these would be a non-starter.
>>
>>
>>
>> Regrettably, AFAIK, at least from a Tika perspective, there is no silver
>> bullet.
>>
>>
>>
>> Instead of having to spool the complete file to memory (or disk) and then
>> run detection (or having Tika do that) for every file, I wonder if you could
>> run 1) (mime magic detection) on the stream, and, if that returns something
>> obvious, go with that, otherwise spool to disk and then run regular Tika on
>> that subset of files.
>>
>>
>>
>> Nick Burch will probably have better insight on this than my ramblings
>> above.
>>
>>
>>
>> From: Martin Todorov [mailto:[email protected]]
>> Sent: Thursday, January 4, 2018 8:48 PM
>> To: [email protected]
>> Subject: How to implement an InputStream that dynamically guesses the
>> extension of a file that is streamed using Apache Tika?
>>
>>
>>
>>
>>
>> Hi,
>>
>>
>>
>> I have asked this on Stackoverflow and was pointed here, with the hope
>> that more people would be able to help.
>>
>>
>>
>> We have a custom implementation of an InputStream that can currently
>> update multiple MessageDigest-s while reading the data. This allows for
>> a single reading and processing of the data and avoids having to re-read
>> files in order to calculate their checksums. This is quite efficient and
>> saves time (and is implemented in here).
>>
>>
>>
>> As a follow-up step, we'd like to use Apache Tika to guess the file
>> extension from the stream, which is sent over HTTP. I know some of you will
>> suggest simply setting the Content-Type header and requiring that it's set,
>> but, unfortunately, for various reasons, we cannot rely on this, or enforce
>> it. Hence, I'm looking for a way to guess the extension based on the
>> InputStream, while it's being sent.
>>
>>
>>
>> We also need to be able to guess complex extension types (such as tar.gz,
>> tar.bz2 and other similar ones that aren't easy to guess by just doing a
>> substring from the last index of the dot until the end of the string).
>>
>>
>>
>> What is the most-efficient way to do this? We cannot afford to read the
>> whole files in memory, as the application will have to be able to handle a
>> large number of concurrent requests. Could somebody please provide an
>> example, of how this could be done?
>>
>>
>>
>> We have an open issue and a pull request here, if anyone would like to
>> have a closer look and help out.
>>
>>
>>
>> Looking forward to your suggestions and replies!
>>
>> Kind regards,
>>
>>
>>
>> Martin Todorov
>>
>>
>>
>>
>>
>>
>
>



-- 
WBR
Maxim aka solomax
