Hi Martin,
I’m sorry for my delay.  As a first pass at an answer: we have roughly four 
mechanisms for file id:


  1.  mime patterns (magic mime)
  2.  package detection
  3.  parse-time sub-type detection
  4.  file name extension (completely useless for your purposes)


  1.  You should be able to use the mime patterns in a buffered single read.  
Buffer the first 1024 bytes or so and run our mime detection.
  2.  We currently open the zip/package file and look for particular files 
within it (e.g. for docx, xlsx, etc.), which requires the whole file and cannot 
be done by our current methods in a streaming fashion.  I don’t see a way 
around parsing the package/container file.
  3.  IIRC, some of our parsers update the mime based on knowledge of that 
particular format’s subtypes/actually parsing the file (doc, ppt and …?) …so 
these would be a non-starter.
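
A rough sketch of the buffered single read in 1), using only the JDK.  The 
class name HeaderSniffer and the tiny signature table in guessType are 
hypothetical stand-ins for illustration; in practice Tika's own mime magic 
detection would take the place of guessType.  The point is the mark/reset 
pattern: peek at the first 1024 bytes, then rewind so the rest of the pipeline 
still sees the whole stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Optional;

public class HeaderSniffer {
    public static final int HEADER_LIMIT = 1024;

    /** Reads up to HEADER_LIMIT bytes, then rewinds the stream to where it started. */
    public static byte[] peekHeader(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(HEADER_LIMIT);
        byte[] buf = new byte[HEADER_LIMIT];
        int total = 0;
        while (total < HEADER_LIMIT) {
            int n = in.read(buf, total, HEADER_LIMIT - total);
            if (n < 0) {
                break; // stream is shorter than the header limit
            }
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    /** Hypothetical stand-in for real magic detection: a few well-known signatures only. */
    public static Optional<String> guessType(byte[] header) {
        if (startsWith(header, 0x1f, 0x8b)) return Optional.of("application/gzip");
        if (startsWith(header, 'B', 'Z', 'h')) return Optional.of("application/x-bzip2");
        if (startsWith(header, '%', 'P', 'D', 'F', '-')) return Optional.of("application/pdf");
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xff) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {0x1f, (byte) 0x8b, 8, 0, 0, 0};
        // BufferedInputStream adds mark/reset support to streams that lack it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(payload));
        System.out.println(guessType(peekHeader(in)).orElse("unknown"));
    }
}
```

Wrapping the incoming stream in a BufferedInputStream is what makes the rewind 
possible without holding more than the header in memory.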

Regrettably, AFAIK, at least from a Tika perspective, there is no silver bullet.

Instead of having to spool the complete file to memory (or disk) and then run 
detection (or having Tika do that) for every file, I wonder if you could run 1) 
(mime magic detection) on the stream, and, if that returns something obvious, 
go with that, otherwise spool to disk and then run regular Tika on that subset 
of files.
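
To make that idea a bit more concrete, here is a minimal sketch, assuming Java 
9+ (for readNBytes).  The names SniffOrSpool, Result and obviousType are made 
up for illustration, and the two hard-coded signatures merely stand in for 
whatever the mime magic detection reports.  A type counts as "obvious" only 
when the header alone settles it; anything else (including zip-based 
containers, which need the package detection in 2)) is spooled to a temp file 
for a regular Tika run.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SniffOrSpool {
    /** Exactly one of (type + stream) or spooled is set. */
    public static final class Result {
        public final String type;        // detected type, or null if not obvious
        public final InputStream stream; // rewound stream, when no spooling was needed
        public final Path spooled;       // temp file, when full detection is required

        Result(String type, InputStream stream, Path spooled) {
            this.type = type;
            this.stream = stream;
            this.spooled = spooled;
        }
    }

    public static Result handle(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(1024);
        byte[] head = new byte[1024];
        int len = in.readNBytes(head, 0, head.length);
        in.reset(); // rewind so nothing read for detection is lost

        String type = obviousType(head, len);
        if (type != null) {
            return new Result(type, in, null); // keep streaming, no spooling
        }
        // Not obvious from the header: spool to disk for full detection later.
        Path tmp = Files.createTempFile("spool-", ".bin");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return new Result(null, null, tmp);
    }

    /** Hypothetical stand-in for magic detection; returns null when unsure. */
    static String obviousType(byte[] h, int len) {
        if (len >= 2 && (h[0] & 0xff) == 0x1f && (h[1] & 0xff) == 0x8b) {
            return "application/gzip";
        }
        if (len >= 3 && h[0] == 'B' && h[1] == 'Z' && h[2] == 'h') {
            return "application/x-bzip2";
        }
        return null; // e.g. "PK" (zip) could be docx, xlsx, jar, ...
    }
}
```

Only the files that fall through to the spooled branch pay the cost of being 
written to disk; the caller is responsible for deleting the temp file 
afterwards.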

Nick Burch will probably have better insight on this than my ramblings above.

From: Martin Todorov [mailto:[email protected]]
Sent: Thursday, January 4, 2018 8:48 PM
To: [email protected]
Subject: How to implement an InputStream that dynamically guesses the extension 
of a file that is streamed using Apache Tika?


Hi,

I have asked this on 
Stackoverflow<https://stackoverflow.com/questions/48102004/how-to-implement-an-inputstream-that-dynamically-guesses-the-extension-of-a-file>
 and was pointed here, with the hope that more people would be able to help.

We have a custom implementation of an InputStream that can currently update 
multiple MessageDigest-s while reading the data. This allows for a single 
reading and processing of the data and avoids having to re-read files in order 
to calculate their checksums. This is quite efficient and saves time (and is 
implemented 
here<https://github.com/strongbox/strongbox/blob/9dcb13255512cd396e63f712bb5ce82bb632726c/strongbox-storage/strongbox-storage-core/src/main/java/org/carlspring/strongbox/io/ArtifactInputStream.java>).

As a follow-up step, we'd like to use Apache Tika to guess the file extension 
from the stream, which is sent over HTTP. I know some of you will suggest 
simply setting the Content-Type header and requiring that it's set, but, 
unfortunately, for various reasons, we cannot rely on this, or enforce it. 
Hence, I'm looking for a way to guess the extension based on the InputStream, 
while it's being sent.

We also need to be able to guess complex extension types (such as tar.gz, 
tar.bz2 and other similar ones that aren't easy to guess by just doing a 
substring from the last index of the dot until the end of the string).

What is the most efficient way to do this? We cannot afford to read whole 
files into memory, as the application will have to be able to handle a large 
number of concurrent requests. Could somebody please provide an example of how 
this could be done?

We have an open issue<https://github.com/strongbox/strongbox/issues/370> and a 
pull request 
here<https://github.com/strongbox/strongbox/pull/468/files#diff-8024b836036b6f5fb567a3ce48c2a4d6R221>,
 if anyone would like to have a closer look and help out.

Looking forward to your suggestions and replies!
Kind regards,

Martin Todorov


