Re: Problem detecting Microsoft Office formats from InputStream

Jukka Zitting Sun, 23 Sep 2012 12:34:13 -0700

Hi,

On Sun, Sep 23, 2012 at 8:07 PM, naskoo <[email protected]> wrote:
> Thanks for the suggestion. That way the problem is solved at some point.
> I run some more tests, but this time I removed the ms file extensions.
> I get the same not consistent results as before, even if I use
> TikaInputStream as a wrapper.
> Probably TikaInputStream just adds some metadata to include the file
> extension in the detection.


It doesn't add extra metadata (unless explicitly requested). Instead
the TikaInputStream class allows Tika parsers and detectors to use
random access for reading the underlying file.

The MS Office detectors (and a few other features in Tika) rely on
that functionality, and thus won't give as accurate results when given
just a plain InputStream instance.
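
To illustrate why a stream prefix alone is not enough: legacy Word,
Excel and PowerPoint files all start with the same eight-byte OLE2
container signature, and the entries that tell them apart sit in the
container's directory, which can be located anywhere in the file. The
sketch below is not Tika's detector code. It is a minimal JDK-only
example, and the class and method names (Ole2Sniff, looksLikeOle2) are
made up for illustration. It shows how far a prefix check on a plain
InputStream can get.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: checks only the stream prefix.
public class Ole2Sniff {
    // Signature shared by all OLE2 compound files (.doc, .xls, .ppt, ...).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the stream starts with the OLE2 signature.
    // The stream is rewound afterwards, so the caller can still read it.
    static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        try {
            byte[] head = in.readNBytes(OLE2_MAGIC.length);
            return Arrays.equals(head, OLE2_MAGIC);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake 512-byte header that carries only the signature.
        byte[] header = new byte[512];
        System.arraycopy(OLE2_MAGIC, 0, header, 0, OLE2_MAGIC.length);
        try (InputStream in =
                 new BufferedInputStream(new ByteArrayInputStream(header))) {
            // A "true" here means "some OLE2 container", nothing more specific.
            System.out.println(looksLikeOle2(in));
        }
    }
}
```

A positive result only narrows the type to "OLE2 container". Deciding
between Word, Excel and PowerPoint needs the directory entries, which
is where random access to the underlying file comes in.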

BR,

Jukka Zitting
