Hi,

On Sun, Sep 23, 2012 at 8:07 PM, naskoo <[email protected]> wrote:
> Thanks for the suggestion. That way the problem is solved at some point.
> I run some more tests, but this time I removed the ms file extensions.
> I get the same not consistent results as before, even if I use
> TikaInputStream as a wrapper.
> Probably TikaInputStream just adds some metadata to include the file
> extension in the detection.

It doesn't add extra metadata (unless explicitly requested). Instead the
TikaInputStream class allows Tika parsers and detectors to use random
access for reading the underlying file. The MS Office detectors (and a
few other features in Tika) rely on that functionality, and thus won't
give as accurate results when given just a plain InputStream instance.

BR,

Jukka Zitting
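
To illustrate the difference between a plain stream and random access, here is a minimal sketch that uses only the Java standard library. It is not Tika's implementation. The class name RandomAccessSketch, the helpers peekPrefix and readAt, and the "ENTRY" marker are all hypothetical names made up for this example. The assumption is simply that a detector working on a plain InputStream can only buffer a bounded prefix and rewind, while a file-backed reader can seek to any offset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class RandomAccessSketch {

    // Hypothetical helper: look at up to `limit` leading bytes of a plain
    // stream, then rewind so a later consumer still sees the whole content.
    // Anything beyond `limit` is out of reach for this kind of check.
    static byte[] peekPrefix(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream does not support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int n = in.readNBytes(buf, 0, limit); // Java 9+
        in.reset();
        return Arrays.copyOf(buf, n);
    }

    // Hypothetical helper: read `length` bytes at an arbitrary offset of a
    // file. This is the kind of access a file-backed stream makes possible.
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up test data: 8 KiB of zeros with a marker far from the start.
        byte[] data = new byte[8192];
        byte[] marker = "ENTRY".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(marker, 0, data, 4096, marker.length);

        Path tmp = Files.createTempFile("random-access-sketch", ".bin");
        try {
            Files.write(tmp, data);

            // The prefix check never reaches offset 4096.
            byte[] prefix = peekPrefix(new ByteArrayInputStream(data), 512);
            System.out.println("prefix bytes available: " + prefix.length);

            // The random-access read goes straight to the marker.
            byte[] found = readAt(tmp, 4096, marker.length);
            System.out.println("bytes at offset 4096: "
                    + new String(found, StandardCharsets.US_ASCII));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The point of the sketch is only that information located deep inside a file is invisible to a bounded prefix check, which is consistent with the detectors being less accurate on a plain InputStream.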
