Hi,
On Tue, Apr 17, 2012 at 6:06 PM, Taylor, Wade <[email protected]> wrote:
> Hi, thanks for the tips. I opened the XML file with a hex editor and did
> find 3 control characters at the beginning: 0xEF, 0xBB, 0xBF.
That's the UTF-8 byte order mark. I guess Tika should be able to deal
with that, but AFAICT it currently doesn't. Would you mind filing a
bug report about this?
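Until that is fixed in Tika itself, a workaround might be to strip the BOM before handing the stream to the detector. Here is a minimal sketch using only the JDK. The class and method names (BomSkipper, skipUtf8Bom) are made up for illustration and are not part of Tika:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not a Tika class.
public class BomSkipper {

    /**
     * Returns a stream positioned just after a leading UTF-8 byte order
     * mark (EF BB BF). If the stream does not start with a BOM, the bytes
     * that were read are pushed back so nothing is lost.
     */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // read() may return fewer bytes than requested, so loop until we
        // have three bytes or hit end of stream.
        int n = 0;
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

You could then try something like new Tika().detect(BomSkipper.skipUtf8Bom(stream)) and see whether the detected type changes.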
> Then I went back to my code and ran it against the fixed XML file:
>
> new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))
>
> but it still detects it as "text/plain".
Hmm, can you verify that the returned input stream actually contains
what you expect it to?
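One quick way to check is to dump the first few bytes of the stream as hex. Note that getResourceAsStream() returns null when the resource is not found on the classpath, so that case is worth ruling out too. A rough sketch follows; StreamPeek and hexPrefix are hypothetical names of my own:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical debugging helper, not a Tika class.
public class StreamPeek {

    /**
     * Reads up to 'limit' bytes from the stream and returns them as
     * space-separated uppercase hex. This consumes the bytes, so open a
     * fresh stream for the actual detection afterwards.
     */
    public static String hexPrefix(InputStream in, int limit) throws IOException {
        if (in == null) {
            return "(null stream: resource not found)";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < limit; i++) {
            int b = in.read();
            if (b == -1) {
                break; // end of stream reached before 'limit' bytes
            }
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

For example, System.out.println(StreamPeek.hexPrefix(getClass().getResourceAsStream("/xml/sample_fixed.wde"), 16)) should start with the bytes of "<?xml" (3C 3F 78 6D 6C) if the fixed file is really the one being loaded, and with EF BB BF if the BOM is still there.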
Also, you can check how Tika's detection of a full file (with the
extra file name hint) differs from its detection of a plain byte
stream by comparing the output of the following two commands:
    java -jar tika-app-1.1.jar --detect sample_fixed.wde
    java -jar tika-app-1.1.jar --detect < sample_fixed.wde
BR,
Jukka Zitting