Re: Problem detecting XML

Jukka Zitting Tue, 17 Apr 2012 09:34:13 -0700

Hi,

On Tue, Apr 17, 2012 at 6:06 PM, Taylor, Wade <[email protected]> wrote:
> Hi, thanks for the tips. I opened the XML file with a hex editor and did
> find 3 control characters at the beginning: 0xEF, 0xBB, 0xBF.


That's the UTF-8 byte order mark. I guess Tika should be able to deal
with that, but AFAICT it currently doesn't. Would you mind filing a
bug report about this?
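
Until that is handled inside Tika, one possible workaround is to skip the BOM yourself before passing the stream on for detection. The following is only a minimal sketch using the standard library; the class and method names (BomSkipper, skipUtf8Bom) are made up for illustration and are not part of Tika. The result is wrapped in a BufferedInputStream on the assumption that the consumer wants mark/reset support.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Tika.
final class BomSkipper {
    // The three bytes of the UTF-8 byte order mark.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == UTF8_BOM.length;
        for (int i = 0; isBom && i < UTF8_BOM.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        // Not a BOM: put the bytes back so the caller sees the stream unchanged.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        // BufferedInputStream adds mark/reset support on top of the pushback stream.
        return new BufferedInputStream(pushback);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read());
    }
}
```

The returned stream could then be handed to detect() in place of the raw resource stream.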

> Then I went back to my code and ran it against the fixed XML file:
>
> new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))
>
> but it still detects it as "text/plain".

Hmm, can you verify that the returned input stream actually contains
what you expect it to?
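
One way to check is to dump the first few bytes of the stream as hex and compare them with what the hex editor shows. This is only a rough sketch; StreamPeek and hexHead are illustrative names, and the resource path is the one from your snippet. Note that getResourceAsStream returns null when the resource is not found on the classpath, so that case is worth ruling out too.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical debugging helper, not part of Tika.
final class StreamPeek {
    /** Formats up to max leading bytes of the stream as space-separated hex. */
    static String hexHead(InputStream in, int max) throws IOException {
        if (in == null) {
            return "<null stream: resource not found>";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < max; i++) {
            int b = in.read();
            if (b < 0) {
                break; // end of stream reached before max bytes
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Same resource path as in the quoted code; adjust as needed.
        try (InputStream in = StreamPeek.class.getResourceAsStream("/xml/sample_fixed.wde")) {
            System.out.println(hexHead(in, 16));
        }
    }
}
```

If the output still begins with EF BB BF, the stream is not serving the fixed file.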

Also, you can see how Tika's detection of a full file (with the extra
file name hint) differs from its detection of a plain byte stream by
comparing the output of the following two commands:

    java -jar tika-app-1.1.jar --detect sample_fixed.wde
    java -jar tika-app-1.1.jar --detect < sample_fixed.wde

BR,

Jukka Zitting
