Hi, thanks for the tips. I opened the XML file with a hex editor and did
find 3 extra bytes at the beginning: 0xEF, 0xBB, 0xBF (the UTF-8 byte
order mark). When I remove them and run:

java -jar tika-app-1.1.jar --detect sample_fixed.xml

it outputs "application/xml".
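For reference, the same three bytes can be skipped in code instead of being edited out by hand. This is only a minimal sketch using the standard library; the class name BomStripper and the method stripUtf8Bom are hypothetical names made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper (not part of Tika): skips a leading UTF-8 BOM if present.
final class BomStripper {
    private static final int BOM_LENGTH = 3;

    static InputStream stripUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, BOM_LENGTH);
        byte[] head = new byte[BOM_LENGTH];

        // read() may return fewer bytes than requested, so loop until
        // three bytes are read or the stream ends.
        int read = 0;
        while (read < BOM_LENGTH) {
            int n = pushback.read(head, read, BOM_LENGTH - read);
            if (n < 0) {
                break;
            }
            read += n;
        }

        boolean isBom = read == BOM_LENGTH
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: put the bytes back so the caller sees the stream unchanged.
        if (!isBom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }
}
```

Note that PushbackInputStream does not support mark/reset, so the result may need to be wrapped in a BufferedInputStream if the consumer relies on that.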

I then renamed the file to sample_fixed.wde to make sure that detection
was based on the byte stream rather than on the file extension. Running:

java -jar tika-app-1.1.jar --detect sample_fixed.wde

also outputs "application/xml".

Then I went back to my code and ran it against the fixed XML file:

new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))

but it still detects it as "text/plain".
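One way to narrow this down is to dump the first few bytes of the resource exactly as the code loads it, and compare them with what the hex editor shows for the file given to tika-app. The sketch below uses only the standard library; the class name LeadingBytes and the method hexPrefix are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical diagnostic helper: formats the first bytes of a stream as hex.
final class LeadingBytes {
    static String hexPrefix(InputStream in, int count) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            int b = in.read();
            if (b < 0) {
                break; // stream ended before 'count' bytes were read
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

Calling LeadingBytes.hexPrefix(getClass().getResourceAsStream("/xml/sample_fixed.wde"), 8) from the same class should show whether the classpath copy still starts with EF BB BF. If it does, the code is presumably reading a different copy of the file than the one passed on the command line, though that is only a guess.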


Any idea why using Tika.detect() reports a different type than Tika-app?


Regards,
Wade


On Tue, Apr 17, 2012 at 9:50 AM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Tue, Apr 17, 2012 at 3:32 PM, Uwe Schindler <[email protected]> wrote:
> > I think the problem is that the detection does not see the filename. If you
> > pass an InputStream to the detection method, you should also pass metadata
> > (including the file name).
>
> Tika should have no trouble detecting XML also from just the byte stream.
>
> A typical reason why an XML document is detected as text/plain is that
> it's actually not valid XML, either because of some well-formedness
> issue (unclosed tags) or because of some extra characters, as
> suggested by Nick.
>
> BR,
>
> Jukka Zitting
>
