Hi, thanks for the tips. I opened the XML file with a hex editor and
found three bytes at the beginning: 0xEF, 0xBB, 0xBF (the UTF-8 byte
order mark). When I remove them and run:
java -jar tika-app-1.1.jar --detect sample_fixed.xml
it outputs "application/xml".
I then changed the file name to sample_fixed.wde to ensure that the byte
stream was being used in the detection. Running:
java -jar tika-app-1.1.jar --detect sample_fixed.wde
also outputs "application/xml".
Then I went back to my code and ran it against the fixed XML file:
new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))
but it still detects it as "text/plain".
Any idea why Tika.detect() reports a different type than tika-app does?
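One check that might rule out a stale copy on the classpath is to confirm
that the bytes delivered by getResourceAsStream() no longer start with
0xEF, 0xBB, 0xBF. Below is a minimal sketch using only the JDK; the class
name BomCheck and the helper hasUtf8Bom are made up for illustration and
are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class BomCheck {
    // Hypothetical helper: true if the given bytes begin with the
    // UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // Same resource path as in the detect() call above.
        try (InputStream in =
                BomCheck.class.getResourceAsStream("/xml/sample_fixed.wde")) {
            if (in == null) {
                System.out.println("resource not found on the classpath");
                return;
            }
            // Read up to three bytes; read() may return fewer than requested.
            byte[] head = new byte[3];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // end of stream before three bytes were read
                }
                n += r;
            }
            System.out.println("BOM present: "
                    + hasUtf8Bom(Arrays.copyOf(head, n)));
        }
    }
}
```

If this reports that the BOM is still present, the resource on the
classpath is probably not the edited file (for example, an older copy in
the build output).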
Regards,
Wade
On Tue, Apr 17, 2012 at 9:50 AM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Tue, Apr 17, 2012 at 3:32 PM, Uwe Schindler <[email protected]> wrote:
> > I think the problem is that the detection does not see the filename.
> > If you pass an InputStream to the detection method, you should also
> > pass metadata (including the file name).
>
> Tika should have no trouble detecting XML also from just the byte stream.
>
> A typical reason why an XML document is detected as text/plain is
> that it's not actually valid XML, either because of some
> well-formedness issue (unclosed tags) or because of some extra
> characters, as suggested by Nick.
>
> BR,
>
> Jukka Zitting
>