Hi, I still can't get Tika to detect the right mime-type. When I use
Tika-app it returns the correct mime-type, so I dug into the source and I
can't see what's different. Since I couldn't get that to work I went back
to basics and tried a simple XML string:

new Tika().detect(new ByteArrayInputStream(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes()));

but this gets detected as "text/plain" too and I can't figure out why it's
not coming back as "application/xml".
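
To rule out the UTF-8 byte order mark (0xEF, 0xBB, 0xBF) and to see what a
stream really starts with, here is a rough sketch that uses only java.io. The
class BomCheck and its methods are names I made up for illustration; none of
this is part of Tika, and I'm assuming a stray BOM is the only thing in the way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Tika: inspects and skips a UTF-8 byte order mark.
class BomCheck {

    // True if the first len bytes of head start with the UTF-8 BOM (EF BB BF).
    static boolean hasUtf8Bom(byte[] head, int len) {
        return len >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    // Formats up to len leading bytes as space-separated hex, e.g. "EF BB BF 3C".
    static String toHex(byte[] data, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len && i < data.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Returns a stream positioned just after the BOM if one is present,
    // otherwise positioned at the original start of the data.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;   // stream ended before three bytes were read
            n += r;
        }
        if (!hasUtf8Bom(head, n) && n > 0) {
            pushback.unread(head, 0, n);   // no BOM: put the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(toHex(sample, 4));             // EF BB BF 3C
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(sample));
        System.out.println((char) in.read());             // <
    }
}
```

The stream returned by skipUtf8Bom could then be handed to
new Tika().detect(...) in place of the raw one, to check whether the BOM alone
explains the "text/plain" result.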


Regards,
Wade




On Tue, Apr 17, 2012 at 12:33 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Tue, Apr 17, 2012 at 6:06 PM, Taylor, Wade <[email protected]> wrote:
> > Hi, thanks for the tips. I opened the XML file with a hex editor and did
> > find 3 control characters at the beginning: 0xEF, 0xBB, 0xBF.
>
> That's the UTF-8 byte order mark. I guess Tika should be able to deal
> with that, but AFAICT it currently doesn't. Would you mind filing a
> bug report about this?
>
> > Then I went back to my code and ran it against the fixed XML file:
> >
> > new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))
> >
> > but it still detects it as "text/plain".
>
> Hmm, can you verify that the returned input stream actually contains
> what you expect it to?
>
> Also, you can check the difference of how Tika detects full files
> (with the extra file name hint) and plain byte streams by comparing
> the output of the following two commands:
>
>    java -jar tika-app-1.1.jar --detect sample_fixed.wde
>    java -jar tika-app-1.1.jar --detect < sample_fixed.wde
>
> BR,
>
> Jukka Zitting
>
