Hi Claudio

> By decoupling fetching and parsing, as you advised me, I could see that
> the tika parser is actually looping on some files (which the file
> command sees as "data" files) that apache is declaring with a mime type
> plain/text, fooling tika. I guess tika should be able to handle this
> error which is not.
>

The parameter 'mime.type.magic' specifies whether Tika's MIME type
detection is used or not. The default value is true, so assuming that this
is what you are using, Tika would treat the MIME type advertised by the
server as a hint only.

Try Tika directly with

    curl URL | java -jar tika-app-0.8-SNAPSHOT.jar

to see how it behaves without being given a hint. Are the URLs you
mentioned publicly available? If so, please file a JIRA issue in Tika
describing the problem.
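
It may also help to compare what the server declares with what the file
command makes of the actual bytes, before involving Tika at all. The
following is only a rough sketch: the helper names declared_type and
compare_types and the file name sample.bin are made up for illustration,
and it assumes curl, awk and file are installed.

```shell
#!/bin/sh
# Sketch only: the helper names and sample.bin are illustrative,
# they are not part of Nutch or Tika.

# Read HTTP response headers on stdin and print the bare Content-Type value
# (parameters such as "; charset=UTF-8" are dropped).
declared_type() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" { sub(/;.*/, "", $2); print $2; exit }'
}

# Print the type the server declares, then the type file(1) guesses
# from the downloaded bytes.
compare_types() {
  url="$1"
  printf 'declared: %s\n' "$(curl -sI "$url" | declared_type)"
  curl -s "$url" -o sample.bin
  printf 'detected: %s\n' "$(file --mime-type -b sample.bin)"
}

# Usage: compare_types URL
```

If the two lines disagree for the files on which the parser loops, that
would be worth including in the JIRA report.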

Thanks

J.

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
