Hi Claudio,

> By decoupling fetching and parsing, as you advised me, I could see that
> the Tika parser is actually looping on some files (which the file
> command sees as "data" files) that Apache is declaring with a MIME type
> of text/plain, fooling Tika. I guess Tika should be able to handle this
> error, which it does not.
>
The parameter 'mime.type.magic' specifies whether Tika's MIME type detection is used. The default value is true, so assuming that is what you are using, Tika would treat the MIME type advertised by the server as a hint only.

Try Tika directly with

    curl URL | java -jar tika-app-0.8-SNAPSHOT.jar

to see how it behaves without being given a hint.

Are the URLs you mentioned publicly available? If so, please file a JIRA in Tika to describe the issue.

Thanks

J.

-- 
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
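For reference, a minimal sketch of how the 'mime.type.magic' setting mentioned above could be set explicitly. It assumes the usual Hadoop-style property layout and that overrides go in conf/nutch-site.xml; the value shown is simply the default described above, so it is only worth adding if you want to be sure nothing else has changed it.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>mime.type.magic</name>
  <!-- true: let Tika's own detection run, with the server's Content-Type used only as a hint -->
  <value>true</value>
</property>
```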

