Hi! I have tested the Tika client for extraction of content, metadata and language and I'm really happy with the results.
When extracting larger numbers of documents, I think it would be worthwhile for performance reasons to avoid starting the client three times per document (once each for content, metadata and language), since every start also launches a new Java virtual machine.

I was thinking about running Tika as a daemon and pushing document paths to it, so that it returns the metadata, content and language in a single response.

Is there a best practice for this? Maybe a servlet/JSP solution? Does the current Tika release include an out-of-the-box solution for it? (On this topic I only found https://issues.apache.org/jira/browse/TIKA-169, which is pretty old and has "won't fix" status.)

Thanks!
Marian
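
P.S. To make the daemon idea more concrete, here is a rough sketch of the request/response shape I have in mind. It is only an illustration, not anything Tika ships: it is written in Python with the standard library, and `extract`, `handle_line`, `PathHandler`, `serve` and port 9090 are all names and values I made up. The `extract` function is a placeholder where a long-running process would call into Tika once per document instead of starting a new JVM three times.

```python
import json
import socketserver
from pathlib import Path


def extract(path):
    """Hypothetical placeholder for the real Tika calls.

    A real daemon would reuse one long-lived parser here to get content,
    metadata and language in a single pass.
    """
    data = Path(path).read_bytes()
    return {
        "content": data.decode("utf-8", errors="replace"),
        "metadata": {"resourceName": Path(path).name, "size": len(data)},
        "language": "unknown",  # placeholder: no language detection is done
    }


def handle_line(line, extractor=extract):
    """Turn one request line (a document path) into one JSON response line."""
    path = line.strip()
    try:
        result = {"path": path, "ok": True, **extractor(path)}
    except OSError as exc:
        # Unreadable or missing files are reported instead of killing the daemon.
        result = {"path": path, "ok": False, "error": str(exc)}
    return json.dumps(result)


class PathHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Protocol assumption: one path per line in, one JSON object per line out.
        for raw in self.rfile:
            line = raw.decode("utf-8")
            if line.strip():
                self.wfile.write((handle_line(line) + "\n").encode("utf-8"))


def serve(host="127.0.0.1", port=9090):
    """Run until interrupted; the process starts once and stays up."""
    with socketserver.ThreadingTCPServer((host, port), PathHandler) as server:
        server.serve_forever()
```

A client would call `serve()` once, then keep one connection open and write one document path per line, reading one JSON line back for each. The point is only that the startup cost is paid once rather than three times per document.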
