We've had a great deal of success running Tika inside a Solr server as
a document extractor (I believe Solr calls this feature Solr Cell).
http://wiki.apache.org/solr/ExtractingRequestHandler
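
For illustration, here is a minimal sketch of asking Solr Cell to extract
a document without indexing it. It assumes a Solr instance at
http://localhost:8983/solr with the extracting handler mapped to
/update/extract; the host, port and the helper names extract_url and
extract are placeholders of mine, not anything Solr ships. The
extractOnly, extractFormat and wt parameters are the handler's own, but
the exact layout of the response depends on the Solr version, so the
sketch simply returns the parsed JSON.

```python
import json
import urllib.parse
import urllib.request


def extract_url(base_url: str) -> str:
    """Build the extract-only URL for a Solr base URL (assumed layout)."""
    params = {
        "extractOnly": "true",    # return the extraction, do not index it
        "extractFormat": "text",  # plain text instead of XHTML
        "wt": "json",             # ask for a JSON response
    }
    return base_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def extract(path: str,
            base_url: str = "http://localhost:8983/solr",
            content_type: str = "application/octet-stream") -> dict:
    """POST the raw file to Solr and return the parsed JSON response."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        extract_url(base_url),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Because the Solr server keeps running, the JVM and Tika are started once
and every further document only costs one HTTP request.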
Cheers
Hayden
On 24/06/11 18:31, Marian Steinbach wrote:
Hi!
I have tested the Tika client for extraction of content, metadata and
language and I'm really happy with the results.
When extracting from larger numbers of documents, though, I think it
would be worthwhile for performance reasons to avoid starting the
client three times per document (once each for content, metadata and
language), since every start also launches a new Java virtual machine.
I was thinking about having Tika running as a daemon and pushing
document path info to it, in order to get the metadata, content and
language as a response.
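
To make the idea concrete, here is a rough sketch of what the client
side could look like. Everything about the protocol is hypothetical:
I am assuming a daemon that accepts one JSON object per line containing
a "path" field and answers with one JSON line carrying "content",
"metadata" and "language". The class name ExtractionClient and the
connect helper are made up for this example as well.

```python
import json
import socket


class ExtractionClient:
    """Talks to a hypothetical long-running extraction daemon.

    Assumed protocol: one JSON request per line, one JSON reply per line.
    """

    def __init__(self, sock: socket.socket):
        self._sock = sock
        # Keep a single buffered reader for the lifetime of the connection.
        self._reader = sock.makefile("r", encoding="utf-8", newline="\n")

    def extract(self, path: str) -> dict:
        """Send one document path and wait for the daemon's reply."""
        request = json.dumps({"path": path}) + "\n"
        self._sock.sendall(request.encode("utf-8"))
        line = self._reader.readline()
        if not line:
            raise ConnectionError("daemon closed the connection without replying")
        return json.loads(line)

    def close(self) -> None:
        self._reader.close()
        self._sock.close()


def connect(host: str = "localhost", port: int = 9999) -> ExtractionClient:
    """Open a connection; host and port are placeholders."""
    return ExtractionClient(socket.create_connection((host, port)))
```

The point of the design is that the connection, and the JVM behind it,
stays open across documents, so one request returns all three results
instead of three separate client starts.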
Is there a best practice for this? Maybe a servlet/JSP solution? Does
the current Tika release include an out-of-the-box solution for that?
(I only found https://issues.apache.org/jira/browse/TIKA-169 on this
topic, which is pretty old and has "won't fix" status.)
Thanks!
Marian