On Thu, Dec 19, 2013 at 10:01 AM, Charlie Hull <char...@flax.co.uk> wrote:
> On 18/12/2013 09:03, Alexandre Rafalovitch wrote:
>
>> Charlie,
>>
>> Does it mean you are talking to it from a client program? Or are you
>> running Tika in a listen/server mode and build some adapters for standard
>> Solr processes?
>
> If we're writing indexers in Python we usually run Tika as a server -
> which means we can try to restart it if it fails to respond, usually
> because it's eaten something that disagreed with it! We'd then submit the
> extracted text to Solr.

We're also running Tika as a server, using tika-app.*.jar. There is also a tika-server.*.jar, which provides an HTTP interface (instead of the raw TCP interface offered by tika-app), but we opted for tika-app.

We have not seen any need to restart the Tika server process. There are cases where it takes so long to reply that we abandon the request, and tika-app seems to handle that well (i.e., it does not seem to get stuck afterwards).

There are some semi-tricky details to using Tika in server mode (involving blocking and deadlocks, and the possibility that Tika loops on certain documents), but we have been able to feed ~1M documents through a single Tika server process without restarting it.

Note that, in some cases, the XHTML output from Tika is incorrect, so we've had to switch to HTML output and a more forgiving parser.
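
For anyone wondering what the blocking and deadlock details look like from a Python client, here is a minimal sketch, not our actual code. It assumes a tika-app process is already listening in server mode on port 9998 (a made-up port), and that the server reads the document until the client closes its write side, then writes the extracted output back on the same connection. The function name `extract_via_tcp` and its parameters are hypothetical.

```python
import socket
import threading
import time


def extract_via_tcp(data, host="localhost", port=9998, timeout=60.0):
    """Send document bytes over raw TCP and return the bytes written back.

    Returns None if no complete reply arrives within `timeout` seconds,
    so the caller can abandon the document and move on to the next one.
    The host/port defaults are assumptions, not Tika-mandated values.
    """
    deadline = time.monotonic() + timeout
    with socket.create_connection((host, port), timeout=timeout) as sock:

        def send_document():
            try:
                sock.sendall(data)
                # Half-close: marks end of input but keeps the read side open.
                sock.shutdown(socket.SHUT_WR)
            except OSError:
                pass  # socket timed out or was closed; the reader handles it

        # Send on a separate thread so output can be read concurrently.
        # Writing the whole document first and only then reading can
        # deadlock: the server may block writing output nobody is reading
        # while the client blocks sending input nobody is consuming.
        threading.Thread(target=send_document, daemon=True).start()

        chunks = []
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None  # overall deadline passed; give up on this document
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(65536)
            except socket.timeout:
                return None
            if not chunk:  # server closed the connection: reply is complete
                return b"".join(chunks)
            chunks.append(chunk)
```

The deadline covers the whole exchange rather than each individual read, so a document that makes the server loop while it trickles out output is still abandoned eventually. Connection errors (for example, nothing listening on the port) are left to propagate to the caller.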