Re: Solr hanging when extracting a some broken .doc files

Raymond Wiker Thu, 19 Dec 2013 01:47:34 -0800

On Thu, Dec 19, 2013 at 10:01 AM, Charlie Hull <char...@flax.co.uk> wrote:

> On 18/12/2013 09:03, Alexandre Rafalovitch wrote:
>
>> Charlie,
>>
>> Does it mean you are talking to it from a client program? Or are you
>> running Tika in a listen/server mode and build some adapters for standard
>> Solr processes?
>>
>
> If we're writing indexers in Python we usually run Tika as a server -
> which means we can try to restart it if it fails to respond, usually
> because it's eaten something that disagreed with it! We'd then submit the
> extracted text to Solr.
>
>
>
We're also running Tika as a server, using tika-app.*.jar. There is also a
tika-server.*.jar, which gives an HTTP interface (instead of the raw TCP
interface offered by tika-app), but we opted to use tika-app.

We have not seen any need to restart the tika server process, although
there are cases where it takes so long to reply that we abandon the
request - tika-app seems to handle that well (i.e., it does not seem to
get stuck afterwards).

There are some semi-tricky details to using tika in server mode (involving
blocking & deadlocks, and the possibility that tika loops on certain
documents), but we have been able to feed ~1M documents through a single
tika server process without restarting it.
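
To illustrate the blocking and timeout points above, here is a minimal
client sketch in Python. It is not the code used by anyone in this thread.
It assumes a server that reads a document from a raw TCP connection until
the client closes its sending side, then writes the extracted output and
closes the connection. The function name, host, port and timeout values
are all hypothetical. The document is sent from a separate thread so that
the client keeps reading while it writes; otherwise both ends can block
on full socket buffers.

```python
import socket
import threading
import time


def extract_via_tcp(data, host="localhost", port=9998, timeout=60.0):
    """Send document bytes to a raw-TCP extraction server; return its reply.

    Host, port and timeout are illustrative assumptions. Raises
    socket.timeout if the whole exchange takes longer than `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    with socket.create_connection((host, port), timeout=timeout) as sock:

        def send_all():
            try:
                sock.sendall(data)
                # Half-close: tells the server the document is complete.
                sock.shutdown(socket.SHUT_WR)
            except OSError:
                # Socket timed out or was closed; the reader reports it.
                pass

        # Write in the background so reading can start immediately.
        threading.Thread(target=send_all, daemon=True).start()

        chunks = []
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise socket.timeout("no complete reply within %.1fs" % timeout)
            sock.settimeout(remaining)
            chunk = sock.recv(65536)  # raises socket.timeout when too slow
            if not chunk:             # server closed: output is complete
                break
            chunks.append(chunk)
        return b"".join(chunks)
```

A caller would catch socket.timeout, log the offending document and move
on to the next one, which corresponds to abandoning the request.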

Note that, in some cases, the xhtml output from tika is incorrect, so we've
had to switch to html output and a more forgiving parser.
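
As a rough illustration of what "a more forgiving parser" can look like,
the sketch below pulls plain text out of html that is not well-formed. The
message does not say which parser was actually used; Python's standard
html.parser is used here purely as a stand-in, and the class and function
names are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text content, tolerating unclosed or mismatched tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered at end of input
    return " ".join(parser.parts)
```

Unlike a strict XML parser, this does not raise an error on input such as
"<p>Hello <b>world</p><p>unclosed"; it simply returns the text it finds.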
