Solr links against the Tika JARs directly and runs Tika inside its own
JVM, and not in the most efficient way. The integration is there mostly
for convenience rather than performance.

So, for performance, the Solr recommendation is also to run Tika
separately and send Solr only the processed documents.
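
As a rough illustration of that split, here is a minimal Python sketch
(not Solr's or Tika's own tooling) that runs the Tika command-line app
in a separate JVM per file and posts only the extracted text to Solr's
JSON update endpoint. The jar path, the collection name "files", and
the field name "content_txt" are assumptions made up for this example;
adjust them to your setup.

```python
import json
import subprocess
import urllib.request

# Hypothetical values for illustration; adjust for your installation.
TIKA_APP_JAR = "tika-app.jar"
SOLR_UPDATE_URL = "http://localhost:8983/solr/files/update?commit=true"


def extract_text(path, timeout_s=60):
    """Run Tika in its own JVM; return the extracted text, or None on failure.

    A crash, hang, or out-of-memory error in Tika only kills the child
    process, not the crawler and not Solr.
    """
    try:
        result = subprocess.run(
            ["java", "-jar", TIKA_APP_JAR, "--text", path],
            capture_output=True, timeout=timeout_s, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    # The output encoding depends on the JVM defaults, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")


def build_update_body(path, text):
    """Build the JSON body for Solr: a list holding one document."""
    # "content_txt" is an assumed field name, not something Solr mandates.
    return json.dumps([{"id": path, "content_txt": text}]).encode("utf-8")


def index_file(path):
    """Extract with Tika, then send only the processed text to Solr."""
    text = extract_text(path)
    if text is None:
        return False  # skip the bad file; the crawler keeps running
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=build_update_body(path, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

Starting a JVM per file is slow, so for a steady stream of updates a
long-running Tika process or tika-batch (as suggested on the Tika list)
is likely a better fit. The point here is only the process boundary.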

Regards,
    Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
> Hi folks,
>
> I'm writing a file-system crawler that will index files.  The file system
> is going to be very busy and I anticipate on average 10 new updates per
> minute.  My application checks for new or updated files once every minute.
> I use Tika to extract the raw text from those files and send it over to
> Solr for indexing.  My application will be running 24x7xN-days.  It will
> not recycle unless the OS is restarted.
>
> Over at the Tika mailing list, I was told the following:
>
> "As a side note, if you are handling a bunch of files from the wild in a
> production environment, I encourage separating Tika into a separate jvm vs
> tying it into any post processing – consider tika-batch and writing
> separate text files for each file processed (not so efficient, but
> exceedingly robust).  If this is demo code or you know your document set
> well enough, you should be good to go with keeping Tika and your
> postprocessing steps in the same jvm."
>
> My question is, how does Solr utilize Tika?  Does it run Tika in its own
> JVM as an out-of-process application or does it link with Tika JARs
> directly?  If it links in directly, are there known issues with Solr
> integrated with Tika because of Tika issues?
>
> Thanks
>
> Steve
