On 4/25/2018 4:02 AM, Lee Carroll wrote:
    *We don't recommend using solr-cell for production indexing.*

Ok. Are the reasons for:

Performance. I think we have a rather modest indexing requirement (1000 documents a day...
on a busy day)

Security. The index workflow is, upload files to public facing server with
auth. Files written to disk, scanned and copied to internal server and
ingested into index via here.

  Are there other reasons we should worry about?

Tika is the underlying technology in solr-cell.  Tika is a separate Apache project designed for parsing common rich document formats, such as Microsoft Office files and PDF.

http://tika.apache.org/

The problems that can result are related to running Tika inside of Solr, which is what solr-cell does.

The Tika authors try very hard to make sure that Tika doesn't misbehave, but the very nature of what Tika does means it is somewhat prone to misbehaving.  Many of the file formats that Tika processes are undocumented, or whatever documentation exists is not available to open source developers.  Also, sometimes documents in those formats will be constructed in a way that the Tika authors have never seen before, or they may completely violate the conventions the authors DO know about.

Long story short: Tika can encounter documents that cause it to crash, consume all the memory in the system, or misbehave in other ways.  If Tika is running inside Solr, then when it has a problem, it can take Solr itself down with it.

For this reason, and because Tika can sometimes use a lot of resources even when it is working correctly, we recommend running it outside of Solr in another program that takes its output and sends it to Solr.  Ideally, it will be running on a completely different machine than Solr is running on.
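
As a rough illustration of that arrangement, here is a minimal sketch in Python using only the standard library.  It assumes a standalone Tika Server is listening on its default port 9998 and that Solr has a core named "docs" with a text field called "content_txt"; the host names, the core name, the field name, and the function names are all placeholders chosen for this example, not anything taken from your setup.

```python
import json
import urllib.request

# Assumed addresses: adjust to wherever Tika Server and Solr actually run.
TIKA_URL = "http://localhost:9998/tika"
# "docs" is a hypothetical core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request asking Tika Server for the plain text of a file."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that sends one extracted document to Solr as JSON."""
    # "content_txt" is a hypothetical field name; use one from your schema.
    payload = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path, doc_id, timeout=60):
    """Extract text with Tika outside Solr, then send the result to Solr.

    Returns True if Solr accepted the document, False if extraction failed.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        with urllib.request.urlopen(build_tika_request(data), timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # Covers timeouts, refused connections, and HTTP errors from Tika.
        # A bad document only costs us this one file; Solr is never involved.
        print(f"Tika failed on {path}: {exc}")
        return False
    with urllib.request.urlopen(build_solr_request(doc_id, text), timeout=timeout) as resp:
        return resp.status == 200
```

The point of the design is the try/except around the Tika call: if a document makes Tika hang or fall over, the indexing program logs it and moves on, and Solr only ever sees plain text that has already been extracted.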

Thanks,
Shawn
