Lucene (the main underlying technology in Solr) can handle any data, but it is
optimized to be an index, not a file store. It is better to keep the original
files in another database or file system, such as Cassandra or S3, than in Solr.

In our experience, running Tika as a standalone binary or microservice in a
pre-index step can improve the overall stability of the Solr service.
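
A rough sketch of what such a pre-index step could look like, using only the
Python standard library. The host names, ports, the core name "docs", and the
field name "content_txt" are assumptions for illustration, not details of our
setup; it also assumes a tika-server instance is running separately from Solr.

```python
import json
import urllib.request

# Assumed addresses: tika-server on its default port, Solr with a core named "docs".
TIKA_URL = "http://localhost:9998/tika"
SOLR_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """PUT the raw file bytes to tika-server and ask for plain text back."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id: str, text: str, solr_url: str = SOLR_URL) -> urllib.request.Request:
    """POST one document as a JSON array to Solr's update handler.

    "content_txt" is a hypothetical field name; adjust it to your schema.
    """
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )


def index_file(path: str, doc_id: str, timeout: float = 60.0) -> None:
    """Extract text with Tika outside of Solr, then send only the text to Solr."""
    with open(path, "rb") as fh:
        data = fh.read()
    # If Tika crashes or hangs on a bad document, only this step fails; Solr is untouched.
    with urllib.request.urlopen(build_tika_request(data), timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_solr_request(doc_id, text), timeout=timeout) as resp:
        resp.read()
```

Calling index_file("report.pdf", "report-1") would then index the extracted
text only; the original file itself would stay in whatever store you keep it in.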


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey <apa...@elyograg.org>, wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for production indexing.*
> >
> > Ok. Are the reasons for:
> >
> > Performance. I think we have rather modest index requirement (1000 a day...
> > on a busy day)
> >
> > Security. The index workflow is, upload files to public facing server with
> > auth. Files written to disk, scanned and copied to internal server and
> > ingested into index via here.
> >
> > other reasons we should worry about ?
>
> Tika is the underlying technology in solr-cell.  Tika is a separate
> Apache product designed for parsing common rich-text formats, like
> Microsoft, PDF, etc.
>
> http://tika.apache.org/
>
> The problems that can result are related to running Tika inside of Solr,
> which is what solr-cell does.
>
> The Tika authors try very hard to make sure that Tika doesn't misbehave,
> but the very nature of what Tika does means it is somewhat prone to
> misbehaving.  Many of the file formats that Tika processes are
> undocumented, or any documentation that is available is not available to
> open source developers.  Also, sometimes documents in those formats will
> be constructed in a way that the Tika authors have never seen before, or
> they may completely violate what conventions the authors DO know about.
>
> Long story short -- Tika can encounter documents that can cause it to
> crash, or to consume all the memory in the system, or misbehave in other
> ways.  If Tika is running inside Solr, then when it has a problem, Solr
> itself can blow up and have a problem too.
>
> For this reason, and because Tika can sometimes use a lot of resources
> even when it is working correctly, we recommend running it outside of
> Solr in another program that takes its output and sends it to Solr.
> Ideally, it will be running on a completely different machine than Solr
> is running on.
>
> Thanks,
> Shawn
>
