Lucene (the major underlying technology in Solr) can handle any data, but it’s optimized to be an index, not a file store. It's better to keep the files themselves in another database or file system, such as Cassandra or S3, than in Solr.
In our experience, running the Tika binary or microservice as a pre-index process can improve the overall stability of the Solr service.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey <apa...@elyograg.org>, wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for production indexing.*
> >
> > Ok. Are the reasons for:
> >
> > Performance. I think we have rather modest index requirement (1000 a day...
> > on a busy day)
> >
> > Security. The index workflow is, upload files to public facing server with
> > auth. Files written to disk, scanned and copied to internal server and
> > ingested into index via here.
> >
> > other reasons we should worry about ?
>
> Tika is the underlying technology in solr-cell. Tika is a separate
> Apache product designed for parsing common rich-text formats, like
> Microsoft, PDF, etc.
>
> http://tika.apache.org/
>
> The problems that can result are related to running Tika inside of Solr,
> which is what solr-cell does.
>
> The Tika authors try very hard to make sure that Tika doesn't misbehave,
> but the very nature of what Tika does means it is somewhat prone to
> misbehaving. Many of the file formats that Tika processes are
> undocumented, or any documentation that is available is not available to
> open source developers. Also, sometimes documents in those formats will
> be constructed in a way that the Tika authors have never seen before, or
> they may completely violate what conventions the authors DO know about.
>
> Long story short -- Tika can encounter documents that can cause it to
> crash, or to consume all the memory in the system, or misbehave in other
> ways. If Tika is running inside Solr, then when it has a problem, Solr
> itself can blow up and have a problem too.
>
> For this reason, and because Tika can sometimes use a lot of resources
> even when it is working correctly, we recommend running it outside of
> Solr in another program that takes its output and sends it to Solr.
> Ideally, it will be running on a completely different machine than Solr
> is running on.
>
> Thanks,
> Shawn
>
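
For anyone who wants to see what "running Tika outside of Solr" might look like, here is a minimal sketch in Python. It is only an illustration under assumptions that are not from this thread: the jar path, the Solr URL, the collection name "docs", and the field name "content_txt" are placeholders, and the function names are made up for the example. Tika runs in a child JVM with a heap cap and a timeout, so a hang, crash, or out-of-memory error ends only that child process, and Solr only ever receives plain text.

```python
import json
import subprocess
import urllib.request

# Assumptions (placeholders, not from the thread): jar location, Solr URL,
# collection name "docs", and field name "content_txt".
TIKA_JAR = "tika-app.jar"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def extract_text(path, timeout_s=60, max_heap="512m", run=subprocess.run):
    """Run the Tika app in a child JVM; return extracted text, or None on failure."""
    cmd = ["java", f"-Xmx{max_heap}", "-jar", TIKA_JAR,
           "--text", "--encoding=UTF-8", path]
    try:
        result = run(cmd, capture_output=True, timeout=timeout_s, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # A hang, crash, or out-of-memory error affects the child process only.
        return None
    return result.stdout.decode("utf-8", errors="replace")


def build_update_request(doc_id, text, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for one document."""
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_file(doc_id, path, extract=extract_text, send=urllib.request.urlopen):
    """Extract outside Solr, then post the text to Solr. Returns True if sent."""
    text = extract(path)
    if text is None:
        return False  # skip the problem document; Solr never sees it
    with send(build_update_request(doc_id, text)):
        pass
    return True
```

Calling index_file("report-1", "/data/incoming/report.pdf") would extract and index one file. A document that Tika cannot handle is skipped and can be logged for review, and the same script can run on a different machine from Solr, as suggested above.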