If each document is VERY small, it's actually possible that one Solr server could handle it -- especially if you DON'T try to do faceting or other similar features, but stick to straight search and relevancy. There are other factors too. But # of documents is probably less important than total size of index, or number of unique terms -- of course # of documents often correlates with those too.

But if each document is largish... yeah, I suspect that'll be too much for any one Solr server. You'll have to use some kind of distribution. Out of the box, Solr has a Distributed Search feature meant for this use case: http://wiki.apache.org/solr/DistributedSearch . Some Solr features don't work under a distributed setup, but the basic ones are there. There are also some add-ons not (yet anyway) part of the Solr distro that try to solve this in even more sophisticated ways, like SolrCloud.
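For what it's worth, the out-of-the-box Distributed Search mentioned above is driven by the "shards" request parameter: you send an ordinary query to one node and list the other cores in "shards", and that node fans the query out and merges the results. A minimal sketch of building such a request URL (the hostnames are placeholders, not real servers):

```python
from urllib.parse import urlencode

# Hypothetical shard hosts -- substitute your own Solr instances.
SHARDS = [
    "solr1.example.com:8983/solr",
    "solr2.example.com:8983/solr",
]

def distributed_query_url(base_url, query, rows=10):
    """Build a Solr Distributed Search URL.

    The 'shards' parameter tells the node receiving the request
    which cores to fan the query out to before merging results.
    """
    params = urlencode({
        "q": query,
        "rows": rows,
        "shards": ",".join(SHARDS),
    })
    return f"{base_url}/select?{params}"

url = distributed_query_url("http://solr1.example.com:8983/solr",
                            "title:lucene")
```

Any node in the list can receive the request; the "shards" list just has to cover the full index, with each document living in exactly one shard.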

I don't personally know of anyone indexing that many documents, although it is probably done. But I do know of the HathiTrust project indexing fewer documents that still add up to terabytes of total index (millions to tens of millions of documents, but each one is a digitized book that could be 100-400 pages), using the Distributed Search feature successfully -- although it required some care and maintenance; it wasn't just a "turn it on and it works" situation.

http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-500000-volumes-5-million-volumes-and-beyond

http://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf

On 5/12/2011 1:06 PM, Darren Govoni wrote:
I have the same questions.

But from your message, I couldn't tell. Are you using Solr now? Or some
other indexing server?

Darren

On Thu, 2011-05-12 at 09:59 -0700, atreyu wrote:
Hi,

I have about 300 million docs (or 10TB data) which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are between two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this size?  I am
trying to avoid the heavy cost of GSA, etc...

Thanks.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
Sent from the Solr - User mailing list archive at Nabble.com.