If each document is VERY small, it's actually possible that one Solr
server could handle it -- especially if you DON'T try to do faceting or
other similar features, but stick to straight search and relevancy.
There are other factors too, but the number of documents is probably
less important than the total size of the index or the number of unique
terms -- of course, the number of documents often correlates with those too.
But if each document is largish... yeah, I suspect that'll be too much
for any one Solr server; you'll have to use some kind of distribution.
Out of the box, Solr has a Distributed Search feature meant for this
use case: http://wiki.apache.org/solr/DistributedSearch . Some Solr
features don't work under a distributed setup, but the basic ones are
there. There are also other add-ons, not (yet, anyway) part of the Solr
distro, that try to solve this in even more sophisticated ways, like SolrCloud.
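To give a rough idea of how Distributed Search is invoked: the client sends a single query with a "shards" parameter listing the servers to fan out to, and Solr merges the partial results. The host names and core paths below are hypothetical; this is just a sketch of building such a request URL:

```python
# Sketch of a Solr Distributed Search request (hypothetical hosts/cores):
# one query carries a "shards" parameter naming every shard server, and
# the coordinating node merges and re-sorts the partial results.
from urllib.parse import urlencode

shards = [
    "solr1.example.com:8983/solr",  # shard 1 (hypothetical host)
    "solr2.example.com:8983/solr",  # shard 2 (hypothetical host)
]
params = {
    "q": "title:lucene",         # the actual user query
    "shards": ",".join(shards),  # fan the query out to all shards
}
url = "http://solr1.example.com:8983/solr/select?" + urlencode(params)
print(url)
```

Each shard is expected to hold a disjoint slice of the documents; the node you send the query to coordinates the sub-queries and merges the results.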
I don't personally know of anyone indexing that many documents, although
it is probably done. But I do know the HathiTrust project (not me
personally) is indexing fewer documents that still add up to terabytes
of total index (millions to tens of millions of documents, but each one
is a digitized book that could be 100-400 pages), using the Distributed
Search feature, successfully -- although it required some care and
maintenance; it wasn't just a "turn it on and it works" situation.
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-500000-volumes-5-million-volumes-and-beyond
http://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf
On 5/12/2011 1:06 PM, Darren Govoni wrote:
I have the same questions.
But from your message, I couldn't tell. Are you using Solr now? Or some
other indexing server?
Darren
On Thu, 2011-05-12 at 09:59 -0700, atreyu wrote:
Hi,
I have about 300 million docs (or 10TB data) which is doubling every 3
years, give or take. The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files. There are between two and
four dozen concurrent users, typically. The indexing server has > 27 GB
of RAM, but it still gets extremely taxed, and this will only get worse.
Would Solr be able to efficiently deal with a load of this size? I am
trying to avoid the heavy cost of GSA, etc...
Thanks.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
Sent from the Solr - User mailing list archive at Nabble.com.