If each document is VERY small, it's actually possible that one Solr server could handle it -- especially if you DON'T try to do faceting or other similar features, but stick to straight search and relevancy. There are other factors too. But # of documents is probably less important than total size of index, or number of unique terms -- of course # of documents often correlates with those too.

But if each document is largish... yeah, I suspect that'll be too much for any one Solr server. You'll have to use some kind of distribution. Out of the box, Solr has a Distributed Search feature meant for this use case: http://wiki.apache.org/solr/DistributedSearch . Some Solr features don't work under a distributed setup, but the basic ones are there. There are also some add-ons not (yet anyway) part of the Solr distro that try to solve this in even more sophisticated ways, like SolrCloud.
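For what it's worth, the out-of-the-box Distributed Search mentioned above is driven by the "shards" request parameter: you send an ordinary query to one node and list the other cores in "shards", and that node fans the query out and merges the results. A minimal sketch of building such a request URL (the hostnames are placeholders, not real servers):

```python
from urllib.parse import urlencode

# Hypothetical shard hosts -- substitute your own Solr instances.
SHARDS = [
    "solr1.example.com:8983/solr",
    "solr2.example.com:8983/solr",
]

def distributed_query_url(base_url, query, rows=10):
    """Build a Solr Distributed Search URL.

    The 'shards' parameter tells the node receiving the request
    which cores to fan the query out to before merging results.
    """
    params = urlencode({
        "q": query,
        "rows": rows,
        "shards": ",".join(SHARDS),
    })
    return f"{base_url}/select?{params}"

url = distributed_query_url("http://solr1.example.com:8983/solr",
                            "title:lucene")
```

Any node in the list can receive the request; the "shards" list just has to cover the full index, with each document living in exactly one shard.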

I don't personally know of anyone indexing that many documents, although it is probably done. But I do know of the HathiTrust project indexing fewer documents that still add up to terabytes of total index (millions to tens of millions of documents, but each one is a digitized book that could be 100-400 pages), using the Distributed Search feature successfully -- although it required some care and maintenance; it wasn't just a "turn it on and it works" situation.

http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-500000-volumes-5-million-volumes-and-beyond

http://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf

On 5/12/2011 1:06 PM, Darren Govoni wrote:
I have the same questions.

But from your message, I couldn't tell. Are you using Solr now? Or some
other indexing server?

Darren

On Thu, 2011-05-12 at 09:59 -0700, atreyu wrote:
Hi,

I have about 300 million docs (or 10TB data) which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are between two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this size?  I am
trying to avoid the heavy cost of GSA, etc...

Thanks.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
Sent from the Solr - User mailing list archive at Nabble.com.