Could indexing the English Wikipedia dump over and over get you there?

Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
>________________________________
> From: Memory Makers <memmakers...@gmail.com>
>To: solr-user@lucene.apache.org
>Sent: Tuesday, January 17, 2012 12:15 AM
>Subject: Re: Can Apache Solr Handle TeraByte Large Data
>
>I've been toying with the idea of setting up an experiment to index a large
>document set (1+ TB) -- any thoughts on an open data set that one could use
>for this purpose?
>
>Thanks.
>
>On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
>
>> Hello,
>>
>> Searching in real time sounds difficult with that amount of data. With large
>> documents, 3 million documents, and 5TB of data, the index will be very
>> large. With indexes that large, your performance will probably be I/O bound.
>>
>> Do you plan on allowing phrase or proximity searches? If so, your
>> performance will be even more I/O bound, as documents that large will have
>> huge position indexes that will need to be read into memory for processing
>> phrase queries. To reduce I/O you need as much of the index in memory as
>> possible (Lucene/Solr caches, and the operating system disk cache). Every
>> commit invalidates the Solr/Lucene caches (unless the newer NRT code has
>> solved this for Solr).
>>
>> If you index and serve on the same server, you are also going to get
>> terrible response times whenever your commits trigger a large merge.
>>
>> If you need to service 10-100 qps or more, you may need to look at putting
>> your index on SSDs or spreading it over enough machines so it can stay in
>> memory.
>>
>> What kind of response times are you looking for, and at what query rate?
>>
>> We have somewhat smaller documents. We have 10 million documents and about
>> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4
>> machines (i.e. 3 shards per machine). We get an average of around
>> 200-300ms response time, but our 95th percentile times are about 800ms and
>> 99th percentile times are around 2 seconds. This is with an average load of
>> less than 1 query/second.
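[For anyone wanting to try the layout Tom describes: a minimal sketch of how the 12-shards-over-4-machines arrangement maps onto Solr's distributed-search `shards` parameter (a comma-separated list of host:port/core entries). The host names and core names below are made up; only the arithmetic matches the thread.]

```python
# Sketch: build the 'shards' parameter for a distributed Solr query
# over 12 shards spread across 4 machines (3 shards per machine).
# Hostnames, the port, and the core naming scheme are hypothetical.

def build_shards_param(hosts, shards_per_host):
    """Return the value for Solr's distributed-search 'shards' parameter."""
    shards = []
    for host in hosts:
        for n in range(1, shards_per_host + 1):
            shards.append(f"{host}:8983/solr/books-shard{n}")
    return ",".join(shards)

hosts = [f"solr{i}.example.org" for i in range(1, 5)]  # 4 machines
param = build_shards_param(hosts, 3)                   # 3 shards each
print(param.count(",") + 1)  # 12 shard entries total
```

[You would pass this as `&shards=...` on the query request to any one of the cores, which then fans the query out and merges results.]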
>>
>> As Otis suggested, you may want to implement a strategy that allows users
>> to search within the large documents by breaking the documents up into
>> smaller units. What we do is have two Solr indexes. The first indexes
>> complete documents. When the user clicks on a result, we index the entire
>> document at the page level in a small Solr index on the fly. That way they
>> can search within the document and get page-level results.
>>
>> More details about our setup:
>> http://www.hathitrust.org/blogs/large-scale-search
>>
>> Tom Burton-West
>> University of Michigan Library
>> www.hathitrust.org
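[A minimal sketch of the on-the-fly page-level step Tom describes: when a user opens a document, split it into one Solr document per page for the small second index. The field names (`id`, `book_id`, `page`, `text`) and the example id are hypothetical, not HathiTrust's actual schema.]

```python
# Sketch: turn one large document into per-page Solr documents so the
# user can search within it. Field names and the id scheme are made up.

def pages_to_solr_docs(book_id, pages):
    """Return a list of page-level docs, one dict per page."""
    return [
        {"id": f"{book_id}-p{n}", "book_id": book_id, "page": n, "text": text}
        for n, text in enumerate(pages, start=1)
    ]

docs = pages_to_solr_docs("book-0001", ["first page text", "second page text"])
# Each dict can be POSTed as JSON to the small core's /update handler.
```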