Could indexing the English Wikipedia dump over and over get you there?

Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
>________________________________
> From: Memory Makers <memmakers...@gmail.com>
>To: solr-user@lucene.apache.org
>Sent: Tuesday, January 17, 2012 12:15 AM
>Subject: Re: Can Apache Solr Handle TeraByte Large Data
>
>I've been toying with the idea of setting up an experiment to index a large
>document set (1+ TB) -- any thoughts on an open data set that one could use
>for this purpose?
>
>Thanks.
>
>On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
>
>> Hello,
>>
>> Searching in real time sounds difficult with that amount of data. With large
>> documents, 3 million documents, and 5TB of data, the index will be very
>> large. With indexes that large, your performance will probably be I/O bound.
>>
>> Do you plan on allowing phrase or proximity searches? If so, your
>> performance will be even more I/O bound, as documents that large will have
>> huge position indexes that will need to be read into memory for processing
>> phrase queries. To reduce I/O you need as much of the index in memory as
>> possible (Lucene/Solr caches, and the operating system disk cache). Every
>> commit invalidates the Solr/Lucene caches (unless the newer NRT code has
>> solved this for Solr).
>>
>> If you index and serve on the same server, you are also going to get
>> terrible response times whenever your commits trigger a large merge.
>>
>> If you need to service 10-100 qps or more, you may need to look at putting
>> your index on SSDs or spreading it over enough machines so it can stay in
>> memory.
>>
>> What kind of response times are you looking for, and at what query rate?
>>
>> We have somewhat smaller documents. We have 10 million documents and about
>> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4
>> machines (i.e. 3 shards per machine). We get an average of around
>> 200-300ms response time, but our 95th percentile times are about 800ms and
>> 99th percentile times are around 2 seconds. This is with an average load of
>> less than 1 query/second.
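[For anyone wanting to try the layout Tom describes: a minimal sketch of how the 12-shards-over-4-machines arrangement maps onto Solr's distributed-search `shards` parameter (a comma-separated list of host:port/core entries). The host names and core names below are made up; only the arithmetic matches the thread.]

```python
# Sketch: build the 'shards' parameter for a distributed Solr query
# over 12 shards spread across 4 machines (3 shards per machine).
# Hostnames, the port, and the core naming scheme are hypothetical.

def build_shards_param(hosts, shards_per_host):
    """Return the value for Solr's distributed-search 'shards' parameter."""
    shards = []
    for host in hosts:
        for n in range(1, shards_per_host + 1):
            shards.append(f"{host}:8983/solr/books-shard{n}")
    return ",".join(shards)

hosts = [f"solr{i}.example.org" for i in range(1, 5)]  # 4 machines
param = build_shards_param(hosts, 3)                   # 3 shards each
print(param.count(",") + 1)  # 12 shard entries total
```

[You would pass this as `&shards=...` on the query request to any one of the cores, which then fans the query out and merges results.]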
>>
>> As Otis suggested, you may want to implement a strategy that allows users
>> to search within the large documents by breaking the documents up into
>> smaller units. What we do is have two Solr indexes. The first indexes
>> complete documents. When the user clicks on a result, we index the entire
>> document at the page level in a small Solr index on the fly. That way they
>> can search within the document and get page-level results.
>>
>> More details about our setup:
>> http://www.hathitrust.org/blogs/large-scale-search
>>
>> Tom Burton-West
>> University of Michigan Library
>> www.hathitrust.org
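[A minimal sketch of the on-the-fly page-level step Tom describes: when a user opens a document, split it into one Solr document per page for the small second index. The field names (`id`, `book_id`, `page`, `text`) and the example id are hypothetical, not HathiTrust's actual schema.]

```python
# Sketch: turn one large document into per-page Solr documents so the
# user can search within it. Field names and the id scheme are made up.

def pages_to_solr_docs(book_id, pages):
    """Return a list of page-level docs, one dict per page."""
    return [
        {"id": f"{book_id}-p{n}", "book_id": book_id, "page": n, "text": text}
        for n, text in enumerate(pages, start=1)
    ]

docs = pages_to_solr_docs("book-0001", ["first page text", "second page text"])
# Each dict can be POSTed as JSON to the small core's /update handler.
```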