RE: what s the optimum size of SOLR indexes

2011-07-05 Thread Burton-West, Tom
Hello,

On Mon, 2011-07-04 at 13:51 +0200, Jame Vaalet wrote:
> What would be the maximum size of a single SOLR index file for resulting in
> optimum search time ?

How do you define optimum? Do you want the fastest possible response time
at any cost, or do you have a specific response time goal?

Can you give us more details on your use case?   What kind of load are you 
expecting?  What kind of queries do you need to support?
Some of the trade-offs depend on whether you are CPU bound or I/O bound.

Assuming a fairly large index, if you *absolutely need* the fastest possible 
search response time and you can *afford the hardware*, you probably want to 
shard your index and size your indexes so they can all fit in memory (and do 
some work to make sure the index data is always in memory).  If you can't 
afford that much memory, but still need very fast response times, you might 
want to size your indexes so they all fit on SSDs.  As an example of a use 
case on the opposite side of the spectrum, here at HathiTrust, we have a very 
low number of queries per second and we are running an index that totals 6 TB 
in size with shards of about 500GB and average response times of 200ms (but 
99th percentile times of about 2 seconds).
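
As a rough illustration of the sizing arithmetic above, here is a minimal Python sketch. The function name and the 64 GB RAM figure are hypothetical; the 6 TB and 500 GB numbers are the HathiTrust figures quoted above, treated as decimal units.

```python
import math


def shards_needed(total_index_gb: float, shard_size_gb: float) -> int:
    """Return how many shards are needed so no shard exceeds shard_size_gb."""
    if shard_size_gb <= 0:
        raise ValueError("shard_size_gb must be positive")
    return math.ceil(total_index_gb / shard_size_gb)


# Layout described in the message: 6 TB total, about 500 GB per shard.
disk_based_shards = shards_needed(6000, 500)

# Hypothetical in-memory layout: the same index split so each shard fits in
# 64 GB of RAM reserved for index data (an assumed figure, not from the thread).
in_memory_shards = shards_needed(6000, 64)

print(disk_based_shards, in_memory_shards)
```

The gap between the two shard counts is the hardware cost of the "everything in memory" option.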

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



Re: what s the optimum size of SOLR indexes

2011-07-05 Thread arian487
It depends on how many queries you'd be making per second.  For us, I
have a gradient of index sizes.  The first machine, which gets hit most
often, holds an index of about 2.5 GB.  Most queries only ever need to hit
this index, but I also have bigger indexes of about 5-10 GB each. They
are slower, but they don't get queried as often, so I can afford the
larger size.
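
One way such a gradient could be queried is sketched below in Python. This is an assumption for illustration only: the message does not say how queries are routed between the indexes, and `tiered_search` and the stand-in search functions are hypothetical names, not Solr APIs.

```python
from typing import Callable, List, Sequence

# A search function takes a query string and returns matching document ids.
SearchFn = Callable[[str], List[str]]


def tiered_search(query: str, tiers: Sequence[SearchFn], min_hits: int = 1) -> List[str]:
    """Query tiers in order (smallest and fastest first).

    Stops at the first tier that returns at least min_hits results, so the
    larger, slower indexes are only touched when the small one falls short.
    """
    results: List[str] = []
    for search in tiers:
        results = search(query)
        if len(results) >= min_hits:
            break
    return results


# Stand-ins for real index clients: a small "hot" index and a larger one.
def hot_index(query: str) -> List[str]:
    return ["doc1"] if query == "recent" else []


def big_index(query: str) -> List[str]:
    return ["doc9"]


print(tiered_search("recent", [hot_index, big_index]))
print(tiered_search("older", [hot_index, big_index]))
```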



what s the optimum size of SOLR indexes

2011-07-04 Thread Jame Vaalet
Hi,

What would be the maximum size of a single SOLR index file that still results in
optimum search time?
If I have to index all the documents in my repository (which is in the TB
range), what would be the ideal architecture to follow: distributed SOLR?

Regards,
JAME VAALET
Software Developer
EXT :8108
Capital IQ



Re: what s the optimum size of SOLR indexes

2011-07-04 Thread Mohammad Shariq
There are solutions for indexing huge amounts of data, e.g. SolrCloud,
ZooKeeperIntegration, MultiCore, MultiShard.
Depending on your requirements, you can choose one or another.
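
For the multi-shard option, Solr's distributed search takes a `shards` request parameter listing the shards to query. The Python sketch below only builds such a request URL; the host names and the helper function are hypothetical, and the exact handler path may differ in your setup.

```python
from typing import Sequence
from urllib.parse import urlencode


def distributed_select_url(coordinator: str, shards: Sequence[str],
                           query: str, rows: int = 10) -> str:
    """Build a select URL that asks one Solr node to fan out to all shards."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated list of host:port/path entries, one per shard.
        "shards": ",".join(shards),
    }
    # Keep ':', '/' and ',' unescaped so the shard list stays readable.
    return f"http://{coordinator}/select?{urlencode(params, safe=':/,')}"


# Hypothetical hosts; replace with your own shard addresses.
shard_hosts = ["solr1.example.com:8983/solr", "solr2.example.com:8983/solr"]
url = distributed_select_url(shard_hosts[0], shard_hosts, "solr index")
print(url)
```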


-- 
Thanks and Regards
Mohammad Shariq


Re: what s the optimum size of SOLR indexes

2011-07-04 Thread Toke Eskildsen
On Mon, 2011-07-04 at 13:51 +0200, Jame Vaalet wrote:
> What would be the maximum size of a single SOLR index file for resulting in
> optimum search time ?

There is no clear answer. It depends on the number of (unique) terms,
number of documents, bytes on storage, storage speed, query complexity,
faceting, number of concurrent users and a lot of other factors.

> In case I have got to index all the documents in my repository  (which is in
> TB size) what would be the ideal architecture to follow , distributed SOLR ?

A TB in source documents might very well end up as a simple, single
machine index of 100GB or less. It depends on the amount of search
relevant information in the documents, rather than their size in bytes.

If your sources are Word documents or a similar format with a relatively
large amount of stuffing, and your searches are mostly simple (the user
enters 2-5 words and hits enter), my guess is that you don't need to
worry about distribution yet.

Make a pilot. Most of the work you'll have to do for a single machine
test can be reused for a distributed production setup.
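
A pilot also gives you numbers to extrapolate from. The Python sketch below is a minimal illustration that assumes index size grows linearly with source size, which is only a rough approximation. The function name and the sample figures are hypothetical, not measurements.

```python
def estimate_index_gb(sample_source_gb: float, sample_index_gb: float,
                      total_source_gb: float) -> float:
    """Extrapolate the full index size from a pilot run on a sample.

    Assumes the index-to-source ratio seen in the sample holds for the
    whole repository (a simplification; real growth is rarely linear).
    """
    if sample_source_gb <= 0:
        raise ValueError("sample_source_gb must be positive")
    return total_source_gb * sample_index_gb / sample_source_gb


# Hypothetical pilot: 10 GB of source documents produced a 1 GB index.
# Extrapolated to a 1000 GB (1 TB) repository:
estimated_gb = estimate_index_gb(10, 1, 1000)
print(f"Estimated full index size: {estimated_gb:.0f} GB")
```

If the estimate lands well within what a single machine can serve, the pilot setup can simply be scaled up before any sharding is considered.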