Thanks a lot, Ken, for your input.

Regards,
Sourav

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 08, 2008 12:41 PM
To: solr-user@lucene.apache.org
Subject: RE: Limitations of Distributed Search ....

>Any inputs on this would be really helpful. Looking for 
>suggestions/viewpoints from you guys.

One area where you might have issues is with date range queries. If 
you have many docs, then you can run into OOM errors. There was a 
recent thread about this, where Yonik (and others) had some good 
suggestions for ways to avoid this problem.

I don't know what the impact would be of merging results that use 
date ranges - I'm guessing low, but Yonik would know best.
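For concreteness, here is a minimal sketch of what a distributed request with a date range could look like. The `shards` and `fq` request parameters are Solr's own. The host names, the `timestamp` and `txn_type` fields, and the helper function are made up for illustration.

```python
from urllib.parse import urlencode

def distributed_query_url(coordinator, shards, query, date_field, start, end, rows=10):
    """Build a Solr select URL that fans out to the given shards."""
    params = {
        "q": query,
        # Filter query restricting results to a date range (inclusive bounds).
        "fq": f"{date_field}:[{start} TO {end}]",
        # Comma-separated shard addresses, written without the http:// scheme.
        "shards": ",".join(shards),
        "rows": rows,
    }
    return f"http://{coordinator}/select?{urlencode(params)}"

# Hypothetical layout: one box per week of logs.
shards = [f"logs-week{n:02d}:8983/solr" for n in range(1, 4)]
url = distributed_query_url(
    "logs-week01:8983/solr", shards, "txn_type:payment",
    "timestamp", "2008-01-01T00:00:00Z", "2008-06-30T23:59:59Z",
)
print(url)
```

The box that receives the request sends the query to every address in `shards` and merges the responses, so the date filter is evaluated on each shard before anything is merged.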

As to how well a 50-server configuration would work...that would be 
about 200M docs/server, which is a large number even if the data/doc 
is small (1K). But the real performance is going to be heavily 
impacted by the nature of the data and the types of queries.
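The per-server figure is simple division, sketched below. The 10 billion document count comes from the original question; the 5.5 TB total is an assumed midpoint of the quoted 5-6 TB, and it is raw data size, not index size.

```python
def per_server_load(total_docs, total_bytes, servers):
    """Return (docs per server, raw bytes per server, average bytes per doc)."""
    docs_per_server = total_docs // servers
    bytes_per_server = total_bytes / servers
    bytes_per_doc = total_bytes / total_docs
    return docs_per_server, bytes_per_server, bytes_per_doc

TOTAL_DOCS = 10 * 10**9            # ~10 billion log documents
TOTAL_BYTES = 5_500_000_000_000    # assumed midpoint of 5-6 TB

docs, size, avg = per_server_load(TOTAL_DOCS, TOTAL_BYTES, 50)
print(f"{docs:,} docs/server, {size / 10**9:.0f} GB/server, {avg:.0f} bytes/doc")
```

With these inputs each of 50 servers holds 200M docs and about 110 GB of raw data, at roughly 550 bytes per doc.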

You'll also need to think about how you distribute the data to avoid 
skew, as performance is constrained by the worst case of any of the 
searchers. With Nutch we wound up having to add termination logic to 
avoid having long-running queries clog things up, primarily when 
dealing with load.
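As a rough illustration of that kind of termination logic (not the actual Nutch code), the sketch below fans a query out to simulated shards and stops waiting after a deadline. The shard names, latencies, and the `fan_out` and `fake_shard` helpers are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fan_out(shard_calls, timeout_s):
    """Run all shard searches in parallel and stop waiting after timeout_s.

    shard_calls maps a shard name to a zero-argument callable (a stand-in
    for the real HTTP request). Returns (results, missing).
    """
    pool = ThreadPoolExecutor(max_workers=len(shard_calls))
    futures = {pool.submit(call): name for name, call in shard_calls.items()}
    done, _ = wait(futures, timeout=timeout_s)
    results = {futures[f]: f.result() for f in done if f.exception() is None}
    missing = sorted(name for name in shard_calls if name not in results)
    # Do not block on stragglers. Their threads still run to completion, so a
    # real searcher would also need a server-side limit to free resources.
    pool.shutdown(wait=False)
    return results, missing

def fake_shard(latency_s, hits):
    """Simulated shard: sleeps for latency_s, then returns a hit count."""
    def call():
        time.sleep(latency_s)
        return hits
    return call

calls = {
    "week01": fake_shard(0.01, 120),
    "week02": fake_shard(0.02, 95),
    "week03": fake_shard(0.50, 300),  # skewed shard: much slower than the rest
}
results, missing = fan_out(calls, timeout_s=0.2)
print(results, missing)
```

Without the deadline, the whole request would take as long as the slowest shard, which is why skew matters so much.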

Best first step is to create a single Solr with representative data, 
and see how well that performs. My guess is your issues are going to 
be more around the limits of one box with 200M docs, versus the 
distributed nature. Though keeping 50-60 servers alive and happy is a 
significant ops task in itself.

Finally, you'd want to decide early on whether this is search or 
query. In other words, is it OK if a result set happens to be missing 
a doc, because that server is down or timed out. If it's not, then 
you're looking more at a query-type solution, where Solr would be 
less interesting.
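The difference between the two modes can be sketched as a policy on the merge step. This is an illustrative sketch only, not how Solr merges results; the `merge_top_n` function, the `IncompleteResults` exception, and the sample data are hypothetical.

```python
import heapq

class IncompleteResults(Exception):
    """Raised in strict mode when at least one shard did not answer."""

def merge_top_n(shard_hits, n, require_complete):
    """Merge per-shard hits into a global top-n by score.

    shard_hits maps shard name -> list of (score, doc_id), or None if the
    shard was down or timed out.
    """
    missing = sorted(name for name, hits in shard_hits.items() if hits is None)
    if missing and require_complete:
        # "Query" semantics: a partial answer is treated as a wrong answer.
        raise IncompleteResults(f"no answer from: {', '.join(missing)}")
    # "Search" semantics: merge whatever arrived and report what is missing.
    all_hits = [hit for hits in shard_hits.values() if hits for hit in hits]
    return heapq.nlargest(n, all_hits), missing

shard_hits = {
    "week01": [(0.9, "a"), (0.4, "b")],
    "week02": None,                      # server down or timed out
    "week03": [(0.7, "c")],
}
top, missing = merge_top_n(shard_hits, n=2, require_complete=False)
print(top, missing)
```

Calling the same function with `require_complete=True` raises instead of returning a partial list, which is the behaviour a count-style report (such as the number of a given txn over six months) would need.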

-- Ken


>-----Original Message-----
>From: souravm
>Sent: Saturday, December 06, 2008 9:41 PM
>To: solr-user@lucene.apache.org
>Subject: Limitations of Distributed Search ....
>
>Hi,
>
>We are planning to use Solr to process a large volume of 
>application log files (around 10 billion documents, 5-6 TB in 
>total).
>
>One of the approaches we are considering is to use 
>Distributed Search extensively.
>
>What we have in mind is distributing the log files across multiple 
>boxes on a monthly or weekly basis. Even on a weekly basis, the 
>volume can reach 200M documents. A search query can span all weeks 
>(e.g. the number of a given txn for the first 6 months of a year).
>
>However, we are not sure how well distributed search would scale 
>when we use around 50-60 boxes to distribute the indexed documents 
>on a weekly basis. The specific questions I have in mind are:
>
>a) What is the impact on performance when a query spreads over 50 
>boxes?
>b) Is there any hard limit on the number of slaves that can be 
>contacted from the master server?
>c) How much load will this approach create on the master server for 
>merging data and keeping track of whether a slave is down?
>d) Are there any other manageability issues with so many slaves?
>
>If any of you have deployed Solr in such an environment, it would 
>be great if you could share your experience.
>
>Thanks in advance.
>
>Regards,
>Sourav


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
