Hi All,

Our group needs to index and search several million documents (TREC)
in real time, with millisecond response times. For the moment we
prefer scale-out (throwing more commodity machines at the problem)
over scale-up (faster disks, more RAM). This is in turn inspired by
the "Scale-out vs. Scale-up" paper (mail me if you want a copy), in
which it was shown that this kind of distribution scales better and
is more resilient.

So, are there any resources available (wiki, tutorials, slides,
README, etc.) that shed light on running Solr in a multi-machine
scenario and guide newbies through it? I have gone through the
mailing lists and the site but could not really find any answers or
hands-on material. An ad hoc guideline to get things working with 2
machines might be enough, but for the sake of thinking out loud and
soliciting responses from the list, here are my questions:

1) Solr has to handle a fairly large index that has to be split
across multiple disks (using Multicore?)
- Space is not a problem since we could use NFS, but that is not
recommended, as we would only exploit 1 processor
2) Solr has to handle a large collective index that has to be split
across multiple machines
- The index is ever-increasing (TB scale) and dynamic, and all of it
has to be searchable at any point
3) Solr has to exploit multiple machines because we have plenty of
them in a tightly coupled P2P scenario
- Machines are not a problem, but will they be if they have varied
configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE
1.1 to 1.6)?
4) Solr has to distribute load across several machines
- The index(es) could be shared, though, say by using a distributed
filesystem (Hadoop?)

In each of the above cases (we might use all of these strategies for
various use cases) the application should use Solr strictly as a
backend and a named service (IP or host:port) so that we can expose
this application (and the service) to the web or an intranet. Machine
failures should be tolerated too. Also, does Solr manage load
balancing out of the box when it is configured to run on multiple
machines?
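
To make the failover requirement a bit more concrete, here is a rough
Python sketch of the client-side behaviour I have in mind. It is only
an illustration under assumptions: the host names are made up, I am
assuming each instance answers queries under /solr/select, and
"fetch" is a hypothetical stand-in for whatever HTTP client we end up
using.

```python
from typing import Callable, Sequence
from urllib.parse import urlencode


def build_select_url(host_port: str, query: str) -> str:
    """Build a query URL for one backend, addressed as host:port.

    The /solr/select path is an assumption about how the instances
    are deployed; adjust it to the actual setup.
    """
    return f"http://{host_port}/solr/select?{urlencode({'q': query})}"


def query_with_failover(
    hosts: Sequence[str],
    query: str,
    fetch: Callable[[str], str],
) -> str:
    """Try each backend in order and return the first successful response.

    `fetch` is a hypothetical callable that takes a URL and returns the
    response body, raising OSError (or a subclass) on network failure.
    """
    errors = []
    for host in hosts:
        try:
            return fetch(build_select_url(host, query))
        except OSError as exc:
            # This machine is down or unreachable; remember why and move on.
            errors.append(f"{host}: {exc}")
    raise ConnectionError("all backends failed: " + "; ".join(errors))
```

This only covers a client tolerating a dead machine. It says nothing
about splitting the index or balancing load, which is what I am
really asking about.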

Maybe it is a superfluous question, but are Solr and/or Nutch the only
ways to use Lucene in a multi-machine environment? Or is there some
hidden document/project somewhere that makes it possible by exposing a
regular Lucene process over the network using RMI or something
similar? It is my understanding (I could be wrong) that Nutch and, to
some extent, Solr do not perform well when there is a lot of indexing
activity in parallel with search. We also have batch processing
workloads, and perhaps we can use Nutch/Solr for those. Even so, we
need multi-machine directions.

I am sure that a multi-machine setup opens up many other approaches
that might serve the goal better and that others have practical
experience with. So any advice and tips are also very welcome. We
intend to document things and do some benchmarking along the way, in
the open spirit.

Really sorry for the length but I hope some answers are forthcoming.

Cheers,
Srikant
