I'm looking for a strategy to build a large distributed Lucene index, but
after reading a lot of mail threads, it seems that Lucene doesn't have a
'default solution' for that yet.

The search engine that I'm building has a 400 GB text database (right now;
it grows every day and documents are never deleted) and receives thousands
of queries per day (a very short response time is needed). I expect to use a
cluster of commodity computers as the hardware infrastructure, and the
distribution strategy must allow concurrent/parallel processing and
searching across all processors.

Searching the Lucene site, I thought Hadoop was the right choice. After
reading some mail threads about this (on the java lucene and hadoop lists,
both dev and user), it seems that Hadoop isn't that good (let's put it this
way) at dealing with distributed indexes for a search engine.

So I tried Solr and read about FederatedSearch and CollectionDistribution.
An 'all-machines-have-complete-index' strategy (using rsync) can improve
system throughput and concurrency, since each station processes different
queries, but each query will take the same amount of time as on a
single-node system (which sucks).
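
To make the trade-off concrete, here is a minimal sketch of the replicated
setup, assuming a simple round-robin dispatcher in front of the nodes. The
class name and node names are made up for illustration and are not part of
Solr. Different queries land on different nodes, so throughput scales, but
each query is still answered by one node searching the complete index.

```python
import itertools


class ReplicaDispatcher:
    """Hypothetical front end: every replica holds the complete index."""

    def __init__(self, replicas):
        # Cycle through the replicas forever, one query per step.
        self._cycle = itertools.cycle(list(replicas))

    def dispatch(self, query):
        # Pick the next replica in round-robin order. The chosen node
        # still has to search the whole index for this query.
        replica = next(self._cycle)
        return replica, query
```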

When each station of an N-station cluster indexes 1/N of the text
collection, each machine will spend less time processing a query, but all
machines must process the same query at the same time (a 'goodbye,
concurrency', IMO) and then merge the results.
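
A minimal sketch of this partitioned strategy, assuming each node can be
modelled as an in-memory dict of documents and scored with a toy term
count. The function names are hypothetical and this is not how Lucene or
Solr score documents. The same query goes to every slice in parallel, and
the per-slice top-k lists are merged into a global top-k.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, k):
    """Toy per-node search: score = occurrences of the query terms."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            hits.append((score, doc_id))
    # Each node only needs to return its own best k hits.
    return heapq.nlargest(k, hits)


def distributed_search(shards, query, k=3):
    """Send the same query to every slice, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc_id, score) for score, doc_id in merged]
```

Every node is busy with every query here, which is exactly the concurrency
cost described above; the gain is that each node scans a smaller index.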

Neither of them looks good to me.

The 'Multiple Masters' solution (described on FederatedSearch) looks good:
"There could be a master for each slice of the index. An external module
could provide the update interface and forward the request to the correct
master based on the unique key field". That's great: it has high
scalability, high concurrency, a smaller index on each node... I read some
papers with very similar solutions [1][2][3] that discuss how to solve some
of this strategy's issues.
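
A minimal sketch of the "external module" from the quote, assuming the
master is chosen by hashing the unique key modulo the number of masters.
The FederatedSearch page does not specify the mapping, so the hashing
scheme and the class names here are only assumptions for illustration.

```python
import zlib


class Master:
    """Hypothetical master holding one slice of the index."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def update(self, key, doc):
        # Adding the same key again overwrites the earlier version.
        self.docs[key] = doc


class UpdateRouter:
    """Forwards each update to the master that owns the document's key."""

    def __init__(self, masters):
        self.masters = list(masters)

    def master_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash().
        slot = zlib.crc32(key.encode("utf-8")) % len(self.masters)
        return self.masters[slot]

    def update(self, key, doc):
        master = self.master_for(key)
        master.update(key, doc)
        return master.name
```

With plain modulo hashing, changing the number of masters remaps most
keys, so a growing collection would need a smarter mapping than this one.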

Did I get anything wrong (about Hadoop and Solr)?

Is Multiple Masters/FederatedSearch under development? What is its status?
Or should I develop it myself?

Thanks for any help.
Daniel

[1] B. Ribeiro-Neto, J. Kitajima, G. Navarro and N. Ziviani. Parallel
Generation of Inverted Files for Distributed Text Collections. In
Proceedings of the XVIII International Conference of the Chilean Computer
Science Society. IEEE Computer Society, Washington, 1998.

[2] C. Badue, R. Baeza-Yates, B. Ribeiro-Neto and N. Ziviani. Distributed
Query Processing Using Partitioned Inverted Files. Eighth Symposium on
String Processing and Information Retrieval (SPIRE'01), 2001.

[3] C. Badue, R. Barbosa, B. Ribeiro-Neto and N. Ziviani. Basic Issues on
the Processing of Web Queries. In Proceedings of the 28th Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2005.
