I need some recommendations for a new SOLR project.

We currently have a large (200M docs) production system using Lucene.Net and 
what I would call our own .NET implementation of SOLR (built early on when SOLR 
was less mature and did not run as well on Windows).  

Our current architecture works like this:

We have 2 master updater servers, which build Lucene indexes, and they both are 
live and reading new documents to index from a shared queue (RabbitMQ).  Unlike 
SOLR we don't POST new documents to the servers via HTTP, instead we let the 
servers read from the shared queue and index documents as fast as they can.  
This gives us scale and failover support, and basic natural load balancing.  
Each updater makes a new index snapshot on some configurable interval 
(typically 1 minute).  During index snapshot, the indexing thread is blocked so 
that new documents are not read from queue during that time.  A new snapshot 
uses NTFS hard links to replicate the master index into a new directory on the 
master server.

We have 4 searcher servers, which each pull snapshots from both updaters and 
then search both indexes in parallel and merge results.  Each searcher server 
is identical (they search same data), and sits behind a load balancer.  
Searchers have a background thread which continuously looks in remote 
directories on master servers looking for new snapshot directories.  When it 
sees a new snapshot exists, it uses NTFS hard links to replicate the current 
local snapshot and pull down only files that changed (very similar to SOLR 
collection replication).  Then switches searches over to new snapshot(s).

From the search client perspective, they simply issue HTTP GET request to the 
load balancer, and have no idea how many masters/shards that exist.  Results 
are merged/resorted/de-duped from all indexes transparently on the searchers.

My big question is, how can I very closely replicate this architecture with the 
latest version of SOLR?  We don't need to replace this system but to implement 
a similar system for another client using SOLR.  We really like our existing 
system architecture because it provides very natural load balancing and 
sharding on the masters and provides nice failover support for both masters and 
searchers.  We'd like to avoid search clients knowing about shards, and to 
avoid explicitly posting HTTP requests to SOLR servers when adding documents, 
because we like the more natural failover and load balancing of using a shared 
queue.

More specific questions:

How can SOLR in master/updater mode be configured to read new documents from a 
queue?  Is this possible with a custom Data Import Handler, or do I need to 
develop some seperate service which reads from a queue and then POSTS via HTTP 
to SOLR?

What is the best way to configure searchers to be able to merge results from 
more than one index?  Is it possible to configure which shards to search by 
default, rather than forcing the client to know about shards and specify shards 
in search request?  Can I do this with a custom request handler, and if so 
specifically how?

Thanks a lot!

Bob

Reply via email to