I need some recommendations for a new SOLR project. We currently have a large (200M docs) production system using Lucene.Net and what I would call our own .NET implementation of SOLR (built early on when SOLR was less mature and did not run as well on Windows).
Our current architecture works like this: we have 2 master updater servers, which build Lucene indexes. Both are live and read new documents to index from a shared queue (RabbitMQ). Unlike SOLR, we don't POST new documents to the servers via HTTP; instead, each server reads from the shared queue and indexes documents as fast as it can. This gives us scale, failover support, and basic natural load balancing.

Each updater makes a new index snapshot on a configurable interval (typically 1 minute). While a snapshot is being taken, the indexing thread is blocked so that no new documents are read from the queue. A snapshot uses NTFS hard links to replicate the master index into a new directory on the master server.

We have 4 searcher servers, each of which pulls snapshots from both updaters and then searches both indexes in parallel and merges the results. The searcher servers are identical (they search the same data) and sit behind a load balancer. Each searcher has a background thread that continuously scans remote directories on the master servers for new snapshot directories. When it sees that a new snapshot exists, it uses NTFS hard links to replicate the current local snapshot, pulls down only the files that changed (very similar to SOLR's index replication; I've sketched what I think is the SOLR equivalent at the end of this post), and then switches searches over to the new snapshot(s).

From the search clients' perspective, they simply issue an HTTP GET request to the load balancer and have no idea how many masters/shards exist. Results are merged/re-sorted/de-duped from all indexes transparently on the searchers.

My big question is: how can I very closely replicate this architecture with the latest version of SOLR? We don't need to replace this system, but we do need to implement a similar system for another client using SOLR. We really like our existing architecture because it provides very natural load balancing and sharding on the masters, and nice failover support for both masters and searchers. We'd like search clients to stay unaware of shards, and we'd like to avoid explicitly POSTing HTTP requests to SOLR servers when adding documents, because we prefer the more natural failover and load balancing of a shared queue.

More specific questions:

1. How can SOLR in master/updater mode be configured to read new documents from a queue? Is this possible with a custom Data Import Handler, or do I need to develop a separate service which reads from the queue and then POSTs to SOLR via HTTP? (A rough sketch of the service I'm imagining is at the end of this post.)

2. What is the best way to configure searchers to be able to merge results from more than one index?

3. Is it possible to configure which shards to search by default, rather than forcing the client to know about shards and specify them in the search request? Can I do this with a custom request handler, and if so, specifically how? (Again, a sketch of what I have in mind is at the end of this post.)

Thanks a lot!
Bob
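P.S. To make question 1 concrete, here is roughly the standalone updater service I'm imagining instead of a custom Data Import Handler: a small daemon that drains the RabbitMQ queue and pushes documents into SOLR via SolrJ, with one instance per updater server so both consume the same queue (the same natural load balancing and sharding we get today). This is only a sketch, assuming a recent RabbitMQ Java client (5.x) and SolrJ (6.x or later); the host names, queue name, field names, and pipe-delimited message format are all placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class QueueToSolrUpdater {
    public static void main(String[] args) throws Exception {
        // One of these runs on each updater server; both consume the same queue,
        // so RabbitMQ spreads documents across them (i.e. natural sharding).
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq-host");                              // placeholder
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare("index-queue", true, false, false, null); // durable queue

        DeliverCallback onMessage = (consumerTag, delivery) -> {
            try {
                // Hypothetical "id|title|body" payload -- adjust to the real message format.
                String[] fields = new String(delivery.getBody(), StandardCharsets.UTF_8).split("\\|", 3);

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", fields[0]);
                doc.addField("title", fields[1]);
                doc.addField("body", fields[2]);
                solr.add(doc);   // visibility is governed by autoCommit/autoSoftCommit in solrconfig.xml

                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            } catch (Exception e) {
                // Requeue on failure; a real service would add retry limits / a dead-letter queue.
                channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
            }
        };

        // Manual acks: if this updater dies mid-document, the message is redelivered
        // (possibly to the other updater), which is the failover behaviour we rely on today.
        channel.basicConsume("index-queue", false, onMessage, consumerTag -> {});
        // The RabbitMQ consumer threads are non-daemon, so the process keeps running here.
    }
}
```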
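For questions 2 and 3, what I think I want (assuming non-SolrCloud distributed search in Solr 4.x or later still works this way) is to bake the "shards" parameter into the search handler defaults on each searcher node. Clients would then just hit /select through the load balancer, and Solr would merge, re-sort, and de-dupe by uniqueKey across the listed shards without the client ever knowing shards exist. Host and core names below are placeholders for the two slave cores replicated from our two masters:

```xml
<!-- Sketch of a solrconfig.xml fragment for the searcher cores (legacy,
     non-SolrCloud distributed search); shard URLs are placeholders. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="shards">localhost:8983/solr/shard_a,localhost:8983/solr/shard_b</str>
    <str name="shards.qt">/select</str>
  </lst>
</requestHandler>
```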
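And for the snapshot-pull part of our architecture, my understanding is that Solr's built-in master/slave index replication is the closest analogue: the updater (master) cores publish the index after each commit (an autoCommit of about a minute would play the role of our snapshot interval), and the searcher (slave) cores poll and pull down only changed files, much like our hard-link snapshot copy. A sketch of what I think the config looks like, with placeholder URLs and a pollInterval matching our 1-minute snapshot cadence:

```xml
<!-- On each updater (master) core: -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each searcher (slave) core, one core per master (e.g. shard_a / shard_b): -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master1:8983/solr/shard_a/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>
```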