First, you shouldn't be using HttpSolrClient in a cloud setup; use CloudSolrServer (renamed CloudSolrClient in 5.x). It takes the ZooKeeper address and routes documents directly to the correct leader, reducing the number of network hops each doc has to go through. AFAIK, in cloud setups it is superior to the plain HTTP client in every way.
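For reference, the switch is mostly a one-line change. A minimal SolrJ 5.x sketch (the ZK host string and collection name here are placeholders, not values from this thread):

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // Point at the ZK ensemble, not at any one Solr node or a VIP.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title_s", "example");

        client.add(doc);   // routed straight to the shard leader via ZK cluster state
        client.commit();   // in production, prefer autoCommit over per-batch commits
        client.close();
    }
}
```

The 5.x string constructor was later deprecated in favor of a builder, but the idea is the same: the client reads cluster state from ZK, so it always knows who the leaders are.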
I'm guessing your docs aren't huge. You haven't really told us what "high indexing rates" and "high query rates" are in your environment, so it's hard to say much. For comparison, I get 2-3K docs/sec on my laptop (no query load, though).

The most frequent cause of nodes going into recovery in this scenario is the ZK timeout being exceeded, often triggered by excessive GC pauses. Some more details would help here:

* How much memory are you allocating to Solr?
* Have you turned on GC logging to see whether you're getting "stop the world" GC pauses?
* What rates _are_ you seeing?

Personally, I'd concentrate on the nodes going into recovery before anything else. Until that's fixed, anything else you do won't be predictive of much.

BTW, I typically start with batch sizes of 1,000 FWIW. Sometimes that's too big, sometimes too small, but it seems pretty reasonable most of the time.

Best,
Erick

On Thu, Apr 30, 2015 at 12:20 PM, Vinay Pothnis <poth...@gmail.com> wrote:
> Hello,
>
> I have a usecase with the following characteristics:
>
> - High index update rate (adds/updates)
> - High query rate
> - Low index size (~800MB for 2.4 million docs)
> - The documents that are created at the high rate eventually "expire"
>   and are deleted regularly at half-hour intervals
>
> I currently have a Solr Cloud setup with 1 shard and 4 replicas.
> * My index updates are sent to a VIP/load balancer (round-robins to one
>   of the 4 Solr nodes)
> * I am using the HTTP client to send the updates
> * Using a batch size of 100, with 8 to 10 threads sending batches of
>   updates to Solr.
>
> When I try to run tests to scale out the indexing rate, I see the
> following:
> * Solr nodes go into recovery
> * Updates take really long to complete.
>
> As I understand, when a node receives an update:
> * If it is the leader, it forwards the update to all the replicas and
>   waits until it receives a reply from all of them before replying to
>   the client that sent the request.
> * If it is not the leader, it forwards the update to the leader, which
>   THEN does the steps mentioned above.
>
> How do I go about scaling the index updates?
> * As I add more replicas, will my updates get slower and slower?
> * Is there a way I can configure the leader to wait for only N out of M
>   replicas?
> * Should I be targeting the updates to only the leader?
> * Any other approaches I should be considering?
>
> Thanks
> Vinay
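The batch-of-1,000 suggestion above is just a chunking loop around the client's add() call. A minimal stdlib-only sketch of that chunking (the Solr client itself is left out so the logic stands alone; in real code each batch would go to `client.add(batch)`):

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    static final int BATCH_SIZE = 1000;

    // Splits `docs` into consecutive batches of at most BATCH_SIZE.
    static <T> List<List<T>> toBatches(List<T> docs) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += BATCH_SIZE) {
            batches.add(docs.subList(i, Math.min(i + BATCH_SIZE, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add(i);

        // 2,500 docs -> 3 batches: 1000, 1000, 500
        List<List<Integer>> batches = toBatches(docs);
        System.out.println(batches.size());        // prints 3
        System.out.println(batches.get(2).size()); // prints 500
        // each batch would be sent via client.add(batch) here
    }
}
```

The batch size worth trying depends on doc size and heap; start around 1,000 and adjust from there.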