Quick follow-up. I have raised the RPC timeout to 30s. Using Hector, I'm running 1 thread that sends batches of 10 mutations at a time, so that's even slower than when I was running 16 threads in parallel doing non-batched mutations. After a couple hundred execute() calls, I get a timeout for every node; I have a 15-second grace period between retries. tpstats shows nothing pending on any of the nodes. I never recover from that.
I then set the batch size to one and it seems to work a lot better. The only difference I've noticed is that the result returned by Mutator.execute() sometimes has a null host and a 0-microsecond time with batches of ten, but never with batches of one. I'm stumped! Any ideas? Thanks