Hi all,

A couple of months ago, I migrated my Solr deployment off of some legacy 
hardware (old spinning disks) and onto much newer hardware (SSDs, newer 
processors). While search performance has improved considerably since the 
move, I am also seeing intermittent indexing timeouts in 10-15 minute windows 
about once a day (both from my indexing code and between replicas), which were 
not happening before. I have been scratching my head trying to figure out why, 
but have so far been unsuccessful. I was hoping someone here could offer some 
thoughts on how to debug this further.

Some information about my setup:
-SolrCloud 8.3, running on Linux
-2 nodes, 1 shard (2 replicas) per collection
-Handful of collections, maxing out in the tens of millions of docs per 
collection; less than 100 million docs total
-Nodes have 8 CPU cores and SSD storage. 64 GB of RAM per server, with a heap 
size of 26 GB.
-Relatively aggressive NRT tuning (hard commit 60 sec, soft commit 15 sec).
-Multi-threaded indexing process using the SolrJ CloudSolrClient, sending 
updates in batches of ~1000 docs (rough sketch below)
-Indexing and querying are done constantly throughout the day
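
For reference, the indexing code boils down to something like the sketch 
below (ZK hosts, collection, and field names are placeholders; the real code 
is multi-threaded, has error handling, and reads from our actual data source):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble and collection name -- not the real ones.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                Optional.empty()).build()) {
            client.setDefaultCollection("my_collection");

            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {  // stand-in for the real data source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_txt", "placeholder title " + i);
                batch.add(doc);
                if (batch.size() == 1000) {
                    // Send in batches of ~1000; no explicit commits -- visibility
                    // is left to autoCommit/autoSoftCommit on the server side.
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
        }
    }
}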

The indexing process, heap sizes, and soft/hard commit intervals were carefully 
tuned for my original setup and were working flawlessly until the hardware 
change. It's only since the move to faster hardware/SSDs that I am seeing 
timeouts during indexing (perhaps counterintuitively).

My first thought was that stop-the-world GC pauses were causing the timeouts, 
but when I captured GC logs during one of the timeout windows and ran them 
through a log analyzer, no issues were detected; the largest GC pause was under 
1 second. I also monitor the heap continuously, and usage always sits between 
15 and 20 GB of the 26 GB heap, so I don't think the heap is necessarily too 
small.

My next thought was that it might be segment merges happening in the 
background and causing indexing to block. I am using the dynamic defaults for 
the merge scheduler, which almost certainly changed when I moved hardware 
(since it now detects a non-spinning disk, and my understanding is that the 
maximum number of concurrent merges is set based on that). I have been unable 
to confirm this, though: I do not see any merge warnings or errors in the 
logs, and I have thus far been unable to catch a merge in action to confirm 
via a thread dump.
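
For what it's worth, my understanding (which I haven't verified against the 
8.3 source, so treat it as an assumption) is that Lucene's 
ConcurrentMergeScheduler picks those dynamic defaults based on whether the 
storage is detected as spinning, along these lines:

import org.apache.lucene.index.ConcurrentMergeScheduler;

public class MergeDefaults {
    public static void main(String[] args) {
        // Illustration only: ask ConcurrentMergeScheduler for its default
        // merge limits on spinning vs. non-spinning storage.
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();

        cms.setDefaultMaxMergesAndThreads(true);   // true = spinning disk
        System.out.println("spinning:     maxThreadCount=" + cms.getMaxThreadCount()
                + ", maxMergeCount=" + cms.getMaxMergeCount());

        cms.setDefaultMaxMergesAndThreads(false);  // false = SSD / non-spinning
        System.out.println("non-spinning: maxThreadCount=" + cms.getMaxThreadCount()
                + ", maxMergeCount=" + cms.getMaxMergeCount());
    }
}

If this does turn out to be related, I believe the values can be pinned 
explicitly via the <mergeScheduler> settings in solrconfig.xml rather than 
relying on the autodetection, but I haven't tried that yet.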

Interestingly, when I did take a thread dump during normal execution, I 
noticed that one of my nodes has a huge number of live threads (~1700) 
compared to the other node (~150). Most of them are updateExecutor threads 
that appear to be permanently stuck in a WAITING state. I'm not sure what 
causes the node to get into this state, or whether it is related to the 
timeouts at all.
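
In case it's useful for comparing the two nodes, the per-pool counts are easy 
to tally from a saved jstack dump just by grouping thread names; a rough 
sketch (the dump file name is a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class ThreadDumpTally {
    public static void main(String[] args) throws IOException {
        // jstack thread entries start with the quoted thread name, e.g.
        //   "updateExecutor-5-thread-123" #456 daemon prio=5 ... WAITING ...
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : Files.readAllLines(Paths.get("solr-node1-jstack.txt"))) {
            if (!line.startsWith("\"")) {
                continue;
            }
            int end = line.indexOf('"', 1);
            if (end < 0) {
                continue;
            }
            String name = line.substring(1, end);
            // Collapse numeric suffixes so threads from the same pool group together.
            String pool = name.replaceAll("-?\\d+", "");
            counts.merge(pool, 1, Integer::sum);
        }
        counts.forEach((pool, n) -> System.out.println(n + "\t" + pool));
    }
}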

I have so far been unable to reproduce the issue in a test environment, so 
it's hard to trial-and-error possible solutions. Does anyone have suggestions 
on what could suddenly be causing these timeouts, or tips on how to debug 
further?

Thanks!
