Thanks Emir and Zisis. I added maxRamMB for the filterCache and reduced its size. I could see the benefit immediately: the hit ratio went up to 0.97. Here's the configuration:
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128" maxRamMB="500" />
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128" />
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" />

It seemed stable for a few days; the cache hit ratios and JVM pool utilization were well within the expected range. But the OOM issue occurred on one of the nodes when heap usage reached 30 GB. The hit ratios for the query result cache and document cache at that point were recorded as 0.18 and 0.65. I'm not sure the caches caused the memory spike at this point; with the filter cache restricted to 500 MB, its contribution should be negligible. One thing I noticed is that the eviction rate (since adding maxRamMB) is staying at 0. An index hard commit happens every 10 minutes, and that's when the caches get flushed. Based on the monitoring logs, the spike happened on the indexing side, where almost 8k docs went into a pending state.

From a query performance standpoint, there have been occasional slow queries (1 sec+), but nothing alarming so far. The same goes for deep paging; I haven't seen any evidence pointing to that.

Based on the hit ratios, I can further scale down the query result and document caches, and also change them to FastLRUCache and add maxRamMB. For the filter cache, I think this setting should be optimal enough to work within a 30 GB heap, unless I'm wrong about the maxRamMB concept.

I'll have to get a heap dump somehow; unfortunately, the whole process (of the node going down) happens so quickly that I hardly have any time to run a profiler.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
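For reference, here is a sketch of what that scale-down of the query result and document caches might look like in solrconfig.xml. The sizes and maxRamMB values below are guesses on my part (picked to reflect the low 0.18 / 0.65 hit ratios), not recommendations, and my understanding is that once maxRamMB is set on a FastLRUCache the RAM limit is what governs eviction:

<queryResultCache class="solr.FastLRUCache" size="256" initialSize="256" autowarmCount="64" maxRamMB="64" />
<documentCache class="solr.FastLRUCache" size="256" initialSize="256" autowarmCount="0" maxRamMB="128" />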
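On the heap dump: rather than racing to attach a profiler before the node dies, you can have the JVM write the dump itself at the moment of the OOM using the standard HotSpot flags. Assuming Solr is started via the stock scripts, adding something like this to solr.in.sh should do it (the dump path is just an example; make sure the disk has room for a ~30 GB file):

SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/dumps"

The dump can then be inspected offline with a tool such as Eclipse MAT, which should show whether the caches or the pending indexing documents dominate the heap.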