Here’s an earlier post where I mentioned some GC investigation tools:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c8f8fa32d-ec0e-4352-86f7-4b2d8a906...@whitepages.com%3E

In my experience, there are many aspects of the Solr/Lucene memory allocation 
model that scale with things other than the number of documents returned (such 
as cardinality, or simply index size). A single query on a large index might 
consume dozens of megabytes of heap to complete, but that heap should also be 
released quickly after the query finishes.
The key characteristic of a memory leak is that the software is allocating 
memory that it cannot reclaim. If it’s a leak, you ought to be able to 
reproduce it at any query rate - have you tried this? A run with, say, half the 
rate, over twice the duration?
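
As a rough sketch of what I mean (SolrJ 4.x style - the URL, collection name, 
query, and class name below are placeholder assumptions, not anything from your 
setup), something like this holds the query rate constant so two runs are 
directly comparable while you watch the heap with jstat -gcutil or a GC log:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Fires the same query at a fixed interval so runs at different rates are
// directly comparable. All names here are illustrative placeholders.
public class FixedRateQueryDriver {
    public static void main(String[] args) throws Exception {
        // e.g. "1000" for ~1 query/sec, "2000" for half that rate
        long intervalMs = args.length > 0 ? Long.parseLong(args[0]) : 1000L;
        long durationMs = 30L * 60 * 1000;  // run long enough to see a trend
        HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/mycollection");
        SolrQuery q = new SolrQuery("*:*");  // substitute a representative query
        long end = System.currentTimeMillis() + durationMs;
        while (System.currentTimeMillis() < end) {
            try {
                solr.query(q);
            } catch (Exception e) {
                System.err.println("query failed: " + e);
            }
            Thread.sleep(intervalMs);  // crude, but fine for a comparison run
        }
        solr.shutdown();
    }
}

If old-gen usage climbs without bound at half the rate just as it does at full 
rate, that points at a genuine leak; if it levels off, you’re looking at 
allocation pressure rather than a leak.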

I’m inclined to agree with others here: although you’ve correctly attributed 
the cause to GC, this is probably less an indication of a leak and more an 
indication of memory being allocated faster than it can be reclaimed, combined 
with the long pauses that become increasingly unavoidable as heap size goes up.
Note that in the case of a CMS allocation failure, the fallback full GC is 
*single-threaded*, which means it’ll usually take considerably longer than a 
normal GC - even for a comparable amount of garbage.
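
If you want to see that happening, GC logging makes it explicit; with Java 7/8 
HotSpot, options along these lines (the log path is just an example) record 
each pause, and a CMS fallback shows up in the log as a "concurrent mode 
failure" followed by an unusually long full GC:

    -verbose:gc
    -Xloggc:/path/to/solr_gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime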

In addition to GC tuning, you can address these issues by sharding more, at 
both the core and JVM level.
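
For example (hypothetical names and counts - not a sizing recommendation), the 
Collections API can create a collection with more shards spread over more JVMs, 
which you’d then reindex into:

    http://hostname:8983/solr/admin/collections?action=CREATE&name=mycollection_v2&numShards=16&replicationFactor=2&maxShardsPerNode=2

The SPLITSHARD action is another route if a full reindex isn’t practical.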


On 12/4/16, 3:46 PM, "Shawn Heisey" <apa...@elyograg.org> wrote:

    On 12/3/2016 9:46 PM, S G wrote:
    > The symptom we see is that the java clients querying Solr see response
    > times in 10s of seconds (not milliseconds).
    <snip>
    > Some numbers for the Solr Cloud:
    >
    > *Overall infrastructure:*
    > - Only one collection
    > - 16 VMs used
    > - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
    >
    > *Overview from one core:*
    > - Num Docs:193,623,388
    > - Max Doc:230,577,696
    > - Heap Memory Usage:231,217,880
    > - Deleted Docs:36,954,308
    > - Version:2,357,757
    > - Segment Count:37
    
    The heap memory usage number isn't useful.  It doesn't cover all the
    memory used.
    
    > *Stats from QueryHandler/select*
    > - requests:78,557
    > - errors:358
    > - timeouts:0
    > - totalTime:1,639,975.27
    > - avgRequestsPerSecond:2.62
    > - 5minRateReqsPerSecond:1.39
    > - 15minRateReqsPerSecond:1.64
    > - avgTimePerRequest:20.87
    > - medianRequestTime:0.70
    > - 75thPcRequestTime:1.11
    > - 95thPcRequestTime:191.76
    
    These times are in *milliseconds*, not seconds .. and these are even
    better numbers than you showed before.  Where are you seeing 10 plus
    second query times?  Solr is not showing numbers like that.
    
    If your VM host has 16 VMs on it and each one has a total memory size of
    92GB, then if that machine doesn't have 1.5 terabytes of memory, you're
    oversubscribed, and this is going to lead to terrible performance... but
    the numbers you've shown here do not show terrible performance.
    
    > Plus, on every server, we are seeing lots of exceptions.
    > For example:
    >
    > Between 8:06:55 PM and 8:21:36 PM, exceptions are:
    >
    > 1) Request says it is coming from leader, but we are the leader:
    > update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
    >
    > 2) org.apache.solr.common.SolrException: Request says it is coming from
    > leader, but we are the leader
    >
    > 3) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 4) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 5) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 6) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 7) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
    > available to handle this request. Zombie server list:
    > [HOSTA_ca_1_1456429897]
    >
    > 8) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
    > available to handle this request. Zombie server list:
    > [HOSTA_ca_1_1456429897]
    >
    > 9) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 10) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 11) org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    >
    > 12) null:org.apache.solr.common.SolrException:
    > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
    > operation and it timed out, so failing fast
    
    These errors sound like timeouts, possibly caused by long GC pauses ...
    but as already mentioned, the query handler statistics do not indicate
    long query times.  If a long GC were to happen during a query, then the
    query time would be long as well.
    
    The core information above doesn't include the size of the index on
    disk.  That number would be useful for telling you whether there's
    enough memory.
    
    As I said at the beginning of the thread, I haven't seen anything here
    to indicate a memory leak, and others are using version 4.10 without any
    problems.  If there were a memory leak in a released version of Solr,
    many people would have run into problems with it.
    
    Thanks,
    Shawn
    
    
