Re: Java GC issue investigation

2020-10-07 Thread Walter Underwood
The first thing to do is stop using CMS and use G1GC.

We’ve been using these settings with over a hundred machines
in prod for nearly four years.

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
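
One caveat on the GC_TUNE block above: -XX:+AggressiveOpts is deprecated in newer JDKs (11 and later) and has since been removed, so a sketch of the same solr.in.sh settings for those JDKs would simply drop that flag:

# Sketch only: the same G1 settings minus -XX:+AggressiveOpts, for JDK 11+
SOLR_HEAP=8g
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
"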


Re: Java GC issue investigation

2020-10-07 Thread Karol Grzyb
Hi Matthew, Erick!

Thank you very much for the feedback; I'll try to convince them to
reduce the heap size.

Current GC settings:

-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
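
To quantify how much time is actually lost to GC during the tests, I could also add GC logging next to the flags above; a rough sketch (Java 8 style flags, and the log path is just a guess):

# Sketch only: GC logging to measure pause times during the perf test
-Xloggc:/var/solr/logs/solr_gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=9
-XX:GCLogFileSize=20M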

Kind regards,
Karol



Re: Java GC issue investigation

2020-10-06 Thread Erick Erickson
12G is not that huge; it's surprising that you're seeing this problem.

However, there are a couple of things to look at:

1> If you’re saying that you have 16G total physical memory and are allocating 
12G to Solr, that’s an anti-pattern. See: 
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If at all possible, you should allocate between 25% and 50% of your physical 
memory to Solr...

2> What garbage collector are you using? G1GC might be a better choice.
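
Re point 1>: on a 16G box that works out to roughly 4G to 8G of heap. A minimal sketch of the change in solr.in.sh (the exact value is an assumption; measure and adjust):

# Sketch only: heap at ~25% of 16G physical RAM, leaving the rest
# for the OS page cache that MMapDirectory depends on
SOLR_HEAP=4g
# or, equivalently:
# SOLR_JAVA_MEM="-Xms4g -Xmx4g"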


Re: Java GC issue investigation

2020-10-06 Thread matthew sporleder
Your index is so small that it should easily get cached in OS memory
as it is accessed. Having too big a heap is a known problem.

https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?
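
A quick way to sanity-check that on the box (the paths are an assumption, adjust to your install):

# Sketch only: compare the on-disk index size with what the OS has cached
du -sh /var/solr/data/*/data/index
free -h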


Re: Java GC issue investigation

2020-10-06 Thread Karol Grzyb
Hi Matthew,

Thank you for the answer. I cannot reproduce the setup locally, so I'll
try to convince them to reduce Xmx; I guess they won't agree to 1GB,
but certainly to something less than 12G.
I'd also like a proper dev setup, because for now we can only test on
prod or stage, which are difficult to adjust.

Is getting stuck in GC common behaviour under heavier load when the
index is small compared to the available heap? I was more worried
about the ratio of heap to total host memory.

Regards,
Karol


Re: Java GC issue investigation

2020-10-06 Thread matthew sporleder
You have a 12G heap for a 200MB index? Can you just try changing Xmx
to, like, 1g?


Java GC issue investigation

2020-10-06 Thread Karol Grzyb
Hi,

I'm involved in the investigation of an issue with huge GC overhead
that occurs during performance tests on Solr nodes. The Solr version is
6.1. The last tests were done on a staging environment, and we ran into
problems at fewer than 100 requests/second.

The index itself is ~200MB (~50K docs), and it gets small updates every 15 minutes.



Queries involve sorting and faceting.

I've gathered some heap dumps, and I can see from them that most of the
heap memory is retained by objects of the following classes:

- org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector (>4G, 91% of heap)
- org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
- org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
- org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector (>3.7G, 76% of heap)



Based on the information above, is there anything generic that could be
looked at as a source of potential improvement, without diving deeply
into the schema and queries (which may be very difficult to change at
the moment)? I don't see docValues being enabled; could this help? If I
read the docs correctly, it is specifically helpful when there is a lot
of sorting/grouping/faceting.
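
For illustration, docValues is a per-field schema setting; a minimal sketch of what it could look like in schema.xml / managed-schema (the field and type names are made up, not from our schema, and a full reindex is needed afterwards):

<!-- Sketch only: hypothetical sort/facet field with docValues enabled -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>

With docValues enabled, sorting, grouping and faceting read column-oriented structures from disk instead of building large field caches on the heap.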

Additionally, I see that many threads are blocked on LRUCache.get;
should I recommend switching to FastLRUCache?
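
If we do try that, it is a one-line class swap per cache in solrconfig.xml; a rough sketch for the filter cache (the sizes are just the stock defaults, and whichever cache the threads actually block on would be the one to change):

<!-- Sketch only: FastLRUCache trades slower puts for lock-free gets -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>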

Also, I wonder if -Xmx12288m for the Java heap isn't too much for 16G
of memory? I see some page faults (~5/s) in Dynatrace during the
heaviest traffic.

Thank you very much for any help,
Kind regards,
Karol