Re: Best way to track cumulative GC pauses in Solr
On Fri, Nov 13, 2015 at 4:50 PM, Walter Underwoodwrote: > Also, what GC settings are you using? We may be able to make some suggestions. > > Cumulative GC pauses aren’t very interesting to me. I’m more interested in > the longest ones, 90th percentile, 95th, etc. > Any advice would be great, but what I'm primarily interested in is how people are monitoring these statistics in real time, for all time, on production servers. Eg, for looking at the disk or RAM usage of one of my servers, I can look at the historical usage in the last week, last month, last year and so on. I need to get these stats in to the same monitoring tools as we use for monitoring every other vital aspect of our servers. Looking at log files can be useful, but I don't want to keep arbitrarily large log files on our servers, nor extract data from them, I want to record it for posterity in one system that understands sampling. We already use and maintain our own munin systems, so I'm not interested in paid-for equivalents of munin - regardless of how simple to set up they are, they don't integrate with our other performance monitoring stats, and I would never get budget anyway. So really: 1) Is it OK to turn JMX monitoring on on production systems? The comments in solr.in.sh suggest not. 2) What JMX beans and attributes should I be using to monitor GC pauses, particularly maximum length of a single pause in a period, and the total length of pauses in that period? Cheers Tom
Best way to track cumulative GC pauses in Solr
Hi all We have some issues with our Solr servers spending too much time paused doing GC. From turning on gc debug, and extracting numbers from the GC log, we're getting an idea of just how much of a problem. I'm currently doing this in a hacky, inefficient way: grep -h 'Total time for which application threads were stopped:' solr_gc* \ | awk '($11 > 0.3) { print $1, $11 }' \ | sed 's#:.*:##' \ | sort -n \ | sum_by_date.py (Yes, I really am using sed, grep and awk all in one line. Just wrong :) The "sum_by_date.py" program simply adds up all the values with the same first column, and remembers the largest value seen. This is giving me the cumulative GC time for extended pauses (over 0.5s), and the maximum pause seen in a given time period (hourly), eg: 2015-11-13T11 119.124037 2.203569 2015-11-13T12 184.683309 3.156565 2015-11-13T13 65.934526 1.978202 2015-11-13T14 63.970378 1.411700 This is fine for seeing that we have a problem. However, really I need to get this in to our monitoring systems - we use munin. I'm struggling to work out the best way to extract this information for our monitoring systems, and I think this might be my naivety about Java, and working out what should be logged. I've turned on JMX debugging, and looking at the different beans available using jconsole, but I'm drowning in information. What would be the best thing to monitor? Ideally, like the stats above, I'd like to know the cumulative time spent paused in GC since the last poll, and the longest GC pause that we see. munin polls every 5 minutes, are there suitable counters exposed by JMX that it could extract? Thanks in advance Tom
Re: Best way to track cumulative GC pauses in Solr
On 11/13/2015 8:00 AM, Tom Evans wrote: > We have some issues with our Solr servers spending too much time > paused doing GC. From turning on gc debug, and extracting numbers from > the GC log, we're getting an idea of just how much of a problem. Try loading your gc log into gcviewer. https://github.com/chewiebug/GCViewer/releases Here's a screenshot of this in action with a gc log from Solr loaded: https://www.dropbox.com/s/orwt0fcmii5691l/solr-gc-gcviewer-1.35-snapshot.png?dl=0 This screenshot is from a snapshot build including a feature request that I made: https://github.com/chewiebug/GCViewer/issues/139 If you use the 1.34.1 version, you will not see some of the numbers shown in my screenshot, but the info you asked for, accumulated GC pauses, IS included in that version. Thanks, Shawn
Re: Best way to track cumulative GC pauses in Solr
Also, what GC settings are you using? We may be able to make some suggestions. Cumulative GC pauses aren’t very interesting to me. I’m more interested in the longest ones, 90th percentile, 95th, etc. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 13, 2015, at 8:32 AM, Shawn Heiseywrote: > > On 11/13/2015 8:00 AM, Tom Evans wrote: >> We have some issues with our Solr servers spending too much time >> paused doing GC. From turning on gc debug, and extracting numbers from >> the GC log, we're getting an idea of just how much of a problem. > > Try loading your gc log into gcviewer. > > https://github.com/chewiebug/GCViewer/releases > > Here's a screenshot of this in action with a gc log from Solr loaded: > > https://www.dropbox.com/s/orwt0fcmii5691l/solr-gc-gcviewer-1.35-snapshot.png?dl=0 > > This screenshot is from a snapshot build including a feature request > that I made: > > https://github.com/chewiebug/GCViewer/issues/139 > > If you use the 1.34.1 version, you will not see some of the numbers > shown in my screenshot, but the info you asked for, accumulated GC > pauses, IS included in that version. > > Thanks, > Shawn >
Re: Best way to track cumulative GC pauses in Solr
Hi Tom, SPM for SOLR should be helpful here. See http://sematext.com/spm Otis > On Nov 13, 2015, at 10:00, Tom Evanswrote: > > Hi all > > We have some issues with our Solr servers spending too much time > paused doing GC. From turning on gc debug, and extracting numbers from > the GC log, we're getting an idea of just how much of a problem. > > I'm currently doing this in a hacky, inefficient way: > > grep -h 'Total time for which application threads were stopped:' solr_gc* \ >| awk '($11 > 0.3) { print $1, $11 }' \ >| sed 's#:.*:##' \ >| sort -n \ >| sum_by_date.py > > (Yes, I really am using sed, grep and awk all in one line. Just wrong :) > > The "sum_by_date.py" program simply adds up all the values with the > same first column, and remembers the largest value seen. This is > giving me the cumulative GC time for extended pauses (over 0.5s), and > the maximum pause seen in a given time period (hourly), eg: > > 2015-11-13T11 119.124037 2.203569 > 2015-11-13T12 184.683309 3.156565 > 2015-11-13T13 65.934526 1.978202 > 2015-11-13T14 63.970378 1.411700 > > > This is fine for seeing that we have a problem. However, really I need > to get this in to our monitoring systems - we use munin. I'm > struggling to work out the best way to extract this information for > our monitoring systems, and I think this might be my naivety about > Java, and working out what should be logged. > > I've turned on JMX debugging, and looking at the different beans > available using jconsole, but I'm drowning in information. What would > be the best thing to monitor? > > Ideally, like the stats above, I'd like to know the cumulative time > spent paused in GC since the last poll, and the longest GC pause that > we see. munin polls every 5 minutes, are there suitable counters > exposed by JMX that it could extract? > > Thanks in advance > > Tom