GC stalls cause Zookeeper timeout during uninvert for facet field

Arend-Jan Wijtzes Tue, 06 Nov 2012 03:07:33 -0800

Hi,

We are running a small Solr cluster with 8 cores on 4 machines. This
database has about 1E9 very small documents. One of the statistics we
need requires a facet on a text field with high cardinality.


During the uninvert phase for this text field the searchers experience
long stalls caused by garbage collection (pauses of 20+ seconds), which
cause Solr to lose its Zookeeper lease. Often they do not recover
gracefully, and as a result the cluster becomes degraded:

"SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props"

This is a known open issue.

I explored several options to try to work around this. However, I'm new
to Solr and need some help.

We tried running more cores:
We went from 4 to 8 cores. Does it make sense to go to 16 cores on 4
machines?
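
To make the question concrete, here is a rough back-of-the-envelope sketch
in Python of how many documents each core would hold at different core
counts. It assumes the roughly 1E9 documents are spread evenly over the
cores, which may not hold exactly in practice; the function name is made up
for illustration.

```python
TOTAL_DOCS = 1_000_000_000  # "about 1E9" documents, as stated above
MACHINES = 4                # number of machines in the cluster


def docs_per_core(total_docs, cores):
    """Documents per core, assuming a perfectly even distribution (an assumption)."""
    return total_docs / cores


if __name__ == "__main__":
    for cores in (4, 8, 16):
        per_core = docs_per_core(TOTAL_DOCS, cores)
        print(f"{cores} cores ({cores // MACHINES} per machine): "
              f"{per_core:,.0f} documents per core")
```

Fewer documents per core should mean less to uninvert per searcher, but this
arithmetic alone does not tell whether the GC pauses shrink accordingly.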


GC tuning:
This helped a lot but not enough to prevent the lease expirations. I'm
by no means a Java GC expert and would appreciate any tips to improve
this further. Current settings are:

Java HotSpot(TM) 64-Bit Server VM (20.0-b11)
-Xloggc:/home/solr/solr/log/gc.log
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintTenuringDistribution
-XX:+PrintClassHistogram
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=75
-XX:MaxGCPauseMillis=10000
-XX:+CMSIncrementalMode
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-Djava.awt.headless=true
-Xss256k
-Xmx18g
-Xms1g
-DzkHost=ds30:2181,ds31:2181,ds32:2181

Actual memory stats according to top are: 74GB virtual, 11GB resident.
The GC log shows:
- age   1:   39078968 bytes,   39078968 total
: 342633K->38290K(345024K), 24.7992520 secs]
9277535K->9058682K(11687832K) icms_dc=73 , 24.7993810 secs] [Times:
user=366.87 sys=26.31, real=24.79 secs]
Total time for which application threads were stopped: 24.8005790
seconds
975.478: [GC 975.478: [ParNew
Desired survivor size 19628032 bytes, new threshold 1 (max 4)
- age   1:   38277672 bytes,   38277672 total
: 343750K->37537K(345024K), 22.4217640 secs]
9364142K->9131962K(11687832K) icms_dc=73 , 22.4218650 secs] [Times:
user=331.25 sys=23.85, real=22.42 secs]
Total time for which application threads were stopped: 22.4231750
seconds

etc.
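
The pauses in the log above can be pulled out and compared against the
Zookeeper session timeout with a small script. This is a minimal sketch in
Python, not a complete GC log parser. The 15 second timeout is an assumption
and should be replaced with the cluster's actual zkClientTimeout; the helper
names are made up for illustration. The sample text is copied from the
excerpt above, including the line wrapping.

```python
import re

# Sample copied from the GC log excerpt above (line wrapping preserved).
GC_LOG = """\
Total time for which application threads were stopped: 24.8005790
seconds
975.478: [GC 975.478: [ParNew
Desired survivor size 19628032 bytes, new threshold 1 (max 4)
- age   1:   38277672 bytes,   38277672 total
: 343750K->37537K(345024K), 22.4217640 secs]
9364142K->9131962K(11687832K) icms_dc=73 , 22.4218650 secs] [Times:
user=331.25 sys=23.85, real=22.42 secs]
Total time for which application threads were stopped: 22.4231750
seconds
"""

# ASSUMPTION: session timeout in seconds; use the real zkClientTimeout here.
ASSUMED_ZK_TIMEOUT_S = 15.0

# \s* lets the match span the line break before "seconds".
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped:\s*([0-9.]+)\s*seconds"
)


def stopped_times(log_text):
    """Return every reported application stop time, in seconds."""
    return [float(m.group(1)) for m in STOPPED_RE.finditer(log_text)]


def pauses_over(log_text, limit_s):
    """Return only the stop times longer than limit_s seconds."""
    return [p for p in stopped_times(log_text) if p > limit_s]


if __name__ == "__main__":
    for pause in pauses_over(GC_LOG, ASSUMED_ZK_TIMEOUT_S):
        print(f"pause of {pause:.2f}s exceeds the assumed "
              f"{ASSUMED_ZK_TIMEOUT_S:.0f}s session timeout")
```

Run against the full gc.log, this would show how often a pause is long
enough to risk losing the lease.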


Solr version:
4.0.0.2012.10.06.03.04.33

Current hardware consists of 4 machines, of which each has:
2x E5645 CPU, total of 24 cores
48GB mem
8 x SATA 7200RPM in raid 10


What would be a good strategy to get this database to perform the way
we need it to? Would it make sense to split it up into 16 shards?
Are there ways to improve the GC behavior?

Any help would be greatly appreciated.

AJ

-- 
Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl
