Hi guys.
I have a SolrCloud cluster consisting of 3 ZooKeeper VMs running 3.4.5,
backported from Ubuntu 14.04 LTS to 12.04 LTS.
They coordinate 4 Solr nodes, each hosting 2 cores. Each core is
sharded so that one shard sits on each of the Solr nodes.
Solr runs under Tomcat 7 on Ubuntu's latest OpenJDK 7.
The Solr version is 4.2.1.
Each node holds around 7 GB of data, and the JVM is set to an 8 GB
heap. All Solr nodes have 16 GB of RAM.
A few weeks back we started having issues with this installation.
Tomcat was filling catalina.out with the following messages:
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
The only solution was to restart all 4 Tomcats on the 4 Solr nodes.
After that the issue would rectify itself, but it would recur roughly
a week after each restart.
This happened most recently yesterday, and I succeeded in recording
some of what was happening on the boxes via Zabbix and atop.
At 15:35 the load on the machine went berserk, jumping from around
0.5 to 30+.
Zabbix and atop didn't show any heavy I/O; all the other processes
were practically idle. Only the JVM (Tomcat) exploded, with CPU usage
climbing from the usual ~80% to around ~750%.
These are parts of the atop recordings on one of the nodes. Note that
they are 10 minutes apart:
(15:28:42)
CPL | avg1 0.12 | avg5 0.36 | avg15 0.38 |
(15:38:42)
CPL | avg1 8.54 | avg5 3.62 | avg15 1.61 |
(15:48:42)
CPL | avg1 30.14 | avg5 27.09 | avg15 14.73 |
This is the status of the Tomcat process at the last sample (15:48:42):
28891 tomcat7 tomcat7 411 8.68s 70m14s 209.9M 204K 0K 5804K -- - S 5 704% java
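Next time the spike happens I plan to capture per-thread CPU usage and a
thread dump, so the hot threads can be matched to stack traces. A minimal
sketch, assuming jstack from the JDK is on the PATH and using the Tomcat
PID from the atop output above (28891); the thread id 28905 is only a
hypothetical example:

```shell
# Per-thread CPU view of the Tomcat JVM (each thread shows as a row):
#   top -b -H -n1 -p 28891 > top_threads.txt
# Stack dump of the same JVM:
#   jstack 28891 > jstack.txt
# top/ps print thread ids in decimal, while jstack tags each thread
# with "nid=0x..." in hex, so convert the hottest thread id to match:
printf 'nid=0x%x\n' 28905    # hypothetical hot thread id from top -H
```

Searching jstack.txt for the resulting nid should show whether the busy
threads are GC threads or request-handling threads.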
I noticed similar behaviour on the other Solr nodes. At 17:41 the
on-call person decided to hard-reset all the Solr nodes, and the cloud
came back up running normally after that.
These are the logs I found on the first node:
Aug 17, 2014 3:44:58 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
Aug 17, 2014 3:46:12 PM
org.apache.solr.cloud.OverseerCollectionProcessor run
WARNING: Overseer cannot talk to ZK
Aug 17, 2014 3:46:12 PM
org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader
Then a bunch of:
Aug 17, 2014 3:46:42 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
until the server was rebooted.
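The session-expired errors above suggest the Solr JVM stopped
heartbeating to ZooKeeper for longer than the zkClientTimeout, which is
exactly what a long stop-the-world GC pause would cause. As a stopgap I
could raise that timeout via a system property; a sketch of the extra
JVM option, assuming solr.xml still uses the stock
${zkClientTimeout:15000} placeholder from Solr 4.x:

```shell
# Raise Solr's ZooKeeper session timeout from the 15s default to 30s.
# Only takes effect if solr.xml reads zkClientTimeout from the
# ${zkClientTimeout:15000} placeholder.
JAVA_OPTS="$JAVA_OPTS -DzkClientTimeout=30000"
```

This would only buy headroom for shorter pauses, not fix the underlying
CPU spike.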
On the other nodes I can see:
node2:
Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.103:8080_solr_myappcore=myapp
Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.103:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:46:24 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: IOException occured
when talking to server at: http://node1:8080/solr/myapp
node4:
Aug 17, 2014 3:44:06 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.105:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:44:09 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.105:8080_solr_myappcore=myapp
Aug 17, 2014 3:45:37 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
My impression is that the garbage collector is at fault here.
This is the Tomcat command line:
/usr/lib/jvm/java-7-openjdk-amd64/bin/java
-Djava.util.logging.config.file=/var/lib/tomcat7/conf/logging.properties
-Djava.awt.headless=true -Xmx8192m -XX:+UseConcMarkSweepGC -DnumShards=2
-Djetty.port=8080
-DzkHost=10.215.1.96:2181,10.215.1.97:2181,10.215.1.98:2181
-javaagent:/opt/newrelic/newrelic.jar -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.endorsed.dirs=/usr/share/tomcat7/endorsed -classpath
/usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar
-Dcatalina.base=/var/lib/tomcat7 -Dcatalina.home=/usr/share/tomcat7
-Djava.io.tmpdir=/tmp/tomcat7-tomcat7-tmp
org.apache.catalina.startup.Bootstrap start
So, I am using the concurrent mark-sweep (CMS) collector.
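To confirm or rule out GC, it seems worth turning on GC logging, so
that a full GC or long CMS pause around 15:35 would show up directly in
a log file. A sketch of the extra HotSpot 7 options (the log path is an
assumption; adjust to taste):

```shell
# HotSpot JDK 7 GC logging options; append to the existing JVM flags.
# PrintGCApplicationStoppedTime records total stop-the-world time,
# which is what would break the ZooKeeper session.
JAVA_OPTS="$JAVA_OPTS \
  -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/tomcat7/gc.log"
```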
Do you have any suggestions on how I can debug this further and
potentially eliminate the issue causing these downtimes?