Hi guys.

I have a SolrCloud cluster consisting of 3 ZooKeeper VMs running 3.4.5, backported from Ubuntu 14.04 LTS to 12.04 LTS.

They orchestrate 4 Solr nodes, each of which hosts 2 cores. Each core is sharded so that one shard sits on each of the Solr nodes.

Solr runs under Tomcat 7 on Ubuntu's latest OpenJDK 7.

The Solr version is 4.2.1.

Each of the nodes has around 7 GB of data, and the JVM is set to an 8 GB heap. All Solr nodes have 16 GB of RAM.


A few weeks back we started having issues with this installation. Tomcat was filling up catalina.out with the following messages:

SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:


The only solution was to restart all 4 Tomcats on the 4 Solr nodes. After that the issue would rectify itself, but it would occur again roughly a week after the restart.

The last time this happened was yesterday, and I succeeded in recording some of what was happening on the boxes via Zabbix and atop.


Basically, at 15:35 the load on the machine went berserk, jumping from around 0.5 to 30+.

Zabbix and atop didn't show any heavy IO and all the other processes were practically idle; only the JVM (Tomcat) exploded, with CPU usage increasing from the usual ~80% to around ~750%.

These are parts of the atop recordings on one of the nodes. Note that they are 10 minutes apart:

(15:28:42)
CPL | avg1    0.12  |               | avg5    0.36  | avg15   0.38  |

(15:38:42)
CPL | avg1    8.54  |               | avg5    3.62  | avg15   1.61  |

(15:48:42)
CPL | avg1   30.14  |               | avg5   27.09  | avg15  14.73  |



This is the status of the Tomcat process at the last sample (15:48:42):
28891 tomcat7 tomcat7 411 8.68s 70m14s 209.9M 204K 0K 5804K -- - S 5 704% java
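
Next time this happens I plan to watch the collector directly while the CPU is pegged. A minimal sketch of what I'd run (assuming the Tomcat PID from atop, 28891 here, and that the OpenJDK 7 JDK tools such as jstat are installed):

# Sample GC counters every 5 s; an FGC count that keeps climbing while the
# O (old gen) column sits near 100% would point at back-to-back full collections.
sudo -u tomcat7 /usr/lib/jvm/java-7-openjdk-amd64/bin/jstat -gcutil 28891 5000

# SIGQUIT makes the JVM write a thread dump to catalina.out, which should show
# whether the busy threads are GC threads or request threads.
sudo -u tomcat7 kill -3 28891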


I noticed similar behaviour across the other Solr nodes. At 17:41 the on-call person decided to hard reset all the Solr nodes, and the cloud came back up running normally after that.

These are the logs I found on the first node:

Aug 17, 2014 3:44:58 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:

Aug 17, 2014 3:46:12 PM org.apache.solr.cloud.OverseerCollectionProcessor run
WARNING: Overseer cannot talk to ZK
Aug 17, 2014 3:46:12 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader

Then a bunch of:

Aug 17, 2014 3:46:42 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:

until the server was rebooted.
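
The session expiry above suggests the node failed to heartbeat ZooKeeper within its client session timeout, which is exactly what a long stop-the-world GC pause would cause. If that turns out to be the case, one thing I'm considering is raising the timeout to buy some headroom. A sketch, assuming our solr.xml picks up the zkClientTimeout system property (the stock 4.x solr.xml defaults to ${zkClientTimeout:15000}) and that the existing JVM options live in /etc/default/tomcat7:

# Raise the Solr -> ZooKeeper session timeout from the 15 s default to 30 s;
# as far as I can tell this still fits under ZooKeeper's default
# maxSessionTimeout of 20 x tickTime (40 s with tickTime=2000).
JAVA_OPTS="$JAVA_OPTS -DzkClientTimeout=30000"

That only buys headroom, of course; if the pauses are tens of seconds long, the GC itself still needs fixing.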


On the other nodes I can see:
node2:

Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=10.100.254.103:8080_solr_myappcore=myapp
Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=10.100.254.103:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:46:24 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://node1:8080/solr/myapp

node4:

Aug 17, 2014 3:44:06 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=10.100.254.105:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:44:09 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=10.100.254.105:8080_solr_myappcore=myapp
Aug 17, 2014 3:45:37 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
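
On the ZooKeeper side I still need to cross-check whether the servers actually expired those sessions at the same time. Something along these lines, assuming the stock Ubuntu zookeeper packaging that logs under /var/log/zookeeper/ (the path may differ here):

# Look for session expirations around the incident window on each ZK VM.
grep -i expir /var/log/zookeeper/zookeeper.log

# And confirm the ensemble itself was healthy at the time.
echo stat | nc localhost 2181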




My impression is that the garbage collector is at fault here.

This is the Tomcat command line:

/usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.util.logging.config.file=/var/lib/tomcat7/conf/logging.properties -Djava.awt.headless=true -Xmx8192m -XX:+UseConcMarkSweepGC -DnumShards=2 -Djetty.port=8080 -DzkHost=10.215.1.96:2181,10.215.1.97:2181,10.215.1.98:2181 -javaagent:/opt/newrelic/newrelic.jar -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/share/tomcat7/endorsed -classpath /usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar -Dcatalina.base=/var/lib/tomcat7 -Dcatalina.home=/usr/share/tomcat7 -Djava.io.tmpdir=/tmp/tomcat7-tomcat7-tmp org.apache.catalina.startup.Bootstrap start


So, I am already using the concurrent mark-sweep (CMS) collector.
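
To confirm (or rule out) the GC theory, I'm thinking of enabling GC logging so the next incident leaves a trace of the pause times. A sketch of the flags I'd append to JAVA_OPTS (again assuming /etc/default/tomcat7 is where the existing -Xmx and GC flags are set); these are all standard HotSpot 7 options:

# Log every collection with timestamps plus total stopped time, so any long
# stop-the-world pause around the time of the spike shows up directly.
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
 -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/tomcat7/gc.log"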

Do you have any suggestions on how I can debug this further and potentially eliminate the issue causing the downtime?
