Hi guys.
I have a SolrCloud cluster consisting of 3 ZooKeeper VMs running 3.4.5,
backported from Ubuntu 14.04 LTS to 12.04 LTS.
They coordinate 4 Solr nodes, each hosting 2 cores. Each core is
sharded so that one shard sits on each of the Solr nodes.
Solr runs under Tomcat 7 on Ubuntu's latest OpenJDK 7.
The Solr version is 4.2.1.
Each node holds around 7 GB of data, and the JVM is set to an 8 GB
heap. All Solr nodes have 16 GB of RAM.
A few weeks back we started having issues with this installation.
Tomcat was filling catalina.out with the following messages:
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
The only solution was to restart all 4 Tomcats on the 4 Solr nodes.
After that the issue would rectify itself, but it would recur roughly
a week after each restart.
This happened most recently yesterday, and I succeeded in recording
some of what was happening on the boxes via Zabbix and atop.
At 15:35 the load on the machine went berserk, jumping from around
0.5 to 30+.
Zabbix and atop didn't show any heavy I/O; all the other processes
were practically idle. Only the JVM (Tomcat) exploded, with CPU usage
climbing from the usual ~80% to around ~750%.
These are parts of the atop recordings on one of the nodes. Note that
they are 10 minutes apart:
(15:28:42)
CPL | avg1 0.12 | avg5 0.36 | avg15 0.38 |
(15:38:42)
CPL | avg1 8.54 | avg5 3.62 | avg15 1.61 |
(15:48:42)
CPL | avg1 30.14 | avg5 27.09 | avg15 14.73 |
This is the status of the Tomcat process at the last sample (15:48:42):
28891 tomcat7 tomcat7 411 8.68s 70m14s 209.9M 204K 0K 5804K -- - S 5 704% java
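Next time the spike happens I plan to capture per-thread CPU usage and a
thread dump, so the hot threads can be matched to stack traces. A minimal
sketch, assuming jstack from the JDK is on the PATH and using the Tomcat
PID from the atop output above (28891); the thread id 28905 is only a
hypothetical example:

```shell
# Per-thread CPU view of the Tomcat JVM (each thread shows as a row):
#   top -b -H -n1 -p 28891 > top_threads.txt
# Stack dump of the same JVM:
#   jstack 28891 > jstack.txt
# top/ps print thread ids in decimal, while jstack tags each thread
# with "nid=0x..." in hex, so convert the hottest thread id to match:
printf 'nid=0x%x\n' 28905    # hypothetical hot thread id from top -H
```

Searching jstack.txt for the resulting nid should show whether the busy
threads are GC threads or request-handling threads.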
I noticed similar behaviour on the other Solr nodes. At 17:41 the
on-call person decided to hard-reset all the Solr nodes, and the cloud
came back up running normally after that.
These are the logs I found on the first node:
Aug 17, 2014 3:44:58 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
Aug 17, 2014 3:46:12 PM
org.apache.solr.cloud.OverseerCollectionProcessor run
WARNING: Overseer cannot talk to ZK
Aug 17, 2014 3:46:12 PM
org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader
Then a bunch of:
Aug 17, 2014 3:46:42 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
until the server was rebooted.
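The session-expired errors above suggest the Solr JVM stopped
heartbeating to ZooKeeper for longer than the zkClientTimeout, which is
exactly what a long stop-the-world GC pause would cause. As a stopgap I
could raise that timeout via a system property; a sketch of the extra
JVM option, assuming solr.xml still uses the stock
${zkClientTimeout:15000} placeholder from Solr 4.x:

```shell
# Raise Solr's ZooKeeper session timeout from the 15s default to 30s.
# Only takes effect if solr.xml reads zkClientTimeout from the
# ${zkClientTimeout:15000} placeholder.
JAVA_OPTS="$JAVA_OPTS -DzkClientTimeout=30000"
```

This would only buy headroom for shorter pauses, not fix the underlying
CPU spike.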
On the other nodes I can see:
node2:
Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.103:8080_solr_myappcore=myapp
Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.103:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:46:24 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: IOException occured
when talking to server at: http://node1:8080/solr/myapp
node4:
Aug 17, 2014 3:44:06 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.105:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:44:09 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.105:8080_solr_myappcore=myapp
Aug 17, 2014 3:45:37 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
My impression is that the garbage collector is at fault here.
This is the Tomcat command line:
/usr/lib/jvm/java-7-openjdk-amd64/bin/java
-Djava.util.logging.config.file=/var/lib/tomcat7/conf/logging.properties
-Djava.awt.headless=true -Xmx8192m -XX:+UseConcMarkSweepGC -DnumShards=2
-Djetty.port=8080
-DzkHost=10.215.1.96:2181,10.215.1.97:2181,10.215.1.98:2181
-javaagent:/opt/newrelic/newrelic.jar -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.endorsed.dirs=/usr/share/tomcat7/endorsed -classpath
/usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar
-Dcatalina.base=/var/lib/tomcat7 -Dcatalina.home=/usr/share/tomcat7
-Djava.io.tmpdir=/tmp/tomcat7-tomcat7-tmp
org.apache.catalina.startup.Bootstrap start
So, I am using the concurrent mark-sweep (CMS) collector.
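To confirm or rule out GC, it seems worth turning on GC logging, so
that a full GC or long CMS pause around 15:35 would show up directly in
a log file. A sketch of the extra HotSpot 7 options (the log path is an
assumption; adjust to taste):

```shell
# HotSpot JDK 7 GC logging options; append to the existing JVM flags.
# PrintGCApplicationStoppedTime records total stop-the-world time,
# which is what would break the ZooKeeper session.
JAVA_OPTS="$JAVA_OPTS \
  -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/tomcat7/gc.log"
```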
Do you have any suggestions on how I can debug this further and
potentially eliminate the issue causing these downtimes?