This line: Shutting down and clearing state for id a8733f89-9f41-4624-9bce-d9f2d8c449ee. Current supervisor time: 1410261997. State: :timed-out, shows that it is a heartbeat time-out.
In my experience, GC in the worker has been the main reason for time-outs of the heartbeat. But to be sure, you should try to connect to your worker with a tool like JVisualVM and observe what is going on. That being said, according to your logs, you are using the default GC algorithm for your workers, which can make the JVM do significant pauses, even if there is plenty of heap. You should use CMS instead, to limit pauses. For this, add the following to childopts: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled 2014-09-09 7:39 GMT-04:00 Palak Shah <[email protected]>: > I looked at the supervisor and nimbus logs and confirmed that the topology > is being rebalanced. It could be because of timeout. Could you help me > figure out exactly why this timeout occurs and how I can fix it? > > I have enabled GC for my workers and supervisor. Could it be the reason > that worker is not able to send a heartbeat? I tried increasing the heap > size allotted to each worker by tweaking the value of worker.childopts to > have "-Xmx768m". But I did not see any difference in behaviour of my > topology. How can I fix this issue? > > here are my supervisor logs - > > 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Shutting down and clearing > state for id a8733f89-9f41-4624-9bce-d9f2d8c449ee. Current supervisor time: > 1410261997. State: :timed-out, Heartbeat: > #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1410261964, > :storm-id "popeyeTopology-2-1410260830", :executors #{[4 4] [7 7] [-1 -1] > [1 1]}, :port 6700} > 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Shutting down > 9f26d478-1963-425e-a0c2-139712f32b9e:a8733f89-9f41-4624-9bce-d9f2d8c449ee > 2014-09-09 16:56:37 b.s.util [INFO] Error when trying to kill 28950. > Process is probably already dead. > 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Shut down > 9f26d478-1963-425e-a0c2-139712f32b9e:a8733f89-9f41-4624-9bce-d9f2d8c449ee > 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Launching worker with > assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id > "popeyeTopology-2-1410260830", :executors ([7 7] [4 4] [1 1])} for this > supervisor 9f26d478-1963-425e-a0c2-139712f32b9e on port 6700 with id > 78233b86-b7f3-4590-9709-ba0d71130d7e > 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Launching worker with command: > '/usr/lib/jvm/java-7-oracle/bin/java' '-server' '-Xmx768m' '-verbose:gc' > '-XX:+PrintGCTimeStamps' '-XX:+PrintGCDetails' > '-Dcom.sun.management.jmxremote' '-Dcom.sun.management.jmxremote.ssl=false' > '-Dcom.sun.management.jmxremote.authenticate=false' > '-Dcom.sun.management.jmxremote.port=16700' > '-Djava.library.path=/tmp/stormtmp/supervisor/stormdist/popeyeTopology-2-1410260830/resources/Linux-amd64:/tmp/stormtmp/supervisor/stormdist/popeyeTopology-2-1410260830/resources:/usr/lib/jvm/java-7-oracle/lib' > '-Dlogfile.name=worker-6700.log' > '-Dstorm.home=/home/stormcluster/Storm/apache-storm-0.9.2-incubating' > '-Dlogback.configurationFile=/home/stormcluster/Storm/apache-storm-0.9.2-incubating/logback/cluster.xml' > '-Dstorm.id=popeyeTopology-2-1410260830' > '-Dworker.id=78233b86-b7f3-4590-9709-ba0d71130d7e' '-Dworker.port=6700' > '-cp' > '/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/chill-java-0.3.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-exec-1.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-io-2.4.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/joda-time-2.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-servlet-0.3.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/scala-library-2.9.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clout-1.0.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/logback-core-1.0.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/json-simple-1.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/curator-client-2.4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/carbonite-1.4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/asm-4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/httpcore-4.3.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jetty-util-6.1.26.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clj-stacktrace-0.2.4.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/log4j-over-slf4j-1.6.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/netty-3.2.2.Final.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/compojure-1.1.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/zookeeper-3.4.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-devel-0.3.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jgrapht-core-0.9.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/guava-13.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/objenesis-1.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/minlog-1.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/core.incubator-0.1.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clj-time-0.4.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/servlet-api-2.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/storm-core-0.9.2-incubating.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-logging-1.1.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/logback-classic-1.0.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clojure-1.5.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-jetty-adapter-0.3.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/tools.macro-0.1.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/reflectasm-1.07-shaded.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/math.numeric-tower-0.0.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-fileupload-1.2.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/servlet-api-2.5-20081211.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jetty-6.1.26.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jline-2.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/tools.cli-0.2.4.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/slf4j-api-1.6.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/hiccup-0.3.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-codec-1.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/snakeyaml-1.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/zmq.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/netty-3.6.3.Final.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/storm-kafka-0.9.2-incubating.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-core-1.1.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/kafka_2.9.2-0.8.1.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/tools.logging-0.2.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/kryo-2.21.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/disruptor-2.10.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/httpclient-4.3.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-lang-2.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/curator-framework-2.4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/conf:/tmp/stormtmp/supervisor/stormdist/popeyeTopology-2-1410260830/stormjar.jar' > 'backtype.storm.daemon.worker' 'popeyeTopology-2-1410260830' > '9f26d478-1963-425e-a0c2-139712f32b9e' '6700' > '78233b86-b7f3-4590-9709-ba0d71130d7e' > 2014-09-09 16:56:37 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > 2014-09-09 16:56:37 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > 2014-09-09 16:56:38 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > 2014-09-09 16:56:38 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > 2014-09-09 16:56:39 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > 2014-09-09 16:56:39 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > 2014-09-09 16:56:40 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > 2014-09-09 16:56:40 b.s.d.supervisor [INFO] > 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started > > > Thanks, > Palak > > On Mon, Sep 8, 2014 at 6:46 PM, Cyrille Karmann <[email protected]> > wrote: > >> Look at the supervisors logs. There could be a timeout that make a >> supervisor to force a worker to shut down, triggering a rebalancing. >> >> >> >> 2014-09-08 7:17 GMT-04:00 Palak Shah <[email protected]>: >> >> I have a topology that uses a Kafka spout to read values from a Kafka >>> queue. I used the kafkaSpout that came with storm-0.9.2-incubating. >>> >>> I observed that the nimbus rebalances the topology very often. The >>> topology suddenly shuts down and starts again with the tasks running on >>> different machines. I wanted to know why storm nimbus is rebalancing my >>> topology, so I observed the storm throughput, latency, load on bolts and >>> even system metrics like cpu and memory utilization, but I could not see a >>> pattern. >>> >>> Can someone explain what are the factors that lead to rebalancing of >>> topology in storm? >>> >>> Thanks, >>> Palak Shah >>> >> >> >> >> -- >> Cyrille Karmann >> +1-514-659-1209 >> [email protected] >> > > -- Cyrille Karmann +1-514-659-1209 [email protected]
