You can also print GC details to log (the following example is verbose but you can tailor it to your needs):
-Xloggc:/opt/storm/logs/gc-worker-%ID%.log -verbose:gc -XX:GCLogFileSize=10m -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps On Tue, Sep 9, 2014 at 10:15 AM, Cyrille Karmann <[email protected]> wrote: > This line: > Shutting down and clearing state for id a8733f89-9f41-4624-9bce-d9f2d8c449ee. > Current supervisor time: 1410261997. State: :timed-out, > shows that it is a heartbeat time-out. > > In my experience, GC in the worker has been the main reason for time-outs > of the heartbeat. But to be sure, you should try to connect to your worker > with a tool like JVisualVM and observe what is going on. > > That being said, according to your logs, you are using the default GC > algorithm for your workers, which can make the JVM do significant pauses, > even if there is plenty of heap. You should use CMS instead, to limit > pauses. For this, add the following to childopts: > > -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled > > > > > 2014-09-09 7:39 GMT-04:00 Palak Shah <[email protected]>: > > I looked at the supervisor and nimbus logs and confirmed that the topology >> is being rebalanced. It could be because of timeout. Could you help me >> figure out exactly why this timeout occurs and how I can fix it? >> >> I have enabled GC for my workers and supervisor. Could it be the reason >> that worker is not able to send a heartbeat? I tried increasing the heap >> size allotted to each worker by tweaking the value of worker.childopts to >> have "-Xmx768m". But I did not see any difference in behaviour of my >> topology. How can I fix this issue? >> >> here are my supervisor logs - >> >> 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Shutting down and clearing >> state for id a8733f89-9f41-4624-9bce-d9f2d8c449ee. Current supervisor time: >> 1410261997. State: :timed-out, Heartbeat: >> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1410261964, >> :storm-id "popeyeTopology-2-1410260830", :executors #{[4 4] [7 7] [-1 -1] >> [1 1]}, :port 6700} >> 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Shutting down >> 9f26d478-1963-425e-a0c2-139712f32b9e:a8733f89-9f41-4624-9bce-d9f2d8c449ee >> 2014-09-09 16:56:37 b.s.util [INFO] Error when trying to kill 28950. >> Process is probably already dead. >> 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Shut down >> 9f26d478-1963-425e-a0c2-139712f32b9e:a8733f89-9f41-4624-9bce-d9f2d8c449ee >> 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Launching worker with >> assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id >> "popeyeTopology-2-1410260830", :executors ([7 7] [4 4] [1 1])} for this >> supervisor 9f26d478-1963-425e-a0c2-139712f32b9e on port 6700 with id >> 78233b86-b7f3-4590-9709-ba0d71130d7e >> 2014-09-09 16:56:37 b.s.d.supervisor [INFO] Launching worker with >> command: '/usr/lib/jvm/java-7-oracle/bin/java' '-server' '-Xmx768m' >> '-verbose:gc' '-XX:+PrintGCTimeStamps' '-XX:+PrintGCDetails' >> '-Dcom.sun.management.jmxremote' '-Dcom.sun.management.jmxremote.ssl=false' >> '-Dcom.sun.management.jmxremote.authenticate=false' >> '-Dcom.sun.management.jmxremote.port=16700' >> '-Djava.library.path=/tmp/stormtmp/supervisor/stormdist/popeyeTopology-2-1410260830/resources/Linux-amd64:/tmp/stormtmp/supervisor/stormdist/popeyeTopology-2-1410260830/resources:/usr/lib/jvm/java-7-oracle/lib' >> '-Dlogfile.name=worker-6700.log' >> '-Dstorm.home=/home/stormcluster/Storm/apache-storm-0.9.2-incubating' >> '-Dlogback.configurationFile=/home/stormcluster/Storm/apache-storm-0.9.2-incubating/logback/cluster.xml' >> '-Dstorm.id=popeyeTopology-2-1410260830' >> '-Dworker.id=78233b86-b7f3-4590-9709-ba0d71130d7e' '-Dworker.port=6700' >> '-cp' >> '/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/chill-java-0.3.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-exec-1.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-io-2.4.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/joda-time-2.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-servlet-0.3.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/scala-library-2.9.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clout-1.0.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/logback-core-1.0.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/json-simple-1.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/curator-client-2.4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/carbonite-1.4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/asm-4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/httpcore-4.3.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jetty-util-6.1.26.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clj-stacktrace-0.2.4.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/log4j-over-slf4j-1.6.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/netty-3.2.2.Final.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/compojure-1.1.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/zookeeper-3.4.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-devel-0.3.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jgrapht-core-0.9.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/guava-13.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/objenesis-1.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/minlog-1.2.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/core.incubator-0.1.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clj-time-0.4.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/servlet-api-2.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/storm-core-0.9.2-incubating.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-logging-1.1.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/logback-classic-1.0.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/clojure-1.5.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-jetty-adapter-0.3.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/tools.macro-0.1.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/reflectasm-1.07-shaded.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/math.numeric-tower-0.0.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-fileupload-1.2.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/servlet-api-2.5-20081211.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jetty-6.1.26.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/jline-2.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/tools.cli-0.2.4.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/slf4j-api-1.6.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/hiccup-0.3.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-codec-1.6.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/snakeyaml-1.11.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/zmq.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/netty-3.6.3.Final.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/storm-kafka-0.9.2-incubating.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/ring-core-1.1.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/kafka_2.9.2-0.8.1.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/tools.logging-0.2.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/kryo-2.21.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/disruptor-2.10.1.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/httpclient-4.3.3.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/commons-lang-2.5.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/lib/curator-framework-2.4.0.jar:/home/stormcluster/Storm/apache-storm-0.9.2-incubating/conf:/tmp/stormtmp/supervisor/stormdist/popeyeTopology-2-1410260830/stormjar.jar' >> 'backtype.storm.daemon.worker' 'popeyeTopology-2-1410260830' >> '9f26d478-1963-425e-a0c2-139712f32b9e' '6700' >> '78233b86-b7f3-4590-9709-ba0d71130d7e' >> 2014-09-09 16:56:37 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> 2014-09-09 16:56:37 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> 2014-09-09 16:56:38 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> 2014-09-09 16:56:38 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> 2014-09-09 16:56:39 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> 2014-09-09 16:56:39 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> 2014-09-09 16:56:40 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> 2014-09-09 16:56:40 b.s.d.supervisor [INFO] >> 78233b86-b7f3-4590-9709-ba0d71130d7e still hasn't started >> >> >> Thanks, >> Palak >> >> On Mon, Sep 8, 2014 at 6:46 PM, Cyrille Karmann <[email protected] >> > wrote: >> >>> Look at the supervisors logs. There could be a timeout that make a >>> supervisor to force a worker to shut down, triggering a rebalancing. >>> >>> >>> >>> 2014-09-08 7:17 GMT-04:00 Palak Shah <[email protected]>: >>> >>> I have a topology that uses a Kafka spout to read values from a Kafka >>>> queue. I used the kafkaSpout that came with storm-0.9.2-incubating. >>>> >>>> I observed that the nimbus rebalances the topology very often. The >>>> topology suddenly shuts down and starts again with the tasks running on >>>> different machines. I wanted to know why storm nimbus is rebalancing my >>>> topology, so I observed the storm throughput, latency, load on bolts and >>>> even system metrics like cpu and memory utilization, but I could not see a >>>> pattern. >>>> >>>> Can someone explain what are the factors that lead to rebalancing of >>>> topology in storm? >>>> >>>> Thanks, >>>> Palak Shah >>>> >>> >>> >>> >>> -- >>> Cyrille Karmann >>> +1-514-659-1209 >>> [email protected] >>> >> >> > > > -- > Cyrille Karmann > +1-514-659-1209 > [email protected] >
