I don't know about what caused your initial failure (for me it's an bolt that wraps PHP, and the PHP OOMs periodically), but I also had Netty get jammed up trying to recover from the failure in 0.9.3-rc1 with similar logging. For me, patching with https://github.com/apache/storm/pull/268 (STORM-329) fixed it.
will On Wed, Nov 19, 2014 at 11:49 AM, Sheldon White <[email protected]> wrote: > We've recently implemented a Storm topology that pulls data from one DB, > performs some simple transformations and populates another DB. The spout > and bolt code is all pretty simple and doesn't retain any data (other than > a DB connection pool). This topology is running through about 400 million > DB records. > We have a cluster of 8 VMs running storm (0.9.3) and a single VM running > zookeeper (3.4.5). The zookeeper node never shows much load. > We are periodically seeing behavior where the zookeeper and the storm > topology seem to "lose track" of each other. The supervisor.log files shows > a loop of process restarts: > > 2014-11-17 00:54:58 b.s.util [INFO] Error when trying to kill 5109. > Process is probably already dead. > 2014-11-17 00:54:59 b.s.util [INFO] Error when trying to kill 5109. > Process is probably already dead. > 2014-11-17 00:54:59 b.s.d.supervisor [INFO] Shut down > de55f9bf-d630-4e11-bc8c-9bbaa9899212:d6d231cc-2f2c-4bad-924e-570c6357d97d > 2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with > assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id > "swhite-test-2014-11-14-13-58-2 > 1-3-1416002313", :executors ([5 5] [37 37] [13 13] [21 21] [29 29])} for > this supervisor de55f9bf-d630-4e11-bc8c-9bbaa9899212 on port 6702 with id > 59b3b5d9-bd9c-4da7-82 > e4-38172f7dc232 > 2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with command: > 'java' '-server' '-Xmx2048m' '-XX:+UseConcMarkSweepGC' '-XX:+UseParNewGC' > '-XX:+UseConcMarkSw > eepGC' '-XX:NewSize=128m' '-XX:CMSInitiatingOccupancyFraction=70' > '-XX:-CMSConcurrentMTEnabled' '-Djava.net.preferIPv4Stack=true' > '-Djava.library.path=/opt/storm-local/ > > supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/resources/Linux-amd64:/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-14160023 > 13/resources:/usr/local/lib:/opt/local/lib:/usr/lib' > '-Dlogfile.name=worker-6702.log' '-Dstorm.home=/opt/apache-storm-0.9.3-rc1' > '-Dstorm.log.dir=/opt/apache-storm-0.9. > 3-rc1/logs' > '-Dlogback.configurationFile=/opt/apache-storm-0.9.3-rc1/logback/cluster.xml' > '-Dstorm.id=swhite-test-2014-11-14-13-58-21-3-1416002313' > '-Dworker.id=59b3b5d > 9-bd9c-4da7-82e4-38172f7dc232' '-Dworker.port=6702' '-cp' > '/opt/apache-storm-0.9.3-rc1/lib/math.numeric-tower-0.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/slf4j-api-1.7.5. > > jar:/opt/apache-storm-0.9.3-rc1/lib/ring-devel-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/disruptor-3.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-6.1.26.jar:/opt/apac > > he-storm-0.9.3-rc1/lib/logback-core-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-jetty-adapter-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/carbonite-1.4.0.jar:/opt/ap > > ache-storm-0.9.3-rc1/lib/compojure-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-0.9.3-rc1/lib/chill-java-0.3.5.jar:/opt/apache > > -storm-0.9.3-rc1/lib/logback-classic-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/minlog-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/kryo-2.21.jar:/opt/apache-storm-0.9.3-rc1 > > /lib/tools.macro-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-exec-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-io-2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/cloju > > re-1.5.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-servlet-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/storm-core-0.9.3-rc1.jar:/opt/apache-storm-0.9.3-rc1/lib/core.incub > > ator-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/asm-4.0.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-util-6.1.26.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-lang-2.5.jar:/o > > pt/apache-storm-0.9.3-rc1/lib/commons-logging-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/servlet-api-2.5.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-time-0.4.1.jar:/opt/apac > > he-storm-0.9.3-rc1/lib/objenesis-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/json-simple-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jgrapht-core-0.9.0.jar:/opt/apache-storm-0. > > 9.3-rc1/lib/commons-fileupload-1.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-core-1.1.5.jar:/opt/apache-storm-0.9.3-rc1/lib/snakeyaml-1.11.jar:/opt/apache-storm-0.9.3- > > rc1/lib/commons-codec-1.6.jar:/opt/apache-storm-0.9.3-rc1/lib/joda-time-2.0.jar:/opt/apache-storm-0.9.3-rc1/lib/clout-1.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jline-2. > > 11.jar:/opt/apache-storm-0.9.3-rc1/lib/hiccup-0.3.6.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.cli-0.2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/reflectasm-1.07-shaded.jar:/ > > opt/apache-storm-0.9.3-rc1/lib/tools.logging-0.2.3.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-stacktrace-0.2.2.jar:/opt/apache-storm-0.9.3-rc1/conf:/opt/storm-local/superv > isor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/stormjar.jar' > 'backtype.storm.daemon.worker' > 'swhite-test-2014-11-14-13-58-21-3-1416002313' 'de55f9bf-d630-4 > e11-bc8c-9bbaa9899212' '6702' '59b3b5d9-bd9c-4da7-82e4-38172f7dc232' > 2014-11-17 00:54:59 b.s.d.supervisor [INFO] > 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started > 2014-11-17 00:54:59 b.s.d.supervisor [INFO] > 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started > 2014-11-17 00:55:00 b.s.d.supervisor [INFO] > 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started > 2014-11-17 00:55:00 b.s.d.supervisor [INFO] > 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started > 2014-11-17 00:55:01 b.s.d.supervisor [INFO] > 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started > 2014-11-17 00:55:01 b.s.d.supervisor [INFO] > 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started > 2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down and clearing > state for id 59b3b5d9-bd9c-4da7-82e4-38172f7dc232. Current supervisor time: > 1416185702. State: :d > isallowed, Heartbeat: > #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1416185701, > :storm-id "swhite-test-2014-11-14-13-58-21-3-1416002313", :executors #{[-1 > -1 > ]}, :port 6702} > 2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down > de55f9bf-d630-4e11-bc8c-9bbaa9899212:59b3b5d9-bd9c-4da7-82e4-38172f7dc232 > 2014-11-17 00:55:03 b.s.util [INFO] Error when trying to kill 5174. > Process is probably already dead. > ... > > Also seeing this in the worker logs when things are stalled out: > ... > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:20 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [220] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:21 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [221] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:22 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [222] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:23 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [223] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [224] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [225] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:25 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [226] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:26 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [227] > /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:27 > b.s.m.n.Client [INFO] Reconnect started for > Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [228] > ... > > We've tried increasing the storm.zookeeper timeouts to 5 minutes: > > storm.zookeeper.session.timeout = 300000 > storm.zookeeper.connection.timeout = 300000 > > and this allows the cluster to run much more stably, with good topology > performance. But it still occasionally seems to stall out, and a manual > rebalancing request is required to get it rolling. All VMs have fixed DNS > entries and there doesn't seem to be any network issues at all. > > This behavior is pretty much identical to something reported in August: > http://mail-archives.apache.org/mod_mbox/storm-user/201408.mbox/%3CFF975F8738DEE84AABF1BF95A4248E6B6A8F5393%40mvexe1.integral.com%3E > but this thread didn't really get resolved. > > Question 1: Is there any reason to think we'd need additional zookeeper > servers? The one server we're using doen't show any significant load. > Question 2: Is it normal to manually rebalance a running topology? It > would probably work but seems like something's not right. > Question 3: What other clues should I be looking for? > > Thanks in advance > > -sheldon white- > [email protected] > --
