I don't know what caused your initial failure (for me it's a bolt
that wraps PHP, and the PHP process OOMs periodically), but I also had Netty
get jammed up trying to recover from the failure in 0.9.3-rc1, with similar
logging.  For me, patching with
https://github.com/apache/storm/pull/268 (STORM-329)
fixed it.

will

On Wed, Nov 19, 2014 at 11:49 AM, Sheldon White <[email protected]>
wrote:

> We've recently implemented a Storm topology that pulls data from one DB,
> performs some simple transformations and populates another DB. The spout
> and bolt code is all pretty simple and doesn't retain any data (other than
> a DB connection pool). This topology is running through about 400 million
> DB records.
> We have a cluster of 8 VMs running storm (0.9.3) and a single VM running
> zookeeper (3.4.5). The zookeeper node never shows much load.
> We are periodically seeing behavior where zookeeper and the storm
> topology seem to "lose track" of each other. The supervisor.log files show
> a loop of process restarts:
>
> 2014-11-17 00:54:58 b.s.util [INFO] Error when trying to kill 5109. Process is probably already dead.
> 2014-11-17 00:54:59 b.s.util [INFO] Error when trying to kill 5109. Process is probably already dead.
> 2014-11-17 00:54:59 b.s.d.supervisor [INFO] Shut down de55f9bf-d630-4e11-bc8c-9bbaa9899212:d6d231cc-2f2c-4bad-924e-570c6357d97d
> 2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "swhite-test-2014-11-14-13-58-21-3-1416002313", :executors ([5 5] [37 37] [13 13] [21 21] [29 29])} for this supervisor de55f9bf-d630-4e11-bc8c-9bbaa9899212 on port 6702 with id 59b3b5d9-bd9c-4da7-82e4-38172f7dc232
> 2014-11-17 00:54:59 b.s.d.supervisor [INFO] Launching worker with command: 'java' '-server' '-Xmx2048m' '-XX:+UseConcMarkSweepGC' '-XX:+UseParNewGC' '-XX:+UseConcMarkSweepGC' '-XX:NewSize=128m' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:-CMSConcurrentMTEnabled' '-Djava.net.preferIPv4Stack=true' '-Djava.library.path=/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/resources/Linux-amd64:/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/resources:/usr/local/lib:/opt/local/lib:/usr/lib' '-Dlogfile.name=worker-6702.log' '-Dstorm.home=/opt/apache-storm-0.9.3-rc1' '-Dstorm.log.dir=/opt/apache-storm-0.9.3-rc1/logs' '-Dlogback.configurationFile=/opt/apache-storm-0.9.3-rc1/logback/cluster.xml' '-Dstorm.id=swhite-test-2014-11-14-13-58-21-3-1416002313' '-Dworker.id=59b3b5d9-bd9c-4da7-82e4-38172f7dc232' '-Dworker.port=6702' '-cp' '/opt/apache-storm-0.9.3-rc1/lib/math.numeric-tower-0.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/slf4j-api-1.7.5.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-devel-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/disruptor-3.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-6.1.26.jar:/opt/apache-storm-0.9.3-rc1/lib/logback-core-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-jetty-adapter-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/carbonite-1.4.0.jar:/opt/apache-storm-0.9.3-rc1/lib/compojure-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-0.9.3-rc1/lib/chill-java-0.3.5.jar:/opt/apache-storm-0.9.3-rc1/lib/logback-classic-1.0.13.jar:/opt/apache-storm-0.9.3-rc1/lib/minlog-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/kryo-2.21.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.macro-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-exec-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-io-2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/clojure-1.5.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-servlet-0.3.11.jar:/opt/apache-storm-0.9.3-rc1/lib/storm-core-0.9.3-rc1.jar:/opt/apache-storm-0.9.3-rc1/lib/core.incubator-0.1.0.jar:/opt/apache-storm-0.9.3-rc1/lib/asm-4.0.jar:/opt/apache-storm-0.9.3-rc1/lib/jetty-util-6.1.26.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-lang-2.5.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-logging-1.1.3.jar:/opt/apache-storm-0.9.3-rc1/lib/servlet-api-2.5.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-time-0.4.1.jar:/opt/apache-storm-0.9.3-rc1/lib/objenesis-1.2.jar:/opt/apache-storm-0.9.3-rc1/lib/json-simple-1.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jgrapht-core-0.9.0.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-fileupload-1.2.1.jar:/opt/apache-storm-0.9.3-rc1/lib/ring-core-1.1.5.jar:/opt/apache-storm-0.9.3-rc1/lib/snakeyaml-1.11.jar:/opt/apache-storm-0.9.3-rc1/lib/commons-codec-1.6.jar:/opt/apache-storm-0.9.3-rc1/lib/joda-time-2.0.jar:/opt/apache-storm-0.9.3-rc1/lib/clout-1.0.1.jar:/opt/apache-storm-0.9.3-rc1/lib/jline-2.11.jar:/opt/apache-storm-0.9.3-rc1/lib/hiccup-0.3.6.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.cli-0.2.4.jar:/opt/apache-storm-0.9.3-rc1/lib/reflectasm-1.07-shaded.jar:/opt/apache-storm-0.9.3-rc1/lib/tools.logging-0.2.3.jar:/opt/apache-storm-0.9.3-rc1/lib/clj-stacktrace-0.2.2.jar:/opt/apache-storm-0.9.3-rc1/conf:/opt/storm-local/supervisor/stormdist/swhite-test-2014-11-14-13-58-21-3-1416002313/stormjar.jar' 'backtype.storm.daemon.worker' 'swhite-test-2014-11-14-13-58-21-3-1416002313' 'de55f9bf-d630-4e11-bc8c-9bbaa9899212' '6702' '59b3b5d9-bd9c-4da7-82e4-38172f7dc232'
> 2014-11-17 00:54:59 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
> 2014-11-17 00:54:59 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
> 2014-11-17 00:55:00 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
> 2014-11-17 00:55:00 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
> 2014-11-17 00:55:01 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
> 2014-11-17 00:55:01 b.s.d.supervisor [INFO] 59b3b5d9-bd9c-4da7-82e4-38172f7dc232 still hasn't started
> 2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down and clearing state for id 59b3b5d9-bd9c-4da7-82e4-38172f7dc232. Current supervisor time: 1416185702. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1416185701, :storm-id "swhite-test-2014-11-14-13-58-21-3-1416002313", :executors #{[-1 -1]}, :port 6702}
> 2014-11-17 00:55:02 b.s.d.supervisor [INFO] Shutting down de55f9bf-d630-4e11-bc8c-9bbaa9899212:59b3b5d9-bd9c-4da7-82e4-38172f7dc232
> 2014-11-17 00:55:03 b.s.util [INFO] Error when trying to kill 5174. Process is probably already dead.
> ...
>
> Also seeing this in the worker logs when things are stalled out:
> ...
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:20 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [220]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:21 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [221]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:22 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [222]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:23 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [223]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [224]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:24 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [225]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:25 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [226]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:26 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [227]
> /opt/apache-storm-0.9.3-rc1/logs/worker-6703.log.9:2014-11-17 00:49:27 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-storm-ap08.dev.g2/10.10.122.17:6703... [228]
> ...
>
> We've tried increasing the storm.zookeeper timeouts to 5 minutes:
>
> storm.zookeeper.session.timeout = 300000
> storm.zookeeper.connection.timeout = 300000
>
> and this allows the cluster to run much more stably, with good topology
> performance. But it still occasionally seems to stall out, and a manual
> rebalancing request is required to get it rolling. All VMs have fixed DNS
> entries and there don't appear to be any network issues at all.
>
> This behavior is pretty much identical to something reported in August:
> http://mail-archives.apache.org/mod_mbox/storm-user/201408.mbox/%3CFF975F8738DEE84AABF1BF95A4248E6B6A8F5393%40mvexe1.integral.com%3E
> but this thread didn't really get resolved.
>
> Question 1: Is there any reason to think we'd need additional zookeeper
> servers? The one server we're using doesn't show any significant load.
> Question 2: Is it normal to manually rebalance a running topology? It
> would probably work but seems like something's not right.
> Question 3: What other clues should I be looking for?
>
> Thanks in advance
>
> -sheldon white-
> [email protected]
>


