I had a similar problem: with Netty, my Trident transactional topology was getting stuck after several occurrences of the Kafka spout restarting due to Kafka SocketTimeouts (I can reproduce this bug by blocking access to Kafka from the Supervisor machines with iptables; only a few tries are needed to trigger it). I reverted to ZMQ and now it works flawlessly.
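In case it helps anyone, the revert was just the messaging transport setting, either in storm.yaml or per topology, roughly like this (the zmq value is from memory for 0.9.1-incubating, so double-check it against your Storm version):

    import backtype.storm.Config;

    Config conf = new Config();
    conf.setNumWorkers(2);
    // revert the inter-worker transport from Netty back to ZeroMQ
    // (value from memory; verify against your storm.yaml / Storm version)
    conf.put(Config.STORM_MESSAGING_TRANSPORT, "backtype.storm.messaging.zmq");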
I'll prepare a reproducible test case and file a JIRA bug report ASAP.

On Jun 17, 2014 3:51 PM, "Romain Leroux" <[email protected]> wrote:

> As I read in different topics, here also simply switching back to ZeroMQ
> solved the issue ...
>
>
> 2014-06-13 21:58 GMT+09:00 Romain Leroux <[email protected]>:
>
>> After tuning a trident topology (kafka->storm->cassandra) to run on 1
>> worker (so on 1 server), it works really well.
>>
>> I tried to deploy it using 2 workers on 1 server or 2 workers on 2
>> servers.
>> The result is the same: nothing happens, no tuples are emitted, and no
>> messages appear in the logs.
>>
>> A quick profiling showed me that:
>>
>> 77% of CPU time is spent in main-SendThread(a.zookeeper.hostname:2181)
>> org.apache.zookeeper.ClientCnxn$SendThread.run()
>> sun.nio.ch.SelectorImpl.select()
>>
>> The rest mainly comes from 2 "New I/O" threads:
>> org.jboss.netty.channel.socket.nio.SelectorUtil.select()
>> sun.nio.ch.SelectorImpl.select()
>>
>> Therefore I am wondering whether the problem comes from one of the
>> following:
>>
>> - The Zookeeper cluster version is 3.4.6, which is different from the
>> 3.3.x used by Storm 0.9.1-incubating?
>> But that is strange, because there is absolutely no problem when using
>> the same settings with only 1 worker.
>>
>> - The communication layer is Netty, which might not work well with my
>> hardware? (Is this possible?)
>> With only 1 worker, Netty doesn't seem to be much involved (no
>> inter-worker communication).
>> Maybe change to ZeroMQ?
>>
>> Has anyone faced a similar issue? Any pointers? Or anything in
>> particular to monitor / profile?
>>
