More information (this time from my Zookeeper cluster):

The Zookeeper logs seem pretty calm except for this fairly repetitive
message:

2015-04-13 16:22:36,000 [myid:1] - INFO  [SessionTracker:ZooKeeperServer@347]
- Expiring session 0x34cb217be0500b6, timeout of 20000ms exceeded
2015-04-13 16:22:36,001 [myid:1] - INFO  [ProcessThread(sid:1
cport:-1)::PrepRequestProcessor@494] - Processed session termination for
sessionid: 0x34cb217be0500b6
2015-04-13 16:22:55,379 [myid:1] - INFO  [ProcessThread(sid:1
cport:-1)::PrepRequestProcessor@494] - Processed session termination for
sessionid: 0x34cb217be0500b7
2015-04-13 16:22:55,406 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection
from /[STORM_SUPERVISOR_IP]:51441
2015-04-13 16:22:55,412 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection request from old
client /[STORM_SUPERVISOR_IP]:51441; will be dropped if server is in r-o
mode
2015-04-13 16:22:55,413 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish
new session at /[STORM_SUPERVISOR_IP]:51441
2015-04-13 16:22:55,414 [myid:1] - INFO
 [CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14cb217be0300ca with negotiated timeout 20000 for client
/[STORM_SUPERVISOR_IP]:51441
2015-04-13 16:22:56,222 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x14cb217be0300ca, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2015-04-13 16:22:56,223 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
client /[STORM_SUPERVISOR_IP]:51441 which had sessionid 0x14cb217be0300ca
2015-04-13 16:23:05,827 [myid:1] - INFO  [ProcessThread(sid:1
cport:-1)::PrepRequestProcessor@494] - Processed session termination for
sessionid: 0x34cb217be0500b8
2015-04-13 16:23:05,868 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection
from /[STORM_SUPERVISOR_IP]:51443
2015-04-13 16:23:05,873 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection request from old
client /[STORM_SUPERVISOR_IP]:51443; will be dropped if server is in r-o
mode

Where STORM_SUPERVISOR_IP is the IP of one of the Storm supervisors
(probably didn't need to say that).

Following on from that - running `echo stat | nc localhost 2181` on one of
the Zookeeper nodes shows connections from localhost and from the Kafka
cluster but none from the Storm cluster. Running the Zookeeper CLI (exact
commands below) shows that a /storm znode has been created with the
following children:

[workerbeats, errors, supervisors, storms, assignments]

Don't know if that helps at all.
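
For reference, the checks above were roughly these (run from one of the
Zookeeper nodes; the zkCli.sh path is just an example - it depends on where
Zookeeper is installed):

# 'stat' is a Zookeeper four-letter word: prints server mode, latency and
# the currently connected clients
echo stat | nc localhost 2181

# list the children of the /storm root that Storm creates in Zookeeper
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 ls /storm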

On 13 April 2015 at 17:12, Matthew Sharpe <[email protected]> wrote:

> Not long at all - certainly less than a couple of minutes but longer than
> a few seconds; the couple I actually measured lasted 42 seconds and then
> about 5 seconds. So the heartbeats folder behaves like so...
>
> /app/storm/workers/[uuid] is created with an empty heartbeats folder
> inside it
> /app/storm/workers/[uuid]/heartbeats has two timestamps added (a couple of
> examples:
>
> -rw-r--r-- 1 storm storm    0 Apr 13 16:03 1428941026297.version
> -rw-r--r-- 1 storm storm 2.2K Apr 13 16:03 1428941026297
> -rw-r--r-- 1 storm storm    0 Apr 13 16:03 1428941027240.version
> -rw-r--r-- 1 storm storm 2.2K Apr 13 16:03 1428941027240
>
> -rw-r--r-- 1 storm storm    0 Apr 13 16:09 1428941398235.version
> -rw-r--r-- 1 storm storm 2.2K Apr 13 16:09 1428941398235
> -rw-r--r-- 1 storm storm    0 Apr 13 16:09 1428941399163.version
> -rw-r--r-- 1 storm storm 2.2K Apr 13 16:09 1428941399163
> )
>
> /app/storm/workers/[uuid] is deleted.
>
> I've not seen any heartbeats folder with more than 2 heartbeats in it.
>
>
>
> On 13 April 2015 at 17:00, Nathan Leung <[email protected]> wrote:
>
>> It only matters what user the supervisor is run as, so you should be ok.
>> How long is it between the worker being started and it getting killed?
>> What are the timestamps of the heartbeats in the corresponding worker
>> folder?
>>
>> On Mon, Apr 13, 2015 at 11:47 AM, Matthew Sharpe <
>> [email protected]> wrote:
>>
>>> I'm running Storm as a (specially created) storm user - on both the
>>> nimbus and the supervisors the process shows as being run by the storm
>>> user. And I've got the Storm local directory set to /app/storm - owned by
>>> the storm user with permissions 700. I thought it might be due to me
>>> submitting the job as a different user, but I've just terminated the old
>>> job and submitted the new one as the storm user and am still not getting
>>> any tuples coming through.
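>>>
>>> For reference, the local dir was set up along these lines (writing the
>>> commands from memory, so treat them as approximate):
>>>
>>> sudo mkdir -p /app/storm
>>> sudo chown -R storm:storm /app/storm
>>> sudo chmod 700 /app/storm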
>>>
>>> For what it's worth, I've looked in the local data directory on the
>>> supervisors, and in the workers folder there is a (non-empty) heartbeats
>>> folder being created - I don't know if that's indicative of heartbeats
>>> being processed correctly or if it's just created at the start of a worker
>>> process.
>>>
>>> Is there anything else you want to see that would help you diagnose
>>> this? I can share the UI with you if you think that'd help?
>>>
>>> Thanks!
>>>
>>>
>>> On 13 April 2015 at 16:22, Nathan Leung <[email protected]> wrote:
>>>
>>>> Sounds like it's having problems with heartbeats.  As I recall, worker
>>>> -> supervisor heartbeats are written using the local filesystem.  What
>>>> user are you running as?  In your storm.yaml file, you will see the option
>>>> "storm.local.dir".  Does the user the supervisor / workers are running as
>>>> have permission to write to this folder?
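>>>>
>>>> For example, something along these lines on one of the supervisor boxes
>>>> (the paths here are only examples - substitute your own Storm home and
>>>> whatever storm.local.dir points at):
>>>>
>>>> grep storm.local.dir /usr/local/storm/conf/storm.yaml
>>>> ls -ld /app/storm
>>>> sudo -u storm touch /app/storm/perm-test
>>>> sudo -u storm rm /app/storm/perm-test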
>>>>
>>>> On Mon, Apr 13, 2015 at 11:17 AM, Matthew Sharpe <
>>>> [email protected]> wrote:
>>>>
>>>>> New bonus bit of information - the supervisor log has got this going
>>>>> on:
>>>>>
>>>>> 2015-04-13 15:13:48 b.s.d.supervisor [INFO]
>>>>> 3b092266-c3a0-499b-b4e9-bd2c895c5231 still hasn't started
>>>>> 2015-04-13 15:14:20 b.s.d.supervisor [INFO] Shutting down and clearing
>>>>> state for id 3b092266-c3a0-499b-b4e9-bd2c895c5231. Current supervisor 
>>>>> time:
>>>>> 1428938060. State: :timed-out, Heartbeat:
>>>>> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1428938029,
>>>>> :storm-id "production-topology-2-1428932653", :executors #{[2 2] [3 3] [4
>>>>> 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14]
>>>>> [15 15] [-1 -1] [1 1]}, :port 6703}
>>>>> 2015-04-13 15:14:20 b.s.d.supervisor [INFO] Shutting down
>>>>> bd340179-5905-4233-ad5d-44b47b228177:3b092266-c3a0-499b-b4e9-bd2c895c5231
>>>>> 2015-04-13 15:14:20 b.s.util [INFO] Error when trying to kill 5284.
>>>>> Process is probably already dead.
>>>>> 2015-04-13 15:14:20 b.s.d.supervisor [INFO] Shut down
>>>>> bd340179-5905-4233-ad5d-44b47b228177:3b092266-c3a0-499b-b4e9-bd2c895c5231
>>>>> 2015-04-13 15:14:20 b.s.d.supervisor [INFO] Launching worker with
>>>>> assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
>>>>> "production-topology-2-1428932653", :executors ([2 2] [3 3] [4 4] [5 5] [6
>>>>> 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [15 15] [1
>>>>> 1])} for this supervisor bd340179-5905-4233-ad5d-44b47b228177 on port 6703
>>>>> with id 786a62ac-2868-4526-a7b2-5c26234bb23e
>>>>> 2015-04-13 15:14:20 b.s.d.supervisor [INFO] Launching worker with
>>>>> command: java -server -Xmx768m -Djava.net.preferIPv4Stack=true
>>>>>  -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib
>>>>> -Dlogfile.name=worker-6703.log -Dstorm.home=/usr/local/storm-0.9.0-wip21
>>>>> -Dlogback.configurationFile=/usr/local/storm-0.9.0-wip21/logback/cluster.xml
>>>>> -Dstorm.id=production-topology-2-1428932653
>>>>> -Dworker.id=786a62ac-2868-4526-a7b2-5c26234bb23e -Dworker.port=6703 -cp
>>>>> /usr/local/storm-0.9.0-wip21/storm-console-logging-0.9.0-wip21.jar:/usr/local/storm-0.9.0-wip21/storm-core-0.9.0-wip21.jar:/usr/local/storm-0.9.0-wip21/storm-netty-0.9.0-wip21.jar:/usr/local/storm-0.9.0-wip21/lib/jetty-6.1.26.jar:/usr/local/storm-0.9.0-wip21/lib/jline-0.9.94.jar:/usr/local/storm-0.9.0-wip21/lib/zookeeper-3.3.3.jar:/usr/local/storm-0.9.0-wip21/lib/curator-framework-1.0.1.jar:/usr/local/storm-0.9.0-wip21/lib/logback-core-1.0.6.jar:/usr/local/storm-0.9.0-wip21/lib/servlet-api-2.5.jar:/usr/local/storm-0.9.0-wip21/lib/tools.macro-0.1.0.jar:/usr/local/storm-0.9.0-wip21/lib/logback-classic-1.0.6.jar:/usr/local/storm-0.9.0-wip21/lib/commons-io-1.4.jar:/usr/local/storm-0.9.0-wip21/lib/objenesis-1.2.jar:/usr/local/storm-0.9.0-wip21/lib/slf4j-api-1.6.5.jar:/usr/local/storm-0.9.0-wip21/lib/minlog-1.2.jar:/usr/local/storm-0.9.0-wip21/lib/curator-client-1.0.1.jar:/usr/local/storm-0.9.0-wip21/lib/hiccup-0.3.6.jar:/usr/local/storm-0.9.0-wip21/lib/asm-4.0.jar:/usr/local/storm-0.9.0-wip21/lib/junit-3.8.1.jar:/usr/local/storm-0.9.0-wip21/lib/tools.logging-0.2.3.jar:/usr/local/storm-0.9.0-wip21/lib/kryo-2.17.jar:/usr/local/storm-0.9.0-wip21/lib/httpcore-4.1.jar:/usr/local/storm-0.9.0-wip21/lib/clout-1.0.1.jar:/usr/local/storm-0.9.0-wip21/lib/reflectasm-1.07-shaded.jar:/usr/local/storm-0.9.0-wip21/lib/disruptor-2.10.1.jar:/usr/local/storm-0.9.0-wip21/lib/commons-exec-1.1.jar:/usr/local/storm-0.9.0-wip21/lib/joda-time-2.0.jar:/usr/local/storm-0.9.0-wip21/lib/core.incubator-0.1.0.jar:/usr/local/storm-0.9.0-wip21/lib/servlet-api-2.5-20081211.jar:/usr/local/storm-0.9.0-wip21/lib/ring-devel-0.3.11.jar:/usr/local/storm-0.9.0-wip21/lib/math.numeric-tower-0.0.1.jar:/usr/local/storm-0.9.0-wip21/lib/ring-core-1.1.5.jar:/usr/local/storm-0.9.0-wip21/lib/ring-jetty-adapter-0.3.11.jar:/usr/local/storm-0.9.0-wip21/lib/jgrapht-0.8.3.jar:/usr/local/storm-0.9.0-wip21/lib/commons-logging-1.1.1.jar:/usr/local/storm-0.9.0-wip21/lib/commons-fileupload-1.2.1.jar:/usr/local/storm-0.9.0-wip21/lib/guava-13.0.jar:/usr/local/storm-0.9.0-wip21/lib/log4j-over-slf4j-1.6.6.jar:/usr/local/storm-0.9.0-wip21/lib/ring-servlet-0.3.11.jar:/usr/local/storm-0.9.0-wip21/lib/jetty-util-6.1.26.jar:/usr/local/storm-0.9.0-wip21/lib/mockito-all-1.9.5.jar:/usr/local/storm-0.9.0-wip21/lib/clojure-1.4.0.jar:/usr/local/storm-0.9.0-wip21/lib/libthrift7-0.7.0-2.jar:/usr/local/storm-0.9.0-wip21/lib/snakeyaml-1.11.jar:/usr/local/storm-0.9.0-wip21/lib/netty-3.6.3.Final.jar:/usr/local/storm-0.9.0-wip21/lib/json-simple-1.1.jar:/usr/local/storm-0.9.0-wip21/lib/clj-stacktrace-0.2.2.jar:/usr/local/storm-0.9.0-wip21/lib/commons-lang-2.5.jar:/usr/local/storm-0.9.0-wip21/lib/commons-codec-1.4.jar:/usr/local/storm-0.9.0-wip21/lib/clj-time-0.4.1.jar:/usr/local/storm-0.9.0-wip21/lib/carbonite-1.5.0.jar:/usr/local/storm-0.9.0-wip21/lib/compojure-1.1.3.jar:/usr/local/storm-0.9.0-wip21/lib/jzmq-2.1.0.jar:/usr/local/storm-0.9.0-wip21/lib/httpclient-4.1.1.jar:/usr/local/storm-0.9.0-wip21/lib/tools.cli-0.2.2.jar:/usr/local/storm-0.9.0-wip21/conf:/app/storm/supervisor/stormdist/production-topology-2-1428932653/stormjar.jar
>>>>> backtype.storm.daemon.worker production-topology-2-1428932653
>>>>> bd340179-5905-4233-ad5d-44b47b228177 6703
>>>>> 786a62ac-2868-4526-a7b2-5c26234bb23e
>>>>> 2015-04-13 15:14:20 b.s.d.supervisor [INFO]
>>>>> 786a62ac-2868-4526-a7b2-5c26234bb23e still hasn't started
>>>>>
>>>>> On 13 April 2015 at 16:09, Matthew Sharpe <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> OK, in checking the worker process logs I've uncovered a bit more
>>>>>> information. The UI reports 0 emitted from the wordGenerator spout, along
>>>>>> with 0 transferred, acked and failed.
>>>>>>
>>>>>> Sorry - the *.compute.amazonaws.com is redacted (I don't know the
>>>>>> usual way of doing that); I've got a bunch of other stuff on the boxes,
>>>>>> and in troubleshooting things I have been a bit laxer with security than
>>>>>> I should have been!
>>>>>>
>>>>>> The UI shows 5 workers all starting on the same node - each of which
>>>>>> has a declared uptime of 1s. Looking at the logs on the node, during that
>>>>>> time the following happens:
>>>>>>
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server
>>>>>> environment:java.library.path=/usr/local/lib:/opt/local/lib:/usr/lib
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server
>>>>>> environment:java.io.tmpdir=/tmp
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server
>>>>>> environment:java.compiler=<NA>
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server environment:
>>>>>> os.name=Linux
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server
>>>>>> environment:os.arch=amd64
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server
>>>>>> environment:os.version=3.13.0-48-generic
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server environment:
>>>>>> user.name=storm
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server
>>>>>> environment:user.home=/app/home/storm
>>>>>> 2015-04-13 14:58:48 o.a.z.s.ZooKeeperServer [INFO] Server
>>>>>> environment:user.dir=/
>>>>>> 2015-04-13 14:58:50 b.s.d.worker [INFO] Launching worker for
>>>>>> production-topology-2-1428932653 on
>>>>>> bd340179-5905-4233-ad5d-44b47b228177:6702 with id
>>>>>> ecbd85cc-c495-4f4d-af28-c3277e3e7531 and conf {"dev.zookeeper.path"
>>>>>> "/tmp/dev-storm-zookeeper", "topology.tick.tuple.freq.secs" nil,
>>>>>> "topology.builtin.metrics.bucket.size.secs" 60,
>>>>>> "topology.fall.back.on.java.serialization" true,
>>>>>> "topology.max.error.report.per.interval" 5, "zmq.linger.millis" 5000,
>>>>>> "topology.skip.missing.kryo.registrations" false, "ui.childopts" 
>>>>>> "-Xmx768m
>>>>>> -Djava.net.preferIPv4Stack=true", "storm.zookeeper.session.timeout" 
>>>>>> 20000,
>>>>>> "nimbus.reassign" true, "topology.trident.batch.emit.interval.millis" 
>>>>>> 500,
>>>>>> "nimbus.monitor.freq.secs" 10, "logviewer.childopts" "-Xmx128m",
>>>>>> "java.library.path" "/usr/local/lib:/opt/local/lib:/usr/lib",
>>>>>> "topology.executor.send.buffer.size" 1024, "storm.local.dir" 
>>>>>> "/app/storm",
>>>>>> "supervisor.worker.start.timeout.secs" 120,
>>>>>> "topology.enable.message.timeouts" true, "nimbus.cleanup.inbox.freq.secs"
>>>>>> 600, "nimbus.inbox.jar.expiration.secs" 3600, "drpc.worker.threads" 64,
>>>>>> "topology.worker.shared.thread.pool.size" 4, "nimbus.host" "*.
>>>>>> compute.amazonaws.com", "storm.zookeeper.port" 2181,
>>>>>> "transactional.zookeeper.port" nil, 
>>>>>> "topology.executor.receive.buffer.size"
>>>>>> 1024, "transactional.zookeeper.servers" nil, "storm.zookeeper.root"
>>>>>> "/storm", "storm.zookeeper.retry.intervalceiling.millis" 30000,
>>>>>> "supervisor.enable" true, "storm.zookeeper.servers" ["*.
>>>>>> compute.amazonaws.com" "*.compute.amazonaws.com" "*.
>>>>>> compute.amazonaws.com"], "transactional.zookeeper.root"
>>>>>> "/transactional", "topology.acker.executors" nil,
>>>>>> "topology.transfer.buffer.size" 1024, "topology.worker.childopts" nil,
>>>>>> "drpc.queue.size" 128, "worker.childopts" "-Xmx768m
>>>>>> -Djava.net.preferIPv4Stack=true", "supervisor.heartbeat.frequency.secs" 
>>>>>> 5,
>>>>>> "topology.error.throttle.interval.secs" 10, "zmq.hwm" 0, "drpc.port" 
>>>>>> 3772,
>>>>>> "supervisor.monitor.frequency.secs" 3, "drpc.childopts" "-Xmx768m",
>>>>>> "topology.receiver.buffer.size" 8, "task.heartbeat.frequency.secs" 3,
>>>>>> "topology.tasks" nil, "topology.spout.wait.strategy"
>>>>>> "backtype.storm.spout.SleepSpoutWaitStrategy", 
>>>>>> "topology.max.spout.pending"
>>>>>> nil, "storm.zookeeper.retry.interval" 1000, "
>>>>>> topology.sleep.spout.wait.strategy.time.ms" 1,
>>>>>> "nimbus.topology.validator"
>>>>>> "backtype.storm.nimbus.DefaultTopologyValidator", 
>>>>>> "supervisor.slots.ports"
>>>>>> [6700 6701 6702 6703], "topology.debug" false, "nimbus.task.launch.secs"
>>>>>> 120, "nimbus.supervisor.timeout.secs" 60, "topology.message.timeout.secs"
>>>>>> 30, "task.refresh.poll.secs" 10, "topology.workers" 1,
>>>>>> "supervisor.childopts" "-Djava.net.preferIPv4Stack=true",
>>>>>> "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05,
>>>>>> "worker.heartbeat.frequency.secs" 1, "topology.tuple.serializer"
>>>>>> "backtype.storm.serialization.types.ListDelegateSerializer",
>>>>>> "topology.disruptor.wait.strategy"
>>>>>> "com.lmax.disruptor.BlockingWaitStrategy", "nimbus.task.timeout.secs" 30,
>>>>>> "storm.zookeeper.connection.timeout" 15000, "topology.kryo.factory"
>>>>>> "backtype.storm.serialization.DefaultKryoFactory", 
>>>>>> "drpc.invocations.port"
>>>>>> 3773, "logviewer.port" 8000, "zmq.threads" 1, 
>>>>>> "storm.zookeeper.retry.times"
>>>>>> 5, "storm.thrift.transport"
>>>>>> "backtype.storm.security.auth.SimpleTransportPlugin",
>>>>>> "topology.state.synchronization.timeout.secs" 60,
>>>>>> "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs"
>>>>>> 600, "storm.messaging.transport" "backtype.storm.messaging.zmq",
>>>>>> "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port"
>>>>>> 8080, "nimbus.childopts" "-Xmx1024m -Djava.net.preferIPv4Stack=true",
>>>>>> "storm.cluster.mode" "distributed", "topology.optimize" true,
>>>>>> "topology.max.task.parallelism" nil}
>>>>>> 2015-04-13 14:58:50 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
>>>>>> 2015-04-13 14:58:50 o.a.z.ZooKeeper [INFO] Initiating client
>>>>>> connection, connectString=*.compute.amazonaws.com:2181,*.
>>>>>> compute.amazonaws.com:2181,*.compute.amazonaws.com:2181
>>>>>> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@585bd6a6
>>>>>> 2015-04-13 14:58:50 o.a.z.ClientCnxn [INFO] Opening socket connection
>>>>>> to server *.compute.amazonaws.com/*:2181
>>>>>> 2015-04-13 14:58:50 o.a.z.ClientCnxn [INFO] Socket connection
>>>>>> established to *.compute.amazonaws.com/*:2181, initiating session
>>>>>> 2015-04-13 14:58:50 o.a.z.ClientCnxn [INFO] Session establishment
>>>>>> complete on server *.compute.amazonaws.com/*:2181, sessionid =
>>>>>> 0x34cb217be05005e, negotiated timeout = 20000
>>>>>> 2015-04-13 14:58:50 b.s.zookeeper [INFO] Zookeeper state update:
>>>>>> :connected:none
>>>>>> 2015-04-13 14:58:50 o.a.z.ZooKeeper [INFO] Session: 0x34cb217be05005e
>>>>>> closed
>>>>>> 2015-04-13 14:58:50 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
>>>>>> 2015-04-13 14:58:50 o.a.z.ZooKeeper [INFO] Initiating client
>>>>>> connection, connectString=*.compute.amazonaws.com:2181,*.
>>>>>> compute.amazonaws.com:2181,*.compute.amazonaws.com:2181/storm
>>>>>> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@4b492c47
>>>>>> 2015-04-13 14:58:50 o.a.z.ClientCnxn [INFO] EventThread shut down
>>>>>> 2015-04-13 14:58:50 o.a.z.ClientCnxn [INFO] Opening socket connection
>>>>>> to server *.compute.amazonaws.com/*:2181
>>>>>> 2015-04-13 14:58:50 o.a.z.ClientCnxn [INFO] Socket connection
>>>>>> established to *.compute.amazonaws.com/*:2181, initiating session
>>>>>> 2015-04-13 14:58:50 o.a.z.ClientCnxn [INFO] Session establishment
>>>>>> complete on server *.compute.amazonaws.com/*:2181, sessionid =
>>>>>> 0x24cb217be030072, negotiated timeout = 20000
>>>>>> 2015-04-13 14:58:50 b.s.m.TransportFactory [INFO] Storm peer
>>>>>> transport plugin:backtype.storm.messaging.zmq
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loading executor counter:[2
>>>>>> 2]
>>>>>>
>>>>>>
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loaded executor tasks
>>>>>> wordGenerator:[11 11]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Finished loading executor
>>>>>> wordGenerator:[11 11]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opening spout
>>>>>> wordGenerator:(11)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opened spout
>>>>>> wordGenerator:(11)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Activating spout
>>>>>> wordGenerator:(11)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loading executor
>>>>>> wordGenerator:[12 12]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator
>>>>>> __system ["startup"]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loaded executor tasks
>>>>>> wordGenerator:[12 12]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Finished loading executor
>>>>>> wordGenerator:[12 12]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opening spout
>>>>>> wordGenerator:(12)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opened spout
>>>>>> wordGenerator:(12)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Activating spout
>>>>>> wordGenerator:(12)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loading executor
>>>>>> wordGenerator:[13 13]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator
>>>>>> __system ["startup"]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loaded executor tasks
>>>>>> wordGenerator:[13 13]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Finished loading executor
>>>>>> wordGenerator:[13 13]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opening spout
>>>>>> wordGenerator:(13)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opened spout
>>>>>> wordGenerator:(13)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Activating spout
>>>>>> wordGenerator:(13)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loading executor
>>>>>> wordGenerator:[14 14]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator
>>>>>> __system ["startup"]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loaded executor tasks
>>>>>> wordGenerator:[14 14]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Finished loading executor
>>>>>> wordGenerator:[14 14]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [bertels]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opening spout
>>>>>> wordGenerator:(14)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opened spout
>>>>>> wordGenerator:(14)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Activating spout
>>>>>> wordGenerator:(14)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:11, stream: default, id: {}, [bertels]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [mike]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:12, stream: default, id: {}, [mike]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [bertels]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:13, stream: default, id: {}, [bertels]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loading executor
>>>>>> wordGenerator:[15 15]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator
>>>>>> __system ["startup"]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loaded executor tasks
>>>>>> wordGenerator:[15 15]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Finished loading executor
>>>>>> wordGenerator:[15 15]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opening spout
>>>>>> wordGenerator:(15)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Opened spout
>>>>>> wordGenerator:(15)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Activating spout
>>>>>> wordGenerator:(15)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loading executor
>>>>>> __system:[-1 -1]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: __system __system
>>>>>> ["startup"]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loaded executor tasks
>>>>>> __system:[-1 -1]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Finished loading executor
>>>>>> __system:[-1 -1]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Preparing bolt __system:(-1)
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Loading executor __acker:[1
>>>>>> 1]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [jackson]
>>>>>> 2015-04-13 14:58:51 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [mike]
>>>>>> 2015-04-13 14:58:51 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:11, stream: default, id: {}, [jackson]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:14, stream: default, id: {}, [mike]
>>>>>> 2015-04-13 14:58:52 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [jackson]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:12, stream: default, id: {}, [jackson]
>>>>>> 2015-04-13 14:58:52 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [golda]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:13, stream: default, id: {}, [golda]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Prepared bolt __system:(-1)
>>>>>> 2015-04-13 14:58:52 b.s.d.task [INFO] Emitting: __acker __system
>>>>>> ["startup"]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Loaded executor tasks
>>>>>> __acker:[1 1]
>>>>>> 2015-04-13 14:58:52 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [jackson]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Finished loading executor
>>>>>> __acker:[1 1]
>>>>>> 2015-04-13 14:58:52 b.s.d.worker [INFO] Launching receive-thread for
>>>>>> bd340179-5905-4233-ad5d-44b47b228177:6702
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Preparing bolt __acker:(1)
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Prepared bolt __acker:(1)
>>>>>> 2015-04-13 14:58:52 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [nathan]
>>>>>> 2015-04-13 14:58:52 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [bertels]
>>>>>> 2015-04-13 14:58:52 b.s.d.task [INFO] Emitting: wordGenerator default
>>>>>> [golda]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:11, stream: default, id: {}, [golda]
>>>>>> 2015-04-13 14:58:52 b.s.d.executor [INFO] Processing received message
>>>>>> source: wordGenerator:12, stream: default, id: {}, [nathan]
>>>>>>
>>>>>> After that, the logs stop (no more messages) and the UI says that the
>>>>>> 5 executors are now running on another node (and port) - at which point
>>>>>> a worker log is created on that node for that port (with the same
>>>>>> messages).
>>>>>>
>>>>>> So the worker seems to start and begins emitting words, but the UI
>>>>>> isn't registering anything as having been emitted. After a short period
>>>>>> of time the worker seems to stop without logging anything, and so a new
>>>>>> worker is started.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 13 April 2015 at 15:36, Nathan Leung <[email protected]> wrote:
>>>>>>
>>>>>>> Did you check the worker process logs?  What are the emit counts for
>>>>>>> the components in the UI?
>>>>>>>
>>>>>>> Also I have not run on AWS personally but the following line seems
>>>>>>> strange to me:
>>>>>>>
>>>>>>> :node->host {"bd340179-5905-4233-ad5d-44b47b228177" "*.
>>>>>>> compute.amazonaws.com"}
>>>>>>>
>>>>>>> Generally this would be bound to a specific machine, not *.<domain>.
>>>>>>>
>>>>>>> On Mon, Apr 13, 2015 at 10:22 AM, Matthew Sharpe <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've set up a small Storm cluster on AWS that is giving me
>>>>>>>> headaches. At the moment I've got a situation where the cluster (a
>>>>>>>> nimbus node and 3 supervisors) will successfully run my particular
>>>>>>>> topology in local mode. However, as soon as I try to run the topology
>>>>>>>> in distributed mode there are no errors (that I can see) in any of the
>>>>>>>> logs - the topology appears in the UI but never shows any tuples
>>>>>>>> emitted by the spouts.
>>>>>>>>
>>>>>>>> In order to check it wasn't my topology, I cloned the storm-starter
>>>>>>>> project and tried the RollingTopWords example - running locally I get
>>>>>>>> a stream of output, but as soon as I specify that it should run in a
>>>>>>>> distributed fashion I just get a success message (topology successfully
>>>>>>>> submitted). The topology appears in the UI as successfully submitted
>>>>>>>> but I see no progress.
>>>>>>>>
>>>>>>>> It's a bit of a weird one, as I don't know which logs to share to
>>>>>>>> show the problem - the whole point is that I'm not seeing any errors!
>>>>>>>>
>>>>>>>> The last entry in the supervisor logs pertains to the setup of the
>>>>>>>> supervisor (i.e. it hasn't logged anything about the topologies that
>>>>>>>> have been submitted).
>>>>>>>>
>>>>>>>> The last entries in the nimbus logs are probably a bit more
>>>>>>>> illuminating:
>>>>>>>>
>>>>>>>> 2015-04-13 14:09:25 b.s.d.nimbus [INFO] Executor
>>>>>>>> production-topology-2-1428932653:[2 2] not alive
>>>>>>>> 2015-04-13 14:09:25 b.s.d.nimbus [INFO] Executor
>>>>>>>> production-topology-2-1428932653:[3 3] not alive
>>>>>>>> 2015-04-13 14:09:25 b.s.d.nimbus [INFO] Executor
>>>>>>>> production-topology-2-1428932653:[4 4] not alive
>>>>>>>> 2015-04-13 14:09:25 b.s.s.EvenScheduler [INFO] Available slots:
>>>>>>>> (["bd340179-5905-4233-ad5d-44b47b228177" 6702]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6701]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6700]
>>>>>>>> ["9de97b6f-3d3f-4797-8ac0-e89501de48d3" 6703]
>>>>>>>> ["9de97b6f-3d3f-4797-8ac0-e89501de48d3" 6702]
>>>>>>>> ["9de97b6f-3d3f-4797-8ac0-e89501de48d3" 6701]
>>>>>>>> ["9de97b6f-3d3f-4797-8ac0-e89501de48d3" 6700]
>>>>>>>> ["2874d18d-e0d1-46e0-a50d-85ce645ec4aa" 6703]
>>>>>>>> ["2874d18d-e0d1-46e0-a50d-85ce645ec4aa" 6702]
>>>>>>>> ["2874d18d-e0d1-46e0-a50d-85ce645ec4aa" 6701]
>>>>>>>> ["2874d18d-e0d1-46e0-a50d-85ce645ec4aa" 6700])
>>>>>>>> 2015-04-13 14:09:25 b.s.d.nimbus [INFO] Reassigning
>>>>>>>> production-topology-2-1428932653 to 1 slots
>>>>>>>> 2015-04-13 14:09:25 b.s.d.nimbus [INFO] Reassign executors: [[2 2]
>>>>>>>> [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 
>>>>>>>> 13]
>>>>>>>> [14 14] [15 15] [1 1]]
>>>>>>>> 2015-04-13 14:09:25 b.s.d.nimbus [INFO] Setting new assignment for
>>>>>>>> topology id production-topology-2-1428932653:
>>>>>>>> #backtype.storm.daemon.common.Assignment{:master-code-dir
>>>>>>>> "/app/storm/nimbus/stormdist/production-topology-2-1428932653", 
>>>>>>>> :node->host
>>>>>>>> {"bd340179-5905-4233-ad5d-44b47b228177" "*.compute.amazonaws.com"},
>>>>>>>> :executor->node+port {[2 2] ["bd340179-5905-4233-ad5d-44b47b228177" 
>>>>>>>> 6702],
>>>>>>>> [3 3] ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [4 4]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [5 5]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [6 6]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [7 7]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [8 8]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [9 9]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [10 10]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [11 11]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [12 12]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [13 13]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [14 14]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [15 15]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702], [1 1]
>>>>>>>> ["bd340179-5905-4233-ad5d-44b47b228177" 6702]}, 
>>>>>>>> :executor->start-time-secs
>>>>>>>> {[2 2] 1428934165, [3 3] 1428934165, [4 4] 1428934165, [5 5] 
>>>>>>>> 1428934165, [6
>>>>>>>> 6] 1428934165, [7 7] 1428934165, [8 8] 1428934165, [9 9] 1428934165, 
>>>>>>>> [10
>>>>>>>> 10] 1428934165, [11 11] 1428934165, [12 12] 1428934165, [13 13] 
>>>>>>>> 1428934165,
>>>>>>>> [14 14] 1428934165, [15 15] 1428934165, [1 1] 1428934165}}
>>>>>>>>
>>>>>>>> If anybody can help that'd be greatly appreciated!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> --
>>>>>>>> Check out my blog: dogdogfish.com
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Check out my blog: dogdogfish.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Check out my blog: dogdogfish.com
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Check out my blog: dogdogfish.com
>>>
>>
>>
>
>
> --
> Check out my blog: dogdogfish.com
>



-- 
Check out my blog: dogdogfish.com
