Hello Zachary,

Could you provide nimbus / supervisor logs? Could you use a dedicated ZK
ensemble for that cluster and see if it helps?
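
For example, pointing the 1.x cluster at its own ensemble should only need
a storm.yaml change along these lines (the hostnames below are placeholders
I made up, not real machines):

storm.zookeeper.servers:
    # a dedicated ensemble used by no other Storm cluster (hypothetical hosts)
    - dedicated-zk01.example.com
    - dedicated-zk02.example.com
    - dedicated-zk03.example.com
# default ZK client port; adjust if your ensemble listens elsewhere
storm.zookeeper.port: 2181
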
Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Aug 16, 2017 at 2:37 AM, Zachary Smith <[email protected]> wrote:

> Upgraded to 1.1.0, still happening and not sure why. In the supervisor
> logs there's a mention of the assignments path being deleted:
>
> 2017-08-15 11:26:41.688 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent state:SyncConnected type:NodeDeleted path:/assignments/excla-1-1502809921 for sessionid 0x15d96e85f7123d4
>
> However, the nimbus seems fine for the most part. All executors are
> alive, the scheduler indicates all nodes are alive, and then the topology
> just stops. Does anyone have any tips for debugging this problem?
>
> On Tue, Jul 18, 2017 at 2:45 PM, Zachary Smith <[email protected]> wrote:
>
>> I can provide more logs if needed. The nimbus doesn't seem to log
>> anything about this at all; the last relevant entries in its logs are
>> the heartbeat cache and the assignments.
>>
>> On Mon, Jul 17, 2017 at 5:51 PM, Zachary Smith <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am experiencing a problem on a remote cluster which causes a running
>>> topology to be killed after around 15-20 minutes. We are deploying to
>>> a cluster with a nimbus and three worker nodes, each with around 8 GB
>>> of RAM. This cluster shares a ZooKeeper cluster with another Storm
>>> cluster (version 0.9.3), which runs topologies without issue; the
>>> Storm 0.9.3 cluster uses a different ZooKeeper root.
>>>
>>> Nothing else is running on the cluster. There doesn't appear to be any
>>> error in the logs: the supervisor simply states that it's shutting
>>> down the worker, and then the topology is promptly shut down.
>>> According to the supervisor, the worker's state is set to :disallowed.
>>> We have already tried setting
>>>
>>> nimbus.task.timeout.secs: 600
>>> nimbus.supervisor.timeout.secs: 600
>>>
>>> in the storm.yaml, but that did not appear to help. I have enabled GC
>>> logging, but it doesn't appear to log anything. It would be very
>>> surprising to me if system load were causing this, since the topology
>>> does virtually nothing. Can anybody figure out what's going on here?
>>>
>>> Logs will be posted below.
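>>>
>>> For completeness, GC logging was enabled via worker.childopts along
>>> these lines (the heap size and log path are illustrative, not our
>>> exact values):
>>>
>>> # %ID% is replaced with the worker's port by the supervisor
>>> worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/gc-worker-%ID%.log"
>>>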
>>> *Our storm.yaml:*
>>>
>>> nimbus.host: stg1examplestorm2-nimbus01.example.com
>>> nimbus.thrift.port: 6627
>>> nimbus.thrift.max_buffer_size: 32000000
>>> storm.local.dir: /data/storm
>>>
>>> storm.zookeeper.servers:
>>>     - stg1examplezk-storm01.example.com
>>>     - stg1examplezk-storm02.example.com
>>>     - stg1examplezk-storm03.example.com
>>>
>>> supervisor.slots.ports:
>>>     - 6700
>>>     - 6701
>>>     - 6702
>>>     - 6703
>>>     - 6704
>>>     - 6705
>>>     - 6706
>>>     - 6707
>>>     - 6708
>>>     - 6709
>>>     - 6710
>>>     - 6711
>>>     - 6712
>>>     - 6713
>>>     - 6714
>>>     - 6715
>>>     - 6716
>>>     - 6717
>>>     - 6718
>>>     - 6719
>>>     - 6720
>>>
>>> ui.port: 6060
>>>
>>> storm.log.dir: /var/log/storm
>>>
>>> storm.zookeeper.root: /storm_0_10
>>>
>>> *Topology logs:*
>>>
>>> [a lot of this and zookeeper responses]
>>> 2017-07-17 17:03:00.574 b.s.m.n.Client [DEBUG] writing 1 messages to channel [id: 0x6ee64433, /10.202.144.23:42166 => stg1examplestorm2-node02.example.com/10.202.144.20:6720]
>>> 2017-07-17 17:03:00.574 b.s.m.n.Client [DEBUG] writing 1 messages to channel [id: 0x6ee64433, /10.202.144.23:42166 => stg1examplestorm2-node02.example.com/10.202.144.20:6720]
>>> 2017-07-17 17:03:00.575 b.s.m.n.Client [DEBUG] sent 1 messages to Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.575 b.s.m.n.Client [DEBUG] sent 1 messages to Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.613 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got notification sessionid:0x35d4744d5c10011
>>> 2017-07-17 17:03:00.614 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent state:SyncConnected type:NodeDeleted path:/assignments/excla-testcla-1-1500324928 for sessionid 0x35d4744d5c10011
>>> 2017-07-17 17:03:00.615 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace: EventListener - 0 ms
>>> 2017-07-17 17:03:00.615 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace: connection-state-parent-process - 0 ms
>>> 2017-07-17 17:03:00.637 b.s.d.worker [INFO] Shutting down worker excla-testcla-1-1500324928 13751591-813e-4721-ac3d-55084e32768f 6720
>>> 2017-07-17 17:03:00.638 b.s.m.n.Client [INFO] closing Netty Client Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.638 b.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [DEBUG] channel to Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720 closed
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [INFO] closing Netty Client Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [DEBUG] channel to Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720 closed
>>> 2017-07-17 17:03:00.639 b.s.d.worker [INFO] Shutting down receive thread
>>> 2017-07-17 17:03:00.640 b.s.m.n.Client [INFO] creating Netty Client, connecting to stg1examplestorm2-node03.example.com:6720, bufferSize: 5242880
>>> 2017-07-17 17:03:00.640 o.a.s.s.o.a.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (300). Pinning to 29
>>> 2017-07-17 17:03:00.640 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
>>> 2017-07-17 17:03:00.703 b.s.m.loader [INFO] Shutting down receiving-thread: [excla-testcla-1-1500324928, 6720]
>>> 2017-07-17 17:03:00.703 b.s.m.loader [INFO] Waiting for receiving-thread:[excla-testcla-1-1500324928, 6720] to die
>>> 2017-07-17 17:03:00.704 b.s.m.loader [INFO] Shutdown receiving-thread: [excla-testcla-1-1500324928, 6720]
>>> 2017-07-17 17:03:00.704 b.s.d.worker [INFO] Shut down receive thread
>>> 2017-07-17 17:03:00.704 b.s.d.worker [INFO] Terminating messaging context
>>> 2017-07-17 17:03:00.705 b.s.d.worker [INFO] Shutting down executors
>>> 2017-07-17 17:03:00.705 b.s.d.executor [INFO] Shutting down executor word:[12 12]
>>> 2017-07-17 17:03:00.706 b.s.util [INFO] Async loop interrupted!
>>> 2017-07-17 17:03:00.713 b.s.util [INFO] Async loop interrupted!
>>>
>>> etc...
>>>
>>> *Supervisor logs:*
>>>
>>> 2017-07-17 17:02:58.142 b.s.d.supervisor [DEBUG] Syncing processes
>>> 2017-07-17 17:02:58.142 b.s.d.supervisor [DEBUG] Assigned executors: {6720 {:storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [16 16] [13 13] [10 10] [4 4]]}}
>>> 2017-07-17 17:02:58.143 b.s.d.supervisor [DEBUG] Allocated: {"e96fc1c2-eeab-40f1-8639-ede11233415b" [:valid {:time-secs 1500325377, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720}]}
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got notification sessionid:0x35d4744d5c10010
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent state:SyncConnected type:NodeDeleted path:/assignments/excla-testcla-1-1500324928 for sessionid 0x35d4744d5c10010
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got notification sessionid:0x35d4744d5c10010
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/assignments for sessionid 0x35d4744d5c10010
>>> 2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace: EventListener - 0 ms
>>> 2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace: connection-state-parent-process - 0 ms
>>> 2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace: EventListener - 0 ms
>>> 2017-07-17 17:03:00.744 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace: connection-state-parent-process - 0 ms
>>> 2017-07-17 17:03:00.744 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x35d4744d5c10010, packet:: clientPath:null serverPath:null finished:false header:: 482,12 replyHeader:: 482,81609600378,0 request:: '/storm_0_10/assignments,T response:: v{},s{3438131947,34381319647,1491855399679,1491855399679,0,78,0,0,1,0,81609600378}
>>> 2017-07-17 17:03:00.744 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE] Trace: GetChildrenBuilderImpl-Foreground - 1 ms
>>> 2017-07-17 17:03:00.745 b.s.d.supervisor [DEBUG] Synchronizing supervisor
>>> 2017-07-17 17:03:00.745 b.s.d.supervisor [DEBUG] Storm code map: {}
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] Downloaded storm ids: #{"excla-testcla-1-1500324928"}
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] All assignment:
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] New assignment: {}
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] Writing new assignment {}
>>> 2017-07-17 17:03:00.748 b.s.d.supervisor [INFO] Removing code for storm id excla-testcla-1-1500324928
>>> 2017-07-17 17:03:00.750 b.s.util [DEBUG] Rmr path /data/storm/supervisor/stormdist/excla-testcla-1-1500324928
>>> 2017-07-17 17:03:00.753 b.s.u.LocalState [DEBUG] New Local State for /data/storm/workers/e96fc1c2-eeab-40f1-8639-ede11233415b/heartbeats
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Worker e96fc1c2-eeab-40f1-8639-ede11233415b is :disallowed: {:time-secs 1500325379, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720} at supervisor time-secs 1500325380
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Syncing processes
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Assigned executors: {}
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Allocated: {"e96fc1c2-eeab-40f1-8639-ede11233415b" [:disallowed {:time-secs 1500325379, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720}]}
>>> 2017-07-17 17:03:00.755 b.s.d.supervisor [INFO] Shutting down and clearing state for id e96fc1c2-eeab-40f1-8639-ede11233415b. Current supervisor time: 1500325380. State: :disallowed, Heartbeat: {:time-secs 1500325379, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720}
>>> 2017-07-17 17:03:00.755 b.s.d.supervisor [INFO] Shutting down 404f7dd0-363c-4161-a19e-c95e56cbe96e:e96fc1c2-eeab-40f1-8639-ede11233415b
>>> 2017-07-17 17:03:00.755 b.s.config [INFO] GET worker-user e96fc1c2-eeab-40f1-8639-ede11233415b
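
P.S. If switching to a dedicated ensemble takes a while, it may also be
worth double-checking the ZK client settings on that cluster while you
debug. A minimal sketch for storm.yaml, using Storm's usual defaults
(tune these for your environment):

# ZooKeeper client session/retry settings used by the Storm daemons
storm.zookeeper.session.timeout: 20000     # ms before ZK expires the session
storm.zookeeper.connection.timeout: 15000  # ms allowed for the initial connect
storm.zookeeper.retry.times: 5             # Curator retry attempts
storm.zookeeper.retry.interval: 1000       # ms base sleep between retries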
