[jira] [Created] (STORM-3107) Nimbus confused about leadership after crash
Aaron Gresch created STORM-3107:
-----------------------------------
Summary: Nimbus confused about leadership after crash
Key: STORM-3107
URL: https://issues.apache.org/jira/browse/STORM-3107
Project: Apache Storm
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Aaron Gresch

Nimbus crashed and restarted without cleanly shutting down its ZooKeeper state, due to a deadlock in the timer shutdown code. The same situation could, however, also arise from various other failures. The problem is that once Nimbus restarted, it was confused about who the leader was:

{code:java}
2018-05-24 09:27:21.762 o.a.s.z.LeaderElectorImp main [INFO] Queued up for leader lock.
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping assignments
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping cleanup
2018-05-24 09:27:22.633 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping credential renewal.
2018-05-24 09:27:40.771 o.a.s.d.n.Nimbus pool-37-thread-63 [WARN] Topology submission exception. (topology name='topology-testOverSubscribe-1')
java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
	at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
	at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
	at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
	at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
	at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
2018-05-24 09:27:40.771 o.a.s.b.BlobStoreUtils Timer-1 [ERROR] Could not download the blob with key: topology-testOverCapacityScheduling-2-1519992333-stormcode.ser
2018-05-24 09:27:40.771 o.a.t.s.TThreadPoolServer pool-37-thread-63 [ERROR] Error occurred during processing of message.
java.lang.RuntimeException: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
	at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2961) ~[storm-server-2.0.0.y.jar:2.0.0.y]
	at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
	at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
	at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
	at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
	at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
	... 9 more
{code}

The session timeout was set to 20 seconds, and although we exceeded that period, Nimbus did not recover leadership. It had to be restarted manually to recover.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
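For context, the expected recovery path can be sketched as follows. This is a hypothetical simulation, not Storm or ZooKeeper code: a ZooKeeper-style leader latch keeps the crashed session's node at the head of the queue until the session times out, after which the restarted instance should become leader. The bug reported here is that this hand-off never happened even after the timeout.

```java
// Hypothetical simulation (names and structure are illustrative only) of a
// ZK-style leader latch: the head of the queue holds leadership. A leader
// that crashes WITHOUT closing its session leaves a stale node queued,
// and the restarted process must wait behind it until session expiry.
import java.util.ArrayDeque;
import java.util.Deque;

public class StaleLeaderSketch {
    // Candidate session ids in latch order; head is the current leader.
    static final Deque<String> latchQueue = new ArrayDeque<>();

    static String currentLeader() {
        return latchQueue.peekFirst();
    }

    public static void main(String[] args) {
        latchQueue.addLast("nimbus-session-1"); // original Nimbus holds the lock
        // Nimbus crashes without closing the session: its node is NOT removed.
        // The restarted process joins with a fresh session and queues behind it.
        latchQueue.addLast("nimbus-session-2");
        System.out.println(currentLeader()); // still the dead session
        // Only when ZooKeeper expires the stale session (the 20s timeout here)
        // should leadership pass to the restarted instance:
        latchQueue.remove("nimbus-session-1");
        System.out.println(currentLeader()); // the restarted Nimbus
    }
}
```

In the incident above, the second step of this sketch (recovery after session expiry) is what failed to occur.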
[jira] [Updated] (STORM-3101) Fix unexpected metrics registration in StormMetricsRegistry
[ https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3101:
-------------------------------
Summary: Fix unexpected metrics registration in StormMetricsRegistry  (was: Select Registry metrics by running daemon)

> Fix unexpected metrics registration in StormMetricsRegistry
> -----------------------------------------------------------
>
>                 Key: STORM-3101
>                 URL: https://issues.apache.org/jira/browse/STORM-3101
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-server
>    Affects Versions: 2.0.0
>            Reporter: Zhengdai Hu
>            Assignee: Zhengdai Hu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Metrics registered through StormMetricsRegistry are all added via static methods on the registry class and attached to a singleton MetricRegistry object per process. Because most metrics are currently bound to classes (static), the issue occurs when metrics from irrelevant components are accidentally registered during the class initialization phase.
> For example, a process running the supervisor daemon will incorrectly register nimbus metrics when the BasicContainer class is initialized: it statically imports "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", which triggers initialization of the Nimbus class and all of its metrics registration, even though no nimbus daemon functionality will be used and no nimbus metrics will be updated.
> This creates many garbage metrics and makes the metrics hard to read. We should therefore filter metrics registration by the type of daemon that the process actually runs.
> For implementation details, please see the pull request.
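The class-initialization trigger described in this issue can be sketched as follows. This is a minimal, hypothetical reproduction; the class names, field names, and metric name are illustrative stand-ins, not the real Storm classes:

```java
// Minimal, hypothetical reproduction of the pattern above: reading a
// non-constant static field forces class initialization, whose side effect
// registers metrics into a process-wide registry. Names are illustrative.
import java.util.Set;
import java.util.TreeSet;

public class StaticInitDemo {
    // Stand-in for the per-process singleton StormMetricsRegistry.
    static final Set<String> REGISTRY = new TreeSet<>();

    static class NimbusLike {
        // Stand-in for Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT. Because the
        // value comes from a method call, it is not a compile-time constant,
        // so reading it triggers NimbusLike's class initialization.
        static final String MIN_VERSION = registerAndGet();

        static String registerAndGet() {
            REGISTRY.add("nimbus:num-gauges"); // unwanted side effect
            return "2.0.0";
        }
    }

    public static void main(String[] args) {
        // A supervisor-side class only wants the constant...
        String min = NimbusLike.MIN_VERSION;
        // ...but a nimbus metric is now registered in this process too.
        System.out.println(min + " " + REGISTRY);
    }
}
```

Note that if the field were a true compile-time constant (e.g. `static final String MIN_VERSION = "2.0.0";`), the value would be inlined at the use site and class initialization would not be triggered; it is the non-constant initializer that makes the import pull in the registration side effects.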
[jira] [Updated] (STORM-3105) Update storm hive dependency version
[ https://issues.apache.org/jira/browse/STORM-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated STORM-3105:
----------------------------------
Labels: pull-request-available  (was: )

> Update storm hive dependency version
> ------------------------------------
>
>                 Key: STORM-3105
>                 URL: https://issues.apache.org/jira/browse/STORM-3105
>             Project: Apache Storm
>          Issue Type: Task
>            Reporter: Abhishek
>            Priority: Major
>              Labels: pull-request-available
>
> Storm hive is compiled with {{hive.version}} [{{0.14.0}}|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292], which is quite old. The current stable hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.
[jira] [Created] (STORM-3106) Clarify hive version requirement in storm-hive documentation
Abhishek created STORM-3106:
-------------------------------
Summary: Clarify hive version requirement in storm-hive documentation
Key: STORM-3106
URL: https://issues.apache.org/jira/browse/STORM-3106
Project: Apache Storm
Issue Type: Improvement
Reporter: Abhishek

[storm-hive|https://github.com/apache/storm/blob/4137328b75c06771f84414c3c2113e2d1c757c08/external/storm-hive/pom.xml] uses 3 [dependencies|https://github.com/apache/storm/blob/4137328b75c06771f84414c3c2113e2d1c757c08/external/storm-hive/pom.xml#L54] from {{org.apache.hive.hcatalog}}. All of these dependencies are compiled with the {{hive.version}} hardcoded [here|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292]. If {{hive.version}} doesn't match the hive cluster's version, there can be issues caused by the version mismatch, such as [this|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E]. I see two ways of fixing this:
* Allow users to specify {{hive.version}} when using the storm-hive dependency.
* Update the {{storm-hive}} documentation to state that users need to copy the 3 dependencies needed by {{storm-hive}} and specify the correct hive version for them.

Open to any other ideas.
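As a hedged sketch of what the second option would look like from a user's pom: exclude the hcatalog artifacts that {{storm-hive}} pulls in at the old {{hive.version}}, then re-declare them at the cluster's version. The artifact id below is illustrative only; the exact 3 artifacts should be taken from storm-hive's own pom.

{code:xml}
<!-- Illustrative only: artifact ids are examples, not an authoritative list.
     Wildcard exclusions require Maven 3.2.1+. -->
<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-hive</artifactId>
  <version>${storm.version}</version>
  <exclusions>
    <!-- Drop the hcatalog artifacts compiled against the old hive.version -->
    <exclusion>
      <groupId>org.apache.hive.hcatalog</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Re-declare each excluded artifact at the hive version actually deployed -->
<dependency>
  <groupId>org.apache.hive.hcatalog</groupId>
  <artifactId>hive-hcatalog-streaming</artifactId>
  <version>${my.hive.version}</version>
</dependency>
{code}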
[jira] [Updated] (STORM-3105) Update storm hive dependency version
[ https://issues.apache.org/jira/browse/STORM-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek updated STORM-3105:
----------------------------
Description:
Storm hive is compiled with {{hive.version}} [{{0.14.0}}|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292], which is quite old. The current stable hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.

was:
Storm hive is compiled with {{hive.version}} {{0.14.0}}, which is quite old. The current stable hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.

> Update storm hive dependency version
> ------------------------------------
>
>                 Key: STORM-3105
>                 URL: https://issues.apache.org/jira/browse/STORM-3105
>             Project: Apache Storm
>          Issue Type: Task
>            Reporter: Abhishek
>            Priority: Major
>
> Storm hive is compiled with {{hive.version}} [{{0.14.0}}|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292], which is quite old. The current stable hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.
[jira] [Created] (STORM-3105) Update storm hive dependency version
Abhishek created STORM-3105:
-------------------------------
Summary: Update storm hive dependency version
Key: STORM-3105
URL: https://issues.apache.org/jira/browse/STORM-3105
Project: Apache Storm
Issue Type: Task
Reporter: Abhishek

Storm hive is compiled with {{hive.version}} {{0.14.0}}, which is quite old. The current stable hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3104:
-------------------------------
Description:
In Slot.java, the comparison in

{code:java}
handleWaitingForBlobUpdate()
{code}

between the dynamic state's current assignment and the new assignment accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION back into the waiting states, because the current assignment at this point is highly likely to be null and therefore different from the new assignment (I'm not sure if that is guaranteed). This delays worker start/restart. The symptom can be reproduced by launching an empty Storm cluster and submitting any topology. Here's the log sample (the relevant transition starts at the 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] line):

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
{code}
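The comparison problem can be sketched as follows. This is a hypothetical simplification, not the actual Slot.java logic: with a null current assignment, any non-null new assignment compares unequal, so the "pending changes" branch is taken and a fresh slot is routed back to waiting instead of launching its container.

```java
// Hypothetical simplification of the assignment comparison (names are
// illustrative, not the real Slot.java API): a fresh slot has no current
// assignment, so the null-vs-new comparison always reports a change,
// matching the "== current null ? false" log line above.
import java.util.Objects;

public class SlotComparisonSketch {
    enum Next { WAITING_FOR_BLOB_UPDATE, LAUNCH_CONTAINER }

    static Next decide(String currentAssignment, String newAssignment) {
        if (!Objects.equals(currentAssignment, newAssignment)) {
            // Always taken when currentAssignment is null on a fresh slot,
            // so the slot transitions to WAITING_FOR_BLOB_UPDATE.
            return Next.WAITING_FOR_BLOB_UPDATE;
        }
        return Next.LAUNCH_CONTAINER;
    }

    public static void main(String[] args) {
        // Fresh slot: no current assignment, topology just submitted.
        System.out.println(decide(null, "test-1-1528927024"));
    }
}
```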