[jira] [Created] (STORM-3107) Nimbus confused about leadership after crash

2018-06-14 Thread Aaron Gresch (JIRA)
Aaron Gresch created STORM-3107:
---

 Summary: Nimbus confused about leadership after crash
 Key: STORM-3107
 URL: https://issues.apache.org/jira/browse/STORM-3107
 Project: Apache Storm
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Aaron Gresch


Nimbus crashed and restarted without cleanly shutting down its ZooKeeper session, due to a 
deadlock in the timer shutdown code.  The same situation could, however, arise from various 
other failures.  

 

The problem is that once Nimbus restarted, it had a stale, inconsistent view of who the 
leader was:

 
{code:java}
2018-05-24 09:27:21.762 o.a.s.z.LeaderElectorImp main [INFO] Queued up for leader lock.
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping assignments
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping cleanup
2018-05-24 09:27:22.633 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping credential renewal.

2018-05-24 09:27:40.771 o.a.s.d.n.Nimbus pool-37-thread-63 [WARN] Topology submission exception. (topology name='topology-testOverSubscribe-1')
java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
    at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
2018-05-24 09:27:40.771 o.a.s.b.BlobStoreUtils Timer-1 [ERROR] Could not download the blob with key: topology-testOverCapacityScheduling-2-1519992333-stormcode.ser
2018-05-24 09:27:40.771 o.a.t.s.TThreadPoolServer pool-37-thread-63 [ERROR] Error occurred during processing of message.
java.lang.RuntimeException: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
    at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2961) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
    at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    ... 9 more
{code}
The session timeout was set to 20 seconds, yet well beyond that period Nimbus still had 
not recovered leadership.  It had to be restarted manually to recover.
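
Storm's LeaderElectorImp sits on top of Curator's leader-latch recipe. Below is a minimal, 
hypothetical sketch (not Storm's actual code; the connect string, latch path, and id are 
made up) of one way a daemon could re-queue for leadership after a session loss. Without a 
step like this, a latch created before the crash can leave observers seeing a stale leader 
entry, which matches the symptom above:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchRecoverySketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Queue up for the leader lock, as LeaderElectorImp logs on startup.
        LeaderLatch latch = new LeaderLatch(zk, "/leader-lock", "nimbus-host:6627");
        latch.start();

        zk.getConnectionStateListenable().addListener((client, state) -> {
            if (state == ConnectionState.LOST) {
                // The session (and this participant's znode) is gone. Without
                // closing the stale latch and queueing a fresh one, hasLeadership()
                // stays false here while other observers may still see the dead
                // incarnation recorded as the leader.
                try {
                    latch.close();
                    // A fresh LeaderLatch would be created and started here.
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
}
{code}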

 





[jira] [Updated] (STORM-3101) Fix unexpected metrics registration in StormMetricsRegistry

2018-06-14 Thread Zhengdai Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhengdai Hu updated STORM-3101:
---
Summary: Fix unexpected metrics registration in StormMetricsRegistry  (was: 
Select Registry metrics by running daemon)

> Fix unexpected metrics registration in StormMetricsRegistry
> ---
>
> Key: STORM-3101
> URL: https://issues.apache.org/jira/browse/STORM-3101
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-server
>Affects Versions: 2.0.0
>Reporter: Zhengdai Hu
>Assignee: Zhengdai Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Metrics registered through StormMetricsRegistry are all added via static methods on the registry class and attached to a singleton MetricRegistry object per process. Currently most metrics are bound to classes (static fields), so the issue occurs when metrics from irrelevant components are accidentally registered during the class initialization phase.
> For example, a process running the supervisor daemon will incorrectly register nimbus metrics: the BasicContainer class statically imports "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", so initializing BasicContainer triggers initialization of the Nimbus class and all of its metric registrations, even though no nimbus daemon functionality will be used and no nimbus metrics will ever be updated.
> This creates many garbage metrics and makes the metrics hard to read. Therefore we should filter metric registration by the type of daemon that the process actually runs.
> For implementation please see the pull request.
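
A minimal sketch of the class-initialization trap described above (hypothetical stand-in classes, not Storm's actual code; it assumes the Dropwizard MetricRegistry that the description refers to, and the metric name is illustrative). Because the referenced field is not a compile-time constant, merely reading it forces the whole class to initialize, registering its metrics as a side effect:

{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class StaticInitTrap {
    // Stand-in for the per-process singleton inside StormMetricsRegistry.
    static final MetricRegistry REGISTRY = new MetricRegistry();

    static class FakeNimbus {
        // Not a compile-time constant (the initializer is a method call),
        // so any reference to it triggers FakeNimbus class initialization.
        static final String MIN_VERSION_SUPPORT_RPC_HEARTBEAT = "2.0.0".intern();
        // Registered as a side effect of that initialization:
        static final Meter SUBMIT_CALLS =
                REGISTRY.meter("nimbus:num-submitTopology-calls"); // illustrative name
    }

    static class FakeBasicContainer {
        void launch() {
            // Supervisor-side code reading the nimbus constant...
            System.out.println(FakeNimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT);
        }
    }

    public static void main(String[] args) {
        new FakeBasicContainer().launch();
        // ...leaves a nimbus metric registered in this (supervisor) process:
        System.out.println(REGISTRY.getMeters().keySet());
    }
}
{code}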





[jira] [Updated] (STORM-3105) Update storm hive dependency version

2018-06-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-3105:
--
Labels: pull-request-available  (was: )

> Update storm hive dependency version
> 
>
> Key: STORM-3105
> URL: https://issues.apache.org/jira/browse/STORM-3105
> Project: Apache Storm
>  Issue Type: Task
>Reporter: Abhishek
>Priority: Major
>  Labels: pull-request-available
>
> {{storm-hive}} is compiled with {{hive.version}} [{{0.14.0}}|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292], which is quite old. The current stable Hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the Hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.
>  





[jira] [Created] (STORM-3106) Clarify hive version requirement in storm-hive documentation

2018-06-14 Thread Abhishek (JIRA)
Abhishek created STORM-3106:
---

 Summary: Clarify hive version requirement in storm-hive 
documentation
 Key: STORM-3106
 URL: https://issues.apache.org/jira/browse/STORM-3106
 Project: Apache Storm
  Issue Type: Improvement
Reporter: Abhishek


[{{storm-hive}}|https://github.com/apache/storm/blob/4137328b75c06771f84414c3c2113e2d1c757c08/external/storm-hive/pom.xml] uses 3 [dependencies|https://github.com/apache/storm/blob/4137328b75c06771f84414c3c2113e2d1c757c08/external/storm-hive/pom.xml#L54] from {{org.apache.hive.hcatalog}}. All three are compiled with the {{hive.version}} hardcoded [here|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292].

If {{hive.version}} doesn't match the Hive cluster's version, the mismatch can cause issues such as [this|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E]. I see two ways of fixing this:
 * Allow users to specify {{hive.version}} when using the storm-hive dependency.
 * Update the {{storm-hive}} documentation to state that users need to copy the 3 dependencies needed by {{storm-hive}} and specify the correct Hive version for them (see the sketch below).

Open to any other ideas.
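
A rough sketch of the second option in a user's pom.xml (assumptions: the three {{org.apache.hive.hcatalog}} artifactIds are written from memory and should be checked against storm-hive's pom, and {{3.0.0}} stands in for whatever version the Hive cluster actually runs):

{code:xml}
<!-- Drop the hcatalog artifacts that storm-hive pulls in at the old hive.version... -->
<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-hive</artifactId>
  <version>${storm.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hive.hcatalog</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- ...and re-declare them at the cluster's Hive version (illustrative artifactIds). -->
<dependency>
  <groupId>org.apache.hive.hcatalog</groupId>
  <artifactId>hive-hcatalog-core</artifactId>
  <version>3.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hive.hcatalog</groupId>
  <artifactId>hive-hcatalog-streaming</artifactId>
  <version>3.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hive.hcatalog</groupId>
  <artifactId>hive-webhcat-java-client</artifactId>
  <version>3.0.0</version>
</dependency>
{code}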





[jira] [Updated] (STORM-3105) Update storm hive dependency version

2018-06-14 Thread Abhishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek updated STORM-3105:

Description: 
{{storm-hive}} is compiled with {{hive.version}} [{{0.14.0}}|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292], which is quite old. The current stable Hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the Hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.

 

  was:
{{storm-hive}} is compiled with {{hive.version}} {{0.14.0}}, which is quite old. The current stable Hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the Hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.

 


> Update storm hive dependency version
> 
>
> Key: STORM-3105
> URL: https://issues.apache.org/jira/browse/STORM-3105
> Project: Apache Storm
>  Issue Type: Task
>Reporter: Abhishek
>Priority: Major
>
> {{storm-hive}} is compiled with {{hive.version}} [{{0.14.0}}|https://github.com/apache/storm/blob/0eb6b5116f251e17b6f14a61cebfadfc286faa59/pom.xml#L292], which is quite old. The current stable Hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the Hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.
>  





[jira] [Created] (STORM-3105) Update storm hive dependency version

2018-06-14 Thread Abhishek (JIRA)
Abhishek created STORM-3105:
---

 Summary: Update storm hive dependency version
 Key: STORM-3105
 URL: https://issues.apache.org/jira/browse/STORM-3105
 Project: Apache Storm
  Issue Type: Task
Reporter: Abhishek


{{storm-hive}} is compiled with {{hive.version}} {{0.14.0}}, which is quite old. The current stable Hive release is {{3.0.0}}. Using {{storm-hive}} with such an old version causes issues when the Hive cluster version is newer. See this [thread|https://lists.apache.org/thread.html/36f674f74e73f9ddc83b6b2452acfd9773af4e6394c8cb157dd367fd@%3Cuser.storm.apache.org%3E] for one such issue.

 





[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine

2018-06-14 Thread Zhengdai Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhengdai Hu updated STORM-3104:
---
Description: 
In Slot.java, there is a comparison in 
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and the new assignment. The comparison accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION, because the current assignment at this point is highly likely to be null and therefore different from the new assignment (I'm not sure whether that is guaranteed). This delays worker start/restart.
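
A minimal sketch of the suspected misfire (hypothetical names, not the actual Slot.java code): with a null current assignment, any "assignments differ, so localize again" check fires unconditionally:

{code:java}
import java.util.Objects;

public class RelocalizeCheckSketch {
    // Stand-in for the comparison inside handleWaitingForBlobUpdate().
    static boolean shouldGoBackToLocalization(Object currentAssignment,
                                              Object newAssignment) {
        // When currentAssignment is still null (nothing has run in this slot
        // yet), this returns true for every non-null newAssignment, bouncing
        // the state machine back to WAITING_FOR_BLOB_LOCALIZATION even though
        // newAssignment is exactly what is already being localized.
        return !Objects.equals(currentAssignment, newAssignment);
    }

    public static void main(String[] args) {
        Object pending = "test-1-1528927024"; // stand-in for a LocalAssignment
        System.out.println(shouldGoBackToLocalization(null, pending));    // true
        System.out.println(shouldGoBackToLocalization(pending, pending)); // false
    }
}
{code}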

The symptom can be reproduced by launching an empty Storm server and submitting any topology. Here's a log sample (the relevant transition starts at 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG]):

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY 
to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs 
are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING 
LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? 
false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending 
changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from 
WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
2

[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine

2018-06-14 Thread Zhengdai Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhengdai Hu updated STORM-3104:
---
Description: 
In Slot.java, there is a comparison in 
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and the new assignment. The comparison accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION, because the current assignment at this point is highly likely to be null (I'm not sure whether that is guaranteed). This delays worker start/restart.

The symptom can be reproduced by launching an empty Storm server and submitting any topology. Here's a log sample (the relevant transition starts at 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG]):

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY 
to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs 
are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING 
LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? 
false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending 
changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from 
WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
2018-06-13 16:57:12.275 o.a.s.d.s.Sl

[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine

2018-06-14 Thread Zhengdai Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhengdai Hu updated STORM-3104:
---
Description: 
In Slot.java, there is a comparison in 
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and the new assignment. The comparison accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION, because the current assignment at this point is highly likely to be null (I'm not sure whether that is guaranteed). This delays worker start/restart.

The symptom can be reproduced by launching an empty Storm server and submitting any topology. Here's a log sample (the relevant lines are marked with asterisks):

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY 
to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs 
are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING 
LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
*2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? 
false
**2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending 
changes, waiting for them to finish before launching container...
**2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from 
WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
*2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 2018 -> WAITING_FOR_