[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine

2018-06-13 Thread Zhengdai Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhengdai Hu updated STORM-3104:
---
Description: 
In Slot.java, there is a comparison in 
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and its new assignment, which 
accidentally routes a state machine that has just transitioned out of 
WAITING_FOR_BLOB_LOCALIZATION right back to WAITING_FOR_BLOB_LOCALIZATION. This 
happens because the current assignment at this point is highly likely to be 
null (I'm not sure if it's guaranteed), and it delays the worker's 
start/restart.
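
To make the failure mode concrete, here is a minimal self-contained sketch. The 
names are illustrative, not the actual Slot.java code; it only assumes that the 
current assignment is null on a freshly started slot:
{code:java}
import java.util.Objects;

// Illustrative sketch, not Slot.java: shows how a null "current assignment"
// makes the equality check fail and reroute the state machine.
public class SlotSketch {
    enum MachineState { WAITING_FOR_BLOB_LOCALIZATION, WAITING_FOR_BLOB_UPDATE, RUNNING }

    static MachineState handleWaitingForBlobUpdate(String currentAssignment,
                                                   String newAssignment) {
        // On an empty supervisor no container has launched yet, so
        // currentAssignment is null and this comparison can never match.
        if (!Objects.equals(currentAssignment, newAssignment)) {
            // Always taken when currentAssignment == null: the slot bounces
            // back to wait for blob localization all over again.
            return MachineState.WAITING_FOR_BLOB_LOCALIZATION;
        }
        return MachineState.RUNNING; // blobs up to date; the worker may launch
    }

    public static void main(String[] args) {
        // null current assignment vs. a pending assignment: rerouted.
        System.out.println(handleWaitingForBlobUpdate(null, "test-1-1528927024"));
        // matching assignments: the worker would be allowed to launch
        System.out.println(handleWaitingForBlobUpdate("test-1-1528927024",
                                                      "test-1-1528927024"));
    }
}
{code}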

The symptom can be reproduced by launching an empty supervisor and submitting 
any topology. Here's a log sample:

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY 
to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 
6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs 
are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING 
LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION 
msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE 
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs 
[BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to 
pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization 
LocalAssignment(topology_id:test-1-1528927024, 
executors:[ExecutorInfo(task_start:10, task_end:10), 
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, 
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, 
task_end:1), ExecutorInfo(task_start:13, task_end:13)], 
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, 
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, 
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, 
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? 
false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending 
changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from 
WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE 
WAITING_FOR_BLOB_LOCALIZATION msInState: 2018 -> WAITING_FOR_BLOB_UPDATE 
msInState: 0
{code}


[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine

2018-06-13 Thread Zhengdai Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhengdai Hu updated STORM-3104:
---
Summary: Delayed worker launch due to accidental transitioning in state 
machine  (was: Delayed launch due to accidental transitioning in state machine)

> Delayed worker launch due to accidental transitioning in state machine
> --
>
> Key: STORM-3104
> URL: https://issues.apache.org/jira/browse/STORM-3104
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-server
>Affects Versions: 2.0.0
>Reporter: Zhengdai Hu
>Priority: Critical
> Fix For: 2.0.0
>
>


[jira] [Created] (STORM-3104) Delayed launch due to accidental transitioning in state machine

2018-06-13 Thread Zhengdai Hu (JIRA)
Zhengdai Hu created STORM-3104:
--

 Summary: Delayed launch due to accidental transitioning in state 
machine
 Key: STORM-3104
 URL: https://issues.apache.org/jira/browse/STORM-3104
 Project: Apache Storm
  Issue Type: Bug
  Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
 Fix For: 2.0.0



[jira] [Updated] (STORM-3103) nimbus stuck shutting down causing leadership issues on startup

2018-06-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-3103:
--
Labels: pull-request-available  (was: )

> nimbus stuck shutting down causing leadership issues on startup
> ---
>
> Key: STORM-3103
> URL: https://issues.apache.org/jira/browse/STORM-3103
> Project: Apache Storm
>  Issue Type: Bug
>Reporter: Aaron Gresch
>Assignee: Aaron Gresch
>Priority: Major
>  Labels: pull-request-available
>

[jira] [Commented] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0

2018-06-13 Thread Andy Seidel (JIRA)


[ 
https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511559#comment-16511559
 ] 

Andy Seidel commented on STORM-3102:


Opened PRs that remove the check for master, 1.x-branch, and 1.1.x-branch:

https://github.com/apache/storm/pull/2715

https://github.com/apache/storm/pull/2716

https://github.com/apache/storm/pull/2717

> Storm Kafka Client performance issues with Kafka Client v1.0.0
> --
>
> Key: STORM-3102
> URL: https://issues.apache.org/jira/browse/STORM-3102
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-kafka-client
>Affects Versions: 2.0.0, 1.0.6, 1.1.3, 1.2.2
>Reporter: Andy Seidel
>Assignee: Andy Seidel
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Recently I upgraded our storm topology to use the storm-kafka-client instead 
> of storm-kafka.  After the upgrade in our production environment we saw a 
> significant (2x) reduction in our processing throughput.
> We process ~2 kafka messages per second, on a 10-machine kafka 1.0.0 
> server cluster.
> After some investigation, it looks like the issue only occurs when using 
> kafka clients 0.11 or newer.
> In kafka 0.11, the kafka consumer method committed() always blocks to make an 
> external call to get the last committed offsets:
> [https://github.com/apache/kafka/blob/e18335dd953107a61d89451932de33d33c0fd207/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1326-L1351]
> In kafka 0.10.2 the kafka consumer only made the blocking remote call if the 
> partition was not assigned to the consumer:
> [https://github.com/apache/kafka/blob/695596977c7f293513f255e07f5a4b0240a7595c/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1274-L1311]
>  
> The impact of this is that every tuple must make a blocking remote call 
> before being emitted:
> [https://github.com/apache/storm/blob/2dc3d53a11aa3fea62190d1e44fa8b621466/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpout.java#L464-L473]
> Removing this check returns performance to expected levels.
> Looking through the storm-kafka-client code, the impact of ignoring the check 
> is not clear to me.  In our case we want at-least-once processing, but for 
> other processing guarantees the call to kafkaConsumer.committed(tp) is not 
> needed, as the value is only looked at if the processing mode is at least 
> once.
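
For illustration, a hedged sketch of the pattern described above (the helper 
name and exact semantics are assumptions, not the actual KafkaSpout code); the 
point is that each emit pays a blocking committed() round trip on kafka-clients 
0.11+:
{code:java}
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Hypothetical helper, not the actual KafkaSpout code.
public class CommittedCheckSketch {
    // True if the record at 'offset' is already covered by the committed
    // offset and should therefore not be re-emitted.
    static boolean alreadyCommitted(KafkaConsumer<?, ?> consumer,
                                    TopicPartition tp, long offset) {
        // On kafka-clients 0.11+ this is always a blocking broker round trip;
        // on 0.10.2 it could be answered from a cache for owned partitions.
        OffsetAndMetadata committed = consumer.committed(tp);
        return committed != null && offset < committed.offset();
    }
}
{code}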



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0

2018-06-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-3102:
--
Labels: pull-request-available  (was: )



[jira] [Assigned] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0

2018-06-13 Thread Andy Seidel (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Seidel reassigned STORM-3102:
--

Assignee: Andy Seidel



[jira] [Created] (STORM-3103) nimbus stuck shutting down causing leadership issues on startup

2018-06-13 Thread Aaron Gresch (JIRA)
Aaron Gresch created STORM-3103:
---

 Summary: nimbus stuck shutting down causing leadership issues on 
startup
 Key: STORM-3103
 URL: https://issues.apache.org/jira/browse/STORM-3103
 Project: Apache Storm
  Issue Type: Bug
Reporter: Aaron Gresch
Assignee: Aaron Gresch


When debugging a Nimbus NPE that caused restarts, I noticed that a forced halt 
occurred:

 
{code:java}
2018-05-24 09:27:05.569 o.a.z.ClientCnxn 
main-SendThread(openqe82blue-gw.blue.ygrid.yahoo.com:2181) [INFO] Opening 
socket connection to server 
openqe82blue-gw.blue.ygrid.yahoo.com/10.215.77.115:2181. Will attempt to 
SASL-authenticate using Login Context section 'Client'
2018-05-24 09:27:05.570 o.a.z.ClientCnxn 
main-SendThread(openqe82blue-gw.blue.ygrid.yahoo.com:2181) [INFO] Socket 
connection established to 
openqe82blue-gw.blue.ygrid.yahoo.com/10.215.77.115:2181, initiating session
2018-05-24 09:27:05.571 o.a.z.ClientCnxn 
main-SendThread(openqe82blue-gw.blue.ygrid.yahoo.com:2181) [INFO] Session 
establishment complete on server 
openqe82blue-gw.blue.ygrid.yahoo.com/10.215.77.115:2181, sessionid = 
0x1624a86300f7f6b, negotiated timeout = 4
2018-05-24 09:27:05.571 o.a.c.f.s.ConnectionStateManager main-EventThread 
[INFO] State change: CONNECTED
2018-05-24 09:27:05.636 o.a.s.d.n.Nimbus main [INFO] Starting nimbus server for 
storm version '2.0.0.y'
2018-05-24 09:27:06.012 o.a.s.d.n.Nimbus timer [ERROR] Error while processing 
event
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$37(Nimbus.java:2685) 
~[storm-server-2.0.0.y.jar:2.0.0.y]
at org.apache.storm.StormTimer$1.run(StormTimer.java:111) 
~[storm-client-2.0.0.y.jar:2.0.0.y]
at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:227) 
~[storm-client-2.0.0.y.jar:2.0.0.y]
Caused by: java.lang.NullPointerException
at 
org.apache.storm.daemon.nimbus.Nimbus.readAllSupervisorDetails(Nimbus.java:1814)
 ~[storm-server-2.0.0.y.jar:2.0.0.y]
at 
org.apache.storm.daemon.nimbus.Nimbus.computeNewSchedulerAssignments(Nimbus.java:1906)
 ~[storm-server-2.0.0.y.jar:2.0.0.y]
at 
org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2057) 
~[storm-server-2.0.0.y.jar:2.0.0.y]
at 
org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2003) 
~[storm-server-2.0.0.y.jar:2.0.0.y]
at 
org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$37(Nimbus.java:2681) 
~[storm-server-2.0.0.y.jar:2.0.0.y]
... 2 more
2018-05-24 09:27:06.023 o.a.s.u.Utils timer [ERROR] Halting process: Error 
while processing event
java.lang.RuntimeException: Halting process: Error while processing event
at org.apache.storm.utils.Utils.exitProcess(Utils.java:469) 
~[storm-client-2.0.0.y.jar:2.0.0.y]
at org.apache.storm.daemon.nimbus.Nimbus.lambda$new$17(Nimbus.java:484) 
~[storm-server-2.0.0.y.jar:2.0.0.y]
at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:252) 
~[storm-client-2.0.0.y.jar:2.0.0.y]
2018-05-24 09:27:06.032 o.a.s.d.n.Nimbus Thread-12 [INFO] Shutting down master
2018-05-24 09:27:06.032 o.a.s.u.Utils Thread-13 [INFO] Halting after 5 seconds
{code}
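
As an aside, here is a minimal self-contained sketch (illustrative names, not 
Storm's actual code) of the pattern the trace shows: a periodic timer whose 
error handler halts the JVM, so a single NPE in the scheduling loop takes the 
whole Nimbus process down:
{code:java}
import java.util.Timer;
import java.util.TimerTask;

public class HaltingTimerSketch {
    public static void main(String[] args) {
        Timer timer = new Timer("timer");
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    // Stand-in for the NPE thrown inside mkAssignments().
                    throw new NullPointerException();
                } catch (RuntimeException e) {
                    // Mirrors the Utils.exitProcess behavior in the trace:
                    // log the error, then halt the process.
                    System.err.println("Halting process: Error while processing event");
                    Runtime.getRuntime().halt(20);
                }
            }
        }, 0, 10_000);
    }
}
{code}
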
At times this would cause leadership confusion:

 
{code:java}
2018-05-24 09:27:21.762 o.a.s.z.LeaderElectorImp main [INFO] Queued up for 
leader lock.
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping 
assignments
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping 
cleanup
2018-05-24 09:27:22.633 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping 
credential renewal.

2018-05-24 09:27:40.771 o.a.s.d.n.Nimbus pool-37-thread-63 [WARN] Topology 
submission exception. (topology name='topology-testOverSubscribe-1')
java.lang.RuntimeException: not a leader, current leader is 
NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, 
isLeader=true}
at 
org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) 
~[storm-server-2.0.0.y.jar:2.0.0.y]
at 
org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) 
~[storm-server-2.0.0.y.jar:2.0.0.y]
at 
org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454)
 ~[storm-client-2.0.0.y.jar:2.0.0.y]
at 
org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438)
 ~[storm-client-2.0.0.y.jar:2.0.0.y]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
~[libthrift-0.9.3.jar:0.9.3]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
~[libthrift-0.9.3.jar:0.9.3]
at 
org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147)
 ~[storm-client-2.0.0.y.jar:2.0.0.y]
at ...
{code}

[jira] [Commented] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0

2018-06-13 Thread Stig Rohde Døssing (JIRA)


[ 
https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511474#comment-16511474
 ] 

Stig Rohde Døssing commented on STORM-3102:
---

Nice find, great job. The change seems to come from 
https://issues.apache.org/jira/browse/KAFKA-5273. Seems like they removed the 
committed offset cache because commits can also be done by the producer now, so 
it was unsafe to cache the committed offsets.

The check is simply there to ensure that we don't reintroduce a fixed bug 
(https://issues.apache.org/jira/browse/STORM-2666). I felt it made sense to 
have, because it didn't really raise complexity, and it was an extremely cheap 
check. Now that calling KafkaConsumer.committed is expensive, I think the check 
isn't worth it anymore. I don't have a strong opinion on whether we should keep 
it around for testing/debugging and disable it with a flag, or delete it 
outright. I'd probably lean toward deleting it to avoid adding unnecessary 
configuration though.

[~acseidel] If you'd like to make this fix, please open a PR at 
https://github.com/apache/storm/pulls. 
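
If the flag route were chosen, a hedged sketch of what gating the check could 
look like (the flag and class names are hypothetical, not an existing Storm 
config):
{code:java}
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Hypothetical guard, not an existing Storm class or config key.
public class GuardedCommittedCheck {
    private final boolean verifyCommittedOffsets; // debug/test-only flag

    public GuardedCommittedCheck(boolean verifyCommittedOffsets) {
        this.verifyCommittedOffsets = verifyCommittedOffsets;
    }

    boolean shouldSkipEmit(KafkaConsumer<?, ?> consumer, TopicPartition tp, long offset) {
        if (!verifyCommittedOffsets) {
            // Production default: avoid the expensive committed() round trip.
            return false;
        }
        OffsetAndMetadata committed = consumer.committed(tp);
        return committed != null && offset < committed.offset();
    }
}
{code}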



[jira] [Updated] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0

2018-06-13 Thread Stig Rohde Døssing (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stig Rohde Døssing updated STORM-3102:
--
Affects Version/s: 2.0.0



[jira] [Updated] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0

2018-06-13 Thread Stig Rohde Døssing (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stig Rohde Døssing updated STORM-3102:
--
Affects Version/s: 1.2.2




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0

2018-06-13 Thread Andy Seidel (JIRA)
Andy Seidel created STORM-3102:
--

 Summary: Storm Kafka Client performance issues with Kafka Client 
v1.0.0
 Key: STORM-3102
 URL: https://issues.apache.org/jira/browse/STORM-3102
 Project: Apache Storm
  Issue Type: Bug
  Components: storm-kafka-client
Affects Versions: 1.1.3, 1.0.6
Reporter: Andy Seidel


Recently I upgraded our storm topology to use storm-kafka-client instead of 
storm-kafka.  After the upgrade we saw a significant (2x) reduction in our 
processing throughput in our production environment.

We process ~2 kafka messages per second on a 10-machine kafka 1.0.0 server 
cluster.

After some investigation, it looks like the issue only occurs when using kafka 
clients 0.11 or newer.

In kafka 0.11, the consumer method committed() always blocks to make an 
external call to get the last committed offsets:

[https://github.com/apache/kafka/blob/e18335dd953107a61d89451932de33d33c0fd207/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1326-L1351]

In kafka 0.10.2 the consumer only made the blocking remote call if the 
partition was not assigned to the consumer:

[https://github.com/apache/kafka/blob/695596977c7f293513f255e07f5a4b0240a7595c/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1274-L1311]

The impact is that every tuple requires a blocking remote call before it can 
be emitted:

[https://github.com/apache/storm/blob/2dc3d53a11aa3fea62190d1e44fa8b621466/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpout.java#L464-L473]

Removing this check returns performance to expected levels.

Looking through the storm-kafka-client code, the impact of ignoring the check 
is not clear to me.  In our case we want at-least-once processing, but for 
other processing guarantees the call to kafkaConsumer.committed(tp) is not 
needed, as the value is only consulted when the processing mode is 
at-least-once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (STORM-3101) Select Registry metrics by running daemon

2018-06-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated STORM-3101:
--
Labels: pull-request-available  (was: )

> Select Registry metrics by running daemon
> -
>
> Key: STORM-3101
> URL: https://issues.apache.org/jira/browse/STORM-3101
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-server
>Affects Versions: 2.0.0
>Reporter: Zhengdai Hu
>Assignee: Zhengdai Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Metrics registered through StormMetricRegistry are all added via static 
> methods on the registry class and attached to a singleton MetricRegistry 
> object per process. Currently most metrics are bound to classes (static 
> fields), so the issue occurs when metrics from irrelevant components are 
> accidentally registered during the class initialization phase.
> For example, a process running the supervisor daemon will incorrectly 
> register metrics from nimbus when the BasicContainer class is initialized: 
> it statically imports 
> "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", 
> which triggers initialization of the Nimbus class and all of its metric 
> registrations, even though no nimbus functionality will be used and no 
> nimbus metrics will be updated.
> This creates many garbage metrics and makes the metrics hard to read. We 
> should therefore filter metric registration by the type of daemon the 
> process actually runs.
> For the implementation, please see the pull request; a rough sketch of the 
> idea follows below.
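A minimal sketch of that filtering, assuming a hypothetical DaemonType enum 
and registry wrapper. Only the Dropwizard MetricRegistry/Meter types are real; 
Storm's actual StormMetricRegistry change is in the linked pull request and 
may look different.

{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

// Illustrative per-process registry that ignores registrations arriving from
// class initialization of daemons this process does not run.
final class DaemonScopedRegistry {
    enum DaemonType { NIMBUS, SUPERVISOR, WORKER }

    private static final MetricRegistry REGISTRY = new MetricRegistry();
    private static volatile DaemonType runningDaemon;

    // Called once at daemon startup, before metric-registering classes load.
    static void start(DaemonType type) {
        runningDaemon = type;
    }

    // Register the meter only if it belongs to the daemon this process runs;
    // otherwise hand back a detached Meter so call sites stay unchanged.
    static Meter meter(DaemonType owner, String name) {
        if (owner != runningDaemon) {
            return new Meter(); // updated but never reported: no garbage metric
        }
        return REGISTRY.meter(name);
    }
}
{code}

With this shape, the supervisor example above would call 
DaemonScopedRegistry.start(DaemonType.SUPERVISOR) at launch, and any nimbus 
meters touched during class initialization would become no-ops instead of 
garbage entries in the shared registry.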



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (STORM-3094) Topology name needs to be validated at storm-client

2018-06-13 Thread Manohar Vanam (JIRA)


[ 
https://issues.apache.org/jira/browse/STORM-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511194#comment-16511194
 ] 

Manohar Vanam commented on STORM-3094:
--

Thanks [~kabhwan], I've raised a pull request against 1.x-branch.

> Topology name needs to be validated at storm-client
> ---
>
> Key: STORM-3094
> URL: https://issues.apache.org/jira/browse/STORM-3094
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-client
>Reporter: Manohar Vanam
>Assignee: Manohar Vanam
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: patch.diff
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> *Current Behavior:* Submitting a topology with an invalid topology name 
> throws an exception only after the jar has been uploaded.
> *Improvement:* Validate the topology name on the client side before 
> uploading the jar (a sketch of such a check follows the quoted log below).
>  
>  
> {code:java}
> 2018-06-05 16:16:19.461 o.a.s.d.n.Nimbus pool-21-thread-53 [INFO] Uploading 
> file from client to 
> /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.596 o.a.s.d.n.Nimbus pool-21-thread-35 [INFO] Finished 
> uploading file from client: 
> /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.624 o.a.s.d.n.Nimbus pool-21-thread-29 [INFO] Received 
> topology submission for test-[123] (storm-2.0.0-SNAPSHOT JDK-1.8.0_162) with 
> conf {topology.users=[null], topology.acker.executors=null, 
> storm.zookeeper.superACL=null, topology.workers=3, 
> topology.submitter.principal=, topology.debug=true, topology.name=test-[123], 
> topology.kryo.register={}, storm.id=test-[123]-7-1528195580, 
> topology.kryo.decorators=[], topology.eventlogger.executors=0, 
> topology.submitter.user=mvanam, topology.max.task.parallelism=null}
> 2018-06-05 16:16:20.624 o.a.s.d.n.Nimbus pool-21-thread-29 [INFO] uploadedJar 
> /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.624 o.a.s.b.BlobStore pool-21-thread-29 [ERROR] 
> 'test-[123]-7-1528195580-stormjar.jar' does not appear to be valid. It must 
> match ^[\w \t\._-]+$. And it can't be ".", "..", null or empty string.
> 2018-06-05 16:16:20.625 o.a.s.b.BlobStore pool-21-thread-29 [ERROR] 
> 'test-[123]-7-1528195580-stormconf.ser' does not appear to be valid. It must 
> match ^[\w \t\._-]+$. And it can't be ".", "..", null or empty string.
> 2018-06-05 16:16:20.626 o.a.s.d.n.Nimbus pool-21-thread-29 [WARN] Topology 
> submission exception. (topology name='test-[123]')
> java.lang.IllegalArgumentException: test-[123]-7-1528195580-stormconf.ser 
> does not appear to be a valid blob key
>  at org.apache.storm.blobstore.BlobStore.validateKey(BlobStore.java:66) 
> ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]{code}
>  
>  
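The blob-key pattern in the log suggests what a client-side check could look 
like. A minimal sketch, assuming the name is simply validated against the same 
rule before the upload starts; the class and method names are illustrative, 
not the actual storm-client change.

{code:java}
import java.util.regex.Pattern;

// Illustrative pre-upload check mirroring the BlobStore key rule from the log.
final class TopologyNameValidator {
    private static final Pattern VALID = Pattern.compile("^[\\w \\t\\._-]+$");

    static void validate(String name) {
        if (name == null || name.isEmpty() || ".".equals(name) || "..".equals(name)
                || !VALID.matcher(name).matches()) {
            throw new IllegalArgumentException("'" + name
                + "' is not a valid topology name: it must match " + VALID.pattern()
                + " and cannot be \".\", \"..\", null or empty");
        }
    }
}
{code}

Called before the upload, a name like test-[123] would fail fast on the 
client instead of after the jar transfer, since '[' and ']' fall outside the 
allowed character class.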



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (STORM-3094) Topology name needs to be validated at storm-client

2018-06-13 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/STORM-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved STORM-3094.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Thanks [~ManoharVanam], I merged this into master.

If you would like to port this back to 1.x, please raise a pull request against 
1.x-branch.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)