[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3104:
-------------------------------
Description:
In Slot.java, there is a comparison in {code:java} handleWaitingForBlobUpdate() {code} between the dynamic state's current assignment and the new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This happens because the current assignment in this case is highly likely to be null (I'm not sure whether that is guaranteed), and it delays worker start/restart. The symptom can be reproduced by launching an empty supervisor and submitting any topology. Here's a log sample:

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1),
ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2018 ->
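The problematic comparison described above can be illustrated with a small, self-contained sketch. This is a hypothetical simplification, not the actual Slot.java code: the method names, the String stand-in for LocalAssignment, and the reduced state set are all invented for illustration. It shows how an equals()-based check against a null current assignment always reports a "change" and routes the state machine back, and how a null-aware variant avoids the spurious transition.

```java
import java.util.Objects;

// Hypothetical simplification of the assignment comparison -- not Storm's
// actual implementation. A String stands in for LocalAssignment.
public class AssignmentCheckSketch {
    enum State { WAITING_FOR_BLOB_LOCALIZATION, WAITING_FOR_BLOB_UPDATE, RUNNING }

    // equals-based check: when currentAssignment is null (fresh slot),
    // the "assignments differ" branch always fires and routes back.
    static State next(String currentAssignment, String newAssignment) {
        if (!newAssignment.equals(currentAssignment)) {
            return State.WAITING_FOR_BLOB_LOCALIZATION; // spurious round trip
        }
        return State.RUNNING;
    }

    // null-aware variant: treat "no current assignment" as nothing that
    // needs re-localization, so a fresh slot can proceed to launch.
    static State nextNullAware(String currentAssignment, String newAssignment) {
        if (currentAssignment != null && !Objects.equals(newAssignment, currentAssignment)) {
            return State.WAITING_FOR_BLOB_LOCALIZATION;
        }
        return State.RUNNING;
    }

    public static void main(String[] args) {
        // fresh slot: current assignment is null
        System.out.println(next(null, "test-1-1528927024"));          // WAITING_FOR_BLOB_LOCALIZATION
        System.out.println(nextNullAware(null, "test-1-1528927024")); // RUNNING
    }
}
```

With the equals-based check, a worker on an empty supervisor loops back through WAITING_FOR_BLOB_LOCALIZATION instead of launching, which matches the delay seen in the log above.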
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3104:
-------------------------------
Summary: Delayed worker launch due to accidental transitioning in state machine (was: Delayed launch due to accidental transitioning in state machine)

> Delayed worker launch due to accidental transitioning in state machine
> ----------------------------------------------------------------------
>
> Key: STORM-3104
> URL: https://issues.apache.org/jira/browse/STORM-3104
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Critical
> Fix For: 2.0.0
>
> There is a comparison in {code:java} handleWaitingForBlobUpdate() {code} between the dynamic state's current assignment and the new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This happens because the current assignment in this case is highly likely to be null (I'm not sure whether that is guaranteed), and it delays worker start/restart. The symptom can be reproduced by launching an empty supervisor and submitting any topology.
[jira] [Created] (STORM-3104) Delayed launch due to accidental transitioning in state machine
Zhengdai Hu created STORM-3104: -- Summary: Delayed launch due to accidental transitioning in state machine Key: STORM-3104 URL: https://issues.apache.org/jira/browse/STORM-3104 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 There is a comparison in {code:java} handleWaitingForBlobUpdate() {code} between dynamic state's current assignment and new assignment, which accidentally route back state machine just transitioned from WAIT_FOR_BLOB_LOCALIZATION back to WAIT_FOR_BLOB_LOCALIZATION. This is because the current assignment in this case is highly likely to be null (I'm not sure if it's guaranteed) and causes delay for a worker to start/restart. The symptom is able to be reproduced by launching an empty supervisor and submit any topology. Here's the log sample: {code:shell} 2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024 2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are [] 2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), 
ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending... 2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending... 
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275
{code}
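A minimal sketch of the null-unsafe comparison described above, with a null-aware variant. The names are hypothetical stand-ins (the real logic lives in Slot.handleWaitingForBlobUpdate(), and `Object` stands in for Storm's LocalAssignment); this only illustrates the shape of the bug, not the actual Slot.java code:

```java
import java.util.Objects;

// Hypothetical stand-in for the comparison described above; "Assignment" is
// modeled as a plain Object, and the method names are illustrative.
final class SlotTransitionSketch {

    // Buggy shape: on a freshly started slot the current assignment is null, so
    // the "assignment changed" test is true for any new assignment and the state
    // machine is routed from WAITING_FOR_BLOB_LOCALIZATION back to itself.
    static boolean changedBuggy(Object currentAssignment, Object newAssignment) {
        return !Objects.equals(currentAssignment, newAssignment);
    }

    // Null-aware shape: a slot that never had an assignment has nothing to
    // "change from", so it should proceed instead of looping.
    static boolean changedFixed(Object currentAssignment, Object newAssignment) {
        return currentAssignment != null && !currentAssignment.equals(newAssignment);
    }
}
```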
[jira] [Updated] (STORM-3103) nimbus stuck shutting down causing leadership issues on startup
[ https://issues.apache.org/jira/browse/STORM-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated STORM-3103: -- Labels: pull-request-available (was: )
[jira] [Commented] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0
[ https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511559#comment-16511559 ] Andy Seidel commented on STORM-3102:
--
Opened PRs that remove the check for master, 1.x-branch, and 1.1.x-branch:
[https://github.com/apache/storm/pull/2715]
[https://github.com/apache/storm/pull/2716]
[https://github.com/apache/storm/pull/2717]

> Storm Kafka Client performance issues with Kafka Client v1.0.0
> --
>
> Key: STORM-3102
> URL: https://issues.apache.org/jira/browse/STORM-3102
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-kafka-client
> Affects Versions: 2.0.0, 1.0.6, 1.1.3, 1.2.2
> Reporter: Andy Seidel
> Assignee: Andy Seidel
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Recently I upgraded our storm topology to use the storm-kafka-client instead of storm-kafka. After the upgrade, in our production environment we saw a significant (2x) reduction in our processing throughput. We process ~2 kafka messages per second, on a 10 machine kafka 1.0.0 server cluster.
> After some investigation, it looks like the issue only occurs when using kafka clients 0.11 or newer.
> In kafka 0.11, the kafka consumer method committed always blocks to make an external call to get the last committed offsets: [https://github.com/apache/kafka/blob/e18335dd953107a61d89451932de33d33c0fd207/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1326-L1351]
> In kafka 0.10.2 the kafka consumer only made the blocking remote call if the partition is not assigned to the consumer: [https://github.com/apache/kafka/blob/695596977c7f293513f255e07f5a4b0240a7595c/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1274-L1311]
> The impact of this is to require every tuple to make blocking remote calls before being emitted: [https://github.com/apache/storm/blob/2dc3d53a11aa3fea62190d1e44fa8b621466/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpout.java#L464-L473]
> Removing this check returns performance to expected levels.
> Looking through the storm-kafka-client code, it is not clear to me what the impact of ignoring the check is. In our case we want at-least-once processing, but for other processing guarantees the call to kafkaConsumer.committed(tp) is not needed, as the value is only looked at if the processing mode is at-least-once.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
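A minimal sketch of the fix direction discussed in this thread: only pay for the committed-offset lookup when the configured guarantee actually consults it. The names (`EmitGate`, `ProcessingGuarantee`) are illustrative, not storm-kafka-client's API, and the `Supplier` stands in for the blocking `KafkaConsumer.committed(tp)` round trip:

```java
import java.util.OptionalLong;
import java.util.function.Supplier;

// Illustrative sketch only: skip the expensive committed-offset lookup unless
// the processing guarantee requires the already-committed check.
final class EmitGate {
    enum ProcessingGuarantee { AT_LEAST_ONCE, AT_MOST_ONCE, NO_GUARANTEE }

    private final ProcessingGuarantee guarantee;

    EmitGate(ProcessingGuarantee guarantee) { this.guarantee = guarantee; }

    // Returns true if the offset was already committed and the tuple should be
    // skipped; never invokes the broker round trip unless the guarantee needs it.
    boolean alreadyCommitted(long offset, Supplier<OptionalLong> committedOffset) {
        if (guarantee != ProcessingGuarantee.AT_LEAST_ONCE) {
            return false; // nothing looks at the committed offset in these modes
        }
        OptionalLong committed = committedOffset.get(); // the expensive call
        return committed.isPresent() && offset <= committed.getAsLong();
    }
}
```

This mirrors the observation in the description: in kafka-clients >= 0.11 every call to `committed()` is a blocking broker call, so gating it per guarantee avoids a round trip per tuple.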
[jira] [Updated] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0
[ https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated STORM-3102: -- Labels: pull-request-available (was: )
[jira] [Assigned] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0
[ https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Seidel reassigned STORM-3102: -- Assignee: Andy Seidel
[jira] [Created] (STORM-3103) nimbus stuck shutting down causing leadership issues on startup
Aaron Gresch created STORM-3103:
---

Summary: nimbus stuck shutting down causing leadership issues on startup
Key: STORM-3103
URL: https://issues.apache.org/jira/browse/STORM-3103
Project: Apache Storm
Issue Type: Bug
Reporter: Aaron Gresch
Assignee: Aaron Gresch

When debugging a Nimbus NPE that caused restarts, I noticed that a forced halt occurred:

{code:java}
2018-05-24 09:27:05.569 o.a.z.ClientCnxn main-SendThread(openqe82blue-gw.blue.ygrid.yahoo.com:2181) [INFO] Opening socket connection to server openqe82blue-gw.blue.ygrid.yahoo.com/10.215.77.115:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2018-05-24 09:27:05.570 o.a.z.ClientCnxn main-SendThread(openqe82blue-gw.blue.ygrid.yahoo.com:2181) [INFO] Socket connection established to openqe82blue-gw.blue.ygrid.yahoo.com/10.215.77.115:2181, initiating session
2018-05-24 09:27:05.571 o.a.z.ClientCnxn main-SendThread(openqe82blue-gw.blue.ygrid.yahoo.com:2181) [INFO] Session establishment complete on server openqe82blue-gw.blue.ygrid.yahoo.com/10.215.77.115:2181, sessionid = 0x1624a86300f7f6b, negotiated timeout = 4
2018-05-24 09:27:05.571 o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: CONNECTED
2018-05-24 09:27:05.636 o.a.s.d.n.Nimbus main [INFO] Starting nimbus server for storm version '2.0.0.y'
2018-05-24 09:27:06.012 o.a.s.d.n.Nimbus timer [ERROR] Error while processing event
java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$37(Nimbus.java:2685) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.StormTimer$1.run(StormTimer.java:111) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:227) ~[storm-client-2.0.0.y.jar:2.0.0.y]
Caused by: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.Nimbus.readAllSupervisorDetails(Nimbus.java:1814) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.computeNewSchedulerAssignments(Nimbus.java:1906) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2057) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2003) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$37(Nimbus.java:2681) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    ... 2 more
2018-05-24 09:27:06.023 o.a.s.u.Utils timer [ERROR] Halting process: Error while processing event
java.lang.RuntimeException: Halting process: Error while processing event
    at org.apache.storm.utils.Utils.exitProcess(Utils.java:469) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$new$17(Nimbus.java:484) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:252) ~[storm-client-2.0.0.y.jar:2.0.0.y]
2018-05-24 09:27:06.032 o.a.s.d.n.Nimbus Thread-12 [INFO] Shutting down master
2018-05-24 09:27:06.032 o.a.s.u.Utils Thread-13 [INFO] Halting after 5 seconds
{code}

At times this would cause leadership confusion:

{code:java}
2018-05-24 09:27:21.762 o.a.s.z.LeaderElectorImp main [INFO] Queued up for leader lock.
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping assignments
2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping cleanup
2018-05-24 09:27:22.633 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping credential renewal.
2018-05-24 09:27:40.771 o.a.s.d.n.Nimbus pool-37-thread-63 [WARN] Topology submission exception. (topology name='topology-testOverSubscribe-1')
java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
    at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
    at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
    at
{code}
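The "Halting after 5 seconds" line above comes from a watchdog that forcibly halts the JVM when graceful shutdown stalls. A generic sketch of that halt-after-timeout pattern, with illustrative names (this is not Storm's Utils.exitProcess; the halt action is injectable here, where in production it would be `code -> Runtime.getRuntime().halt(code)`):

```java
import java.util.function.IntConsumer;

// Generic halt-after-timeout sketch: a daemon watchdog thread gives graceful
// shutdown a fixed grace period, then forces the process down so a standby
// daemon can take over leadership. Names are illustrative.
final class HaltWatchdog {
    static Thread start(int exitCode, long graceMillis, IntConsumer haltAction) {
        Thread watchdog = new Thread(() -> {
            try {
                Thread.sleep(graceMillis);   // wait out the grace period
            } catch (InterruptedException e) {
                return;                      // graceful shutdown finished first
            }
            haltAction.accept(exitCode);     // shutdown stalled: force it down
        }, "halt-watchdog");
        watchdog.setDaemon(true);            // must never keep the JVM alive itself
        watchdog.start();
        return watchdog;
    }
}
```

The issue described here is the window this creates: if the orderly shutdown hangs, the process lingers until the forced halt, and meanwhile a restarted instance can see confusing leadership state.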
[jira] [Commented] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0
[ https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511474#comment-16511474 ] Stig Rohde Døssing commented on STORM-3102:
---
Nice find, great job. The change seems to come from https://issues.apache.org/jira/browse/KAFKA-5273. It looks like they removed the committed-offset cache because commits can now also be made by the producer, so caching the committed offsets was unsafe.
The check is simply there to ensure that we don't reintroduce a fixed bug (https://issues.apache.org/jira/browse/STORM-2666). I felt it made sense to have, because it didn't really raise complexity and it was an extremely cheap check. Now that calling KafkaConsumer.committed is expensive, I think the check isn't worth it anymore. I don't have a strong opinion on whether we should keep it around for testing/debugging and disable it with a flag, or delete it outright. I'd probably lean toward deleting it to avoid adding unnecessary configuration though.
[~acseidel] If you'd like to make this fix, please open a PR at https://github.com/apache/storm/pulls.
[jira] [Updated] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0
[ https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stig Rohde Døssing updated STORM-3102: -- Affects Version/s: 2.0.0
[jira] [Updated] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0
[ https://issues.apache.org/jira/browse/STORM-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stig Rohde Døssing updated STORM-3102: -- Affects Version/s: 1.2.2
[jira] [Created] (STORM-3102) Storm Kafka Client performance issues with Kafka Client v1.0.0
Andy Seidel created STORM-3102:
--

Summary: Storm Kafka Client performance issues with Kafka Client v1.0.0
Key: STORM-3102
URL: https://issues.apache.org/jira/browse/STORM-3102
Project: Apache Storm
Issue Type: Bug
Components: storm-kafka-client
Affects Versions: 1.1.3, 1.0.6
Reporter: Andy Seidel

Recently I upgraded our storm topology to use the storm-kafka-client instead of storm-kafka. After the upgrade, in our production environment we saw a significant (2x) reduction in our processing throughput. We process ~2 kafka messages per second, on a 10 machine kafka 1.0.0 server cluster.

After some investigation, it looks like the issue only occurs when using kafka clients 0.11 or newer.

In kafka 0.11, the kafka consumer method committed always blocks to make an external call to get the last committed offsets: [https://github.com/apache/kafka/blob/e18335dd953107a61d89451932de33d33c0fd207/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1326-L1351]

In kafka 0.10.2 the kafka consumer only made the blocking remote call if the partition is not assigned to the consumer: [https://github.com/apache/kafka/blob/695596977c7f293513f255e07f5a4b0240a7595c/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1274-L1311]

The impact of this is to require every tuple to make blocking remote calls before being emitted: [https://github.com/apache/storm/blob/2dc3d53a11aa3fea62190d1e44fa8b621466/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpout.java#L464-L473]

Removing this check returns performance to expected levels.

Looking through the storm-kafka-client code, it is not clear to me what the impact of ignoring the check is. In our case we want at-least-once processing, but for other processing guarantees the call to kafkaConsumer.committed(tp) is not needed, as the value is only looked at if the processing mode is at-least-once.
[jira] [Updated] (STORM-3101) Select Registry metrics by running daemon
[ https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated STORM-3101: -- Labels: pull-request-available (was: )

> Select Registry metrics by running daemon
> -
>
> Key: STORM-3101
> URL: https://issues.apache.org/jira/browse/STORM-3101
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Metrics registered through StormMetricRegistry are all added via static methods on the registry class and attached to a singleton MetricRegistry object per process. Currently most metrics are bound to classes (static), so the issue occurs when metrics from irrelevant components are accidentally registered during the class initialization phase.
> For example, a process running the supervisor daemon will incorrectly register metrics from nimbus when the BasicContainer class is initialized and statically imports "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", which triggers initialization of the Nimbus class and registration of all its metrics, even though no nimbus daemon functionality will be used and no nimbus metrics will be updated.
> This creates many garbage metrics and makes metrics hard to read. Therefore we should filter metric registration by the type of daemon that the process actually runs.
> For implementation please see the pull request.
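A minimal sketch of the filtering idea described above: the per-process registry knows which daemon it runs and drops registrations owned by other daemons, so class-initialization side effects cannot leak nimbus metrics into a supervisor. The names here are hypothetical, not Storm's StormMetricRegistry API:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative daemon-scoped registry: registrations declare an owning daemon,
// and registrations from other daemons are silently ignored instead of
// polluting the registry with never-updated garbage metrics.
final class DaemonScopedRegistry {
    enum Daemon { NIMBUS, SUPERVISOR, WORKER }

    private final Daemon runningDaemon;
    private final Map<String, Object> metrics = new TreeMap<>();

    DaemonScopedRegistry(Daemon runningDaemon) { this.runningDaemon = runningDaemon; }

    // Only metrics owned by the daemon this process actually runs are kept.
    void register(Daemon owner, String name, Object metric) {
        if (owner == runningDaemon) {
            metrics.put(name, metric);
        }
    }

    boolean contains(String name) { return metrics.containsKey(name); }
}
```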
[jira] [Commented] (STORM-3094) Topology name needs to be validated at storm-client
[ https://issues.apache.org/jira/browse/STORM-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511194#comment-16511194 ] Manohar Vanam commented on STORM-3094:
--
Thanks [~kabhwan], raised a pull request against 1.x-branch.

> Topology name needs to be validated at storm-client
> ---
>
> Key: STORM-3094
> URL: https://issues.apache.org/jira/browse/STORM-3094
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-client
> Reporter: Manohar Vanam
> Assignee: Manohar Vanam
> Priority: Minor
> Labels: pull-request-available
> Fix For: 2.0.0
> Attachments: patch.diff
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> *Current Behavior:* Submitting a topology with an invalid topology name throws an exception only after the jar has been uploaded.
> *Improvement:* Validate the topology name on the client side, before uploading the jar.
>
> {code:java}
> 2018-06-05 16:16:19.461 o.a.s.d.n.Nimbus pool-21-thread-53 [INFO] Uploading file from client to /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.596 o.a.s.d.n.Nimbus pool-21-thread-35 [INFO] Finished uploading file from client: /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.624 o.a.s.d.n.Nimbus pool-21-thread-29 [INFO] Received topology submission for test-[123] (storm-2.0.0-SNAPSHOT JDK-1.8.0_162) with conf {topology.users=[null], topology.acker.executors=null, storm.zookeeper.superACL=null, topology.workers=3, topology.submitter.principal=, topology.debug=true, topology.name=test-[123], topology.kryo.register={}, storm.id=test-[123]-7-1528195580, topology.kryo.decorators=[], topology.eventlogger.executors=0, topology.submitter.user=mvanam, topology.max.task.parallelism=null}
> 2018-06-05 16:16:20.624 o.a.s.d.n.Nimbus pool-21-thread-29 [INFO] uploadedJar /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.624 o.a.s.b.BlobStore pool-21-thread-29 [ERROR] 'test-[123]-7-1528195580-stormjar.jar' does not appear to be valid. It must match ^[\w \t\._-]+$. And it can't be ".", "..", null or empty string.
> 2018-06-05 16:16:20.625 o.a.s.b.BlobStore pool-21-thread-29 [ERROR] 'test-[123]-7-1528195580-stormconf.ser' does not appear to be valid. It must match ^[\w \t\._-]+$. And it can't be ".", "..", null or empty string.
> 2018-06-05 16:16:20.626 o.a.s.d.n.Nimbus pool-21-thread-29 [WARN] Topology submission exception. (topology name='test-[123]')
> java.lang.IllegalArgumentException: test-[123]-7-1528195580-stormconf.ser does not appear to be a valid blob key
> at org.apache.storm.blobstore.BlobStore.validateKey(BlobStore.java:66) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
> {code}
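A minimal client-side check sketched from the improvement described above. The pattern and the "."/".."/empty rules are taken from the BlobStore error messages in the log; the class and method names are illustrative, not storm-client's actual API:

```java
import java.util.regex.Pattern;

// Illustrative client-side validation: reject names that would fail the
// server-side blob-key rule (^[\w \t\._-]+$, not ".", "..", null or empty)
// before any jar upload happens, instead of after the jar has transferred.
final class TopologyNameSketch {
    private static final Pattern VALID = Pattern.compile("^[\\w \\t\\._-]+$");

    static void validate(String name) {
        if (name == null || name.isEmpty() || ".".equals(name) || "..".equals(name)
                || !VALID.matcher(name).matches()) {
            throw new IllegalArgumentException(
                "'" + name + "' is not a valid topology name; it must match " + VALID.pattern());
        }
    }
}
```

For example, `TopologyNameSketch.validate("test-[123]")` (the name from the log) throws immediately on the client, while `"test-123"` passes.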
[jira] [Resolved] (STORM-3094) Topology name needs to be validated at storm-client
[ https://issues.apache.org/jira/browse/STORM-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved STORM-3094.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Thanks [~ManoharVanam], I merged into master. If you would like to port this back to 1.x, please raise a pull request against 1.x-branch.

> Topology name needs to be validated at storm-client
> ---------------------------------------------------
>
>                 Key: STORM-3094
>                 URL: https://issues.apache.org/jira/browse/STORM-3094
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-client
>            Reporter: Manohar Vanam
>            Assignee: Manohar Vanam
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>         Attachments: patch.diff
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> *Current Behavior:* Submitting a topology with an invalid topology name throws an exception only after the jar has been uploaded.
> *Improvement:* Validate the topology name on the client side, before uploading the jar.
>
> {code:java}
> 2018-06-05 16:16:19.461 o.a.s.d.n.Nimbus pool-21-thread-53 [INFO] Uploading file from client to /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.596 o.a.s.d.n.Nimbus pool-21-thread-35 [INFO] Finished uploading file from client: /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.624 o.a.s.d.n.Nimbus pool-21-thread-29 [INFO] Received topology submission for test-[123] (storm-2.0.0-SNAPSHOT JDK-1.8.0_162) with conf {topology.users=[null], topology.acker.executors=null, storm.zookeeper.superACL=null, topology.workers=3, topology.submitter.principal=, topology.debug=true, topology.name=test-[123], topology.kryo.register={}, storm.id=test-[123]-7-1528195580, topology.kryo.decorators=[], topology.eventlogger.executors=0, topology.submitter.user=mvanam, topology.max.task.parallelism=null}
> 2018-06-05 16:16:20.624 o.a.s.d.n.Nimbus pool-21-thread-29 [INFO] uploadedJar /manu/Git/storm/storm-dist/binary/final-package/target/apache-storm-2.0.0-SNAPSHOT/storm-local/nimbus/inbox/stormjar-22979659-5176-46fd-9027-ffdde13f595a.jar
> 2018-06-05 16:16:20.624 o.a.s.b.BlobStore pool-21-thread-29 [ERROR] 'test-[123]-7-1528195580-stormjar.jar' does not appear to be valid. It must match ^[\w \t\._-]+$. And it can't be ".", "..", null or empty string.
> 2018-06-05 16:16:20.625 o.a.s.b.BlobStore pool-21-thread-29 [ERROR] 'test-[123]-7-1528195580-stormconf.ser' does not appear to be valid. It must match ^[\w \t\._-]+$. And it can't be ".", "..", null or empty string.
> 2018-06-05 16:16:20.626 o.a.s.d.n.Nimbus pool-21-thread-29 [WARN] Topology submission exception. (topology name='test-[123]')
> java.lang.IllegalArgumentException: test-[123]-7-1528195580-stormconf.ser does not appear to be a valid blob key
> at org.apache.storm.blobstore.BlobStore.validateKey(BlobStore.java:66) ~[storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]{code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
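For context, the check the fix moves to the client can be sketched as below. This is a hypothetical illustration, not Storm's actual API: the class and method names are invented, and the only grounded detail is the constraint reported by BlobStore in the log above (the name must match ^[\w \t\._-]+$ and must not be ".", "..", null, or empty, since the name is embedded in blob keys like test-[123]-7-1528195580-stormconf.ser).

{code:java}
import java.util.regex.Pattern;

// Hypothetical client-side validator mirroring the blob-key constraints
// that Nimbus's BlobStore.validateKey enforces server-side. Running this
// check before the jar upload would fail fast on names like "test-[123]".
public class TopologyNameValidator {

    // Same pattern BlobStore reports in the error log: word characters,
    // space, tab, dot, underscore, and hyphen only.
    private static final Pattern VALID_NAME = Pattern.compile("^[\\w \\t\\._-]+$");

    public static boolean isValidName(String name) {
        return name != null
                && !name.isEmpty()
                && !name.equals(".")      // reserved path component
                && !name.equals("..")     // reserved path component
                && VALID_NAME.matcher(name).matches();
    }
}
{code}

With this check, a name such as "test-[123]" is rejected client-side because "[" and "]" fall outside the allowed character class, instead of surfacing later as an IllegalArgumentException from BlobStore after the jar has already been uploaded.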