[jira] [Created] (STORM-3092) Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
Zhengdai Hu created STORM-3092:
------------------------------

Summary: Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
Key: STORM-3092
URL: https://issues.apache.org/jira/browse/STORM-3092
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Fix For: 2.0.0


The bug was introduced in commit 0dac58b0aa82133df242b3b2ebeb65bfea7d63cc, when the launchSupervisorThriftServer method was invoked in the launchDaemon method of the Supervisor class. launchSupervisorThriftServer() makes a blocking call to the Thrift server under the hood, which prevents Utils.addShutdownHookWithForceKillIn1Sec and StormMetricsRegistry.startMetricsReporters from ever being called.

The bug can be fixed by moving launchSupervisorThriftServer to the end of the code block.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
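For reference, a minimal sketch of the proposed reordering. The launchDaemon body below is simplified and only illustrates the ordering; the real method does more, and only the three method names cited in the description above are taken from the report:

{code:java}
private void launchDaemon() {
    try {
        // ... existing supervisor setup ...
        launch();
        // Register the shutdown hook and start the metrics reporters while the
        // calling thread can still make progress.
        Utils.addShutdownHookWithForceKillIn1Sec(this::close);
        StormMetricsRegistry.startMetricsReporters(conf);
        // This call blocks the calling thread on the Thrift server, so it has
        // to be the last statement: anything placed after it is never reached.
        launchSupervisorThriftServer(conf);
    } catch (Exception e) {
        LOG.error("Failed to start supervisor", e);
        throw new RuntimeException(e);
    }
}
{code}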
[jira] [Resolved] (STORM-3092) Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
[ https://issues.apache.org/jira/browse/STORM-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu resolved STORM-3092.
Resolution: Fixed

> Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
> -------------------------------------------------------------------------------------
>
> Key: STORM-3092
> URL: https://issues.apache.org/jira/browse/STORM-3092
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Original Estimate: 24h
> Time Spent: 10m
> Remaining Estimate: 23h 50m
>
> The bug was introduced in commit 0dac58b0aa82133df242b3b2ebeb65bfea7d63cc, when the launchSupervisorThriftServer method was invoked in the launchDaemon method of the Supervisor class. launchSupervisorThriftServer() makes a blocking call to the Thrift server under the hood, which prevents Utils.addShutdownHookWithForceKillIn1Sec and StormMetricsRegistry.startMetricsReporters from ever being called.
>
> The bug can be fixed by moving launchSupervisorThriftServer to the end of the code block.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (STORM-3098) Fix bug in filterChangingBlobsFor() in Slot.java
Zhengdai Hu created STORM-3098:
------------------------------

Summary: Fix bug in filterChangingBlobsFor() in Slot.java
Key: STORM-3098
URL: https://issues.apache.org/jira/browse/STORM-3098
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Fix For: 2.0.0


The following method is not implemented correctly:

{code:java}
private static DynamicState filterChangingBlobsFor(DynamicState dynamicState, final LocalAssignment assignment) {
    if (!dynamicState.changingBlobs.isEmpty()) {
        return dynamicState;
    }

    HashSet<BlobChanging> savedBlobs = new HashSet<>(dynamicState.changingBlobs.size());
    for (BlobChanging rc : dynamicState.changingBlobs) {
        if (forSameTopology(assignment, rc.assignment)) {
            savedBlobs.add(rc);
        } else {
            rc.latch.countDown();
        }
    }
    return dynamicState.withChangingBlobs(savedBlobs);
}
{code}

It never modifies dynamicState in any way: the inverted guard returns early exactly when there are changing blobs to filter, and when the set is empty the loop has nothing to iterate over. The solution is to remove the negation in the first if statement.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
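For reference, a sketch of the corrected guard; the rest of the method is unchanged:

{code:java}
// Nothing is changing, so there is nothing to filter and no latch to count down.
if (dynamicState.changingBlobs.isEmpty()) {
    return dynamicState;
}
{code}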
[jira] [Assigned] (STORM-3098) Fix bug in filterChangingBlobsFor() in Slot.java
[ https://issues.apache.org/jira/browse/STORM-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu reassigned STORM-3098:
--
Assignee: Zhengdai Hu

> Fix bug in filterChangingBlobsFor() in Slot.java
> ------------------------------------------------
>
> Key: STORM-3098
> URL: https://issues.apache.org/jira/browse/STORM-3098
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
> Fix For: 2.0.0
>
>
> The following method is not implemented correctly:
> {code:java}
> private static DynamicState filterChangingBlobsFor(DynamicState dynamicState, final LocalAssignment assignment) {
>     if (!dynamicState.changingBlobs.isEmpty()) {
>         return dynamicState;
>     }
>
>     HashSet<BlobChanging> savedBlobs = new HashSet<>(dynamicState.changingBlobs.size());
>     for (BlobChanging rc : dynamicState.changingBlobs) {
>         if (forSameTopology(assignment, rc.assignment)) {
>             savedBlobs.add(rc);
>         } else {
>             rc.latch.countDown();
>         }
>     }
>     return dynamicState.withChangingBlobs(savedBlobs);
> }
> {code}
> It never modifies dynamicState in any way: the inverted guard returns early exactly when there are changing blobs to filter, and when the set is empty the loop has nothing to iterate over. The solution is to remove the negation in the first if statement.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (STORM-3099) Extend metrics on supervisor and workers
Zhengdai Hu created STORM-3099:
------------------------------

Summary: Extend metrics on supervisor and workers
Key: STORM-3099
URL: https://issues.apache.org/jira/browse/STORM-3099
Project: Apache Storm
Issue Type: Improvement
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Assignee: Zhengdai Hu


This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:

Worker:
# Kill Count by Category - Assignment Change / HB too old / Heap Space
# Time spent in each state
# Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
# Time to start a worker for a topology, from reading the assignment for the first time
# Worker cleanup Time / Worker cleanup Retries
# Worker Suicide Count - category: internal error or Assignment Change

Supervisor:
# Supervisor restart Count
# Blobstore (Request to download time)
# Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
# Download rate of an individual blob (inside the localizer)
# Supervisor localizer thread blob download - how long (outside the localizer)
# Blobstore Update Counts due to Version change
# Blobstore Storage by users

More metrics may be added later.

This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
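To make the shape of these concrete, an illustrative sketch of how a couple of the worker metrics listed above could be expressed against a plain Codahale MetricRegistry. The class, the metric names, and the registration wiring here are hypothetical; the actual names and registry API are part of the patch itself:

{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class SupervisorMetricsSketch {
    private static final MetricRegistry REGISTRY = new MetricRegistry();

    // "Kill Count by Category" becomes one meter per category.
    private static final Meter KILLED_ASSIGNMENT_CHANGE =
        REGISTRY.meter("supervisor:workers-killed-assignment-change");
    private static final Meter KILLED_HB_TIMEOUT =
        REGISTRY.meter("supervisor:workers-killed-hb-timeout");

    // "Time to start worker for topology" becomes a timer around the launch path.
    private static final Timer WORKER_LAUNCH_TIME =
        REGISTRY.timer("supervisor:worker-launch-duration");

    void launchWorker() {
        // Timer.Context is Closeable, so try-with-resources records the elapsed time.
        try (Timer.Context ignored = WORKER_LAUNCH_TIME.time()) {
            // ... read the assignment, localize blobs, start the worker ...
        }
    }

    void onWorkerKilledForReassignment() {
        KILLED_ASSIGNMENT_CHANGE.mark();
    }

    void onWorkerKilledForStaleHeartbeat() {
        KILLED_HB_TIMEOUT.mark();
    }
}
{code}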
[jira] [Updated] (STORM-3099) Extend metrics on supervisor and workers
[ https://issues.apache.org/jira/browse/STORM-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3099:
---
Description:
This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:

Worker:
# Kill Count by Category - Assignment Change / HB too old / Heap Space
# Time spent in each state
# Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
# Time to start a worker for a topology, from reading the assignment for the first time
# Worker cleanup Time / Worker cleanup Retries
# Worker Suicide Count - category: internal error or Assignment Change

Supervisor:
# Supervisor restart Count
# Blobstore (Request to download time)
- # Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
- # Download rate of an individual blob (inside the localizer)
- # Supervisor localizer thread blob download - how long (outside the localizer)
# Blobstore Update Counts due to Version change
# Blobstore Storage by users

More metrics may be added later.

This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

  was:
This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:

Worker:
# Kill Count by Category - Assignment Change / HB too old / Heap Space
# Time spent in each state
# Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
# Time to start a worker for a topology, from reading the assignment for the first time
# Worker cleanup Time / Worker cleanup Retries
# Worker Suicide Count - category: internal error or Assignment Change

Supervisor:
# Supervisor restart Count
# Blobstore (Request to download time)
# Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
# Download rate of an individual blob (inside the localizer)
# Supervisor localizer thread blob download - how long (outside the localizer)
# Blobstore Update Counts due to Version change
# Blobstore Storage by users

More metrics may be added later.

This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

> Extend metrics on supervisor and workers
> ----------------------------------------
>
> Key: STORM-3099
> URL: https://issues.apache.org/jira/browse/STORM-3099
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
>
> This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:
> Worker:
> # Kill Count by Category - Assignment Change / HB too old / Heap Space
> # Time spent in each state
> # Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
> # Time to start a worker for a topology, from reading the assignment for the first time
> # Worker cleanup Time / Worker cleanup Retries
> # Worker Suicide Count - category: internal error or Assignment Change
> Supervisor:
> # Supervisor restart Count
> # Blobstore (Request to download time)
> - # Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
> - # Download rate of an individual blob (inside the localizer)
> - # Supervisor localizer thread blob download - how long (outside the localizer)
> # Blobstore Update Counts due to Version change
> # Blobstore Storage by users
> More metrics may be added later.
> This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (STORM-3101) Select Registry metrics by calling daemon
Zhengdai Hu created STORM-3101:
------------------------------

Summary: Select Registry metrics by calling daemon
Key: STORM-3101
URL: https://issues.apache.org/jira/browse/STORM-3101
Project: Apache Storm
Issue Type: Improvement
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Assignee: Zhengdai Hu
Fix For: 2.0.0


Metrics registered through StormMetricsRegistry are all added via static methods on the registry class and attached to a singleton MetricRegistry object per process. Currently most metrics are bound to classes (static), so the issue occurs when metrics from irrelevant components are accidentally registered during the class initialization phase.

For example, a process running the supervisor daemon will incorrectly register metrics from Nimbus when the BasicContainer class is initialized, because it statically imports "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", which triggers initialization of the Nimbus class and registration of all of its metrics, even though no Nimbus functionality will be used and no Nimbus metrics will ever be updated.

This creates many garbage metrics and makes the metrics hard to read. Therefore we should filter metric registration by the type of daemon that the process actually runs.

For the implementation, please see the pull request.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
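Illustratively, registration can be made a no-op unless the metric's owning daemon matches the daemon the process was started as. This is only a sketch of the idea; DaemonType, setDaemonType and registerMeter are hypothetical names here, and the actual implementation is in the pull request:

{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public final class DaemonFilteredRegistrySketch {
    public enum DaemonType { NIMBUS, SUPERVISOR, WORKER }

    private static final MetricRegistry REGISTRY = new MetricRegistry();
    private static volatile DaemonType runningDaemon;

    /** Called once from the daemon's main() before any metric-owning class is loaded. */
    public static void setDaemonType(DaemonType type) {
        runningDaemon = type;
    }

    /** Attaches the meter to the process registry only if its owner is actually running. */
    public static Meter registerMeter(DaemonType owner, String name) {
        if (owner != runningDaemon) {
            // Unattached meter: callers can still mark() it, but reporters never see it,
            // so class initialization of another daemon's code registers nothing.
            return new Meter();
        }
        return REGISTRY.meter(name);
    }
}
{code}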
[jira] [Updated] (STORM-3101) Select Registry metrics by running daemon
[ https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3101:
---
Summary: Select Registry metrics by running daemon  (was: Select Registry metrics by calling daemon)

> Select Registry metrics by running daemon
> -----------------------------------------
>
> Key: STORM-3101
> URL: https://issues.apache.org/jira/browse/STORM-3101
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
> Fix For: 2.0.0
>
>
> Metrics registered through StormMetricsRegistry are all added via static methods on the registry class and attached to a singleton MetricRegistry object per process. Currently most metrics are bound to classes (static), so the issue occurs when metrics from irrelevant components are accidentally registered during the class initialization phase.
> For example, a process running the supervisor daemon will incorrectly register metrics from Nimbus when the BasicContainer class is initialized, because it statically imports "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", which triggers initialization of the Nimbus class and registration of all of its metrics, even though no Nimbus functionality will be used and no Nimbus metrics will ever be updated.
> This creates many garbage metrics and makes the metrics hard to read. Therefore we should filter metric registration by the type of daemon that the process actually runs.
> For the implementation, please see the pull request.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (STORM-3104) Delayed launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3104:
---
Description:
There is a comparison in
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This is because the current assignment in this case is highly likely to be null (I'm not sure if it's guaranteed), and it causes a delay in worker start/restart.

The symptom can be reproduced by launching an empty supervisor and submitting any topology. Here's the log sample:

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2018 -> WAITING_FOR_BLOB_UPDA
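For clarity, the shape of the suspect check described above, paraphrased. This is an illustrative sketch, not the exact Slot.java source; equivalent(), withState() and MachineState follow Slot's naming conventions but the body is an assumption:

{code:java}
// Illustrative paraphrase of the problematic comparison, not the actual source.
static DynamicState handleWaitingForBlobUpdate(DynamicState dynamicState, StaticState staticState) {
    // On a slot that has never run an assignment, currentAssignment is still null,
    // so this inequality holds and the slot is routed straight back to
    // WAITING_FOR_BLOB_LOCALIZATION instead of finishing the blob update,
    // producing the loop visible in the log above.
    if (!equivalent(dynamicState.currentAssignment, dynamicState.newAssignment)) {
        return dynamicState.withState(MachineState.WAITING_FOR_BLOB_LOCALIZATION);
    }
    // ... otherwise wait for the changing blobs to finish and launch the container ...
    return dynamicState;
}
{code}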
[jira] [Created] (STORM-3104) Delayed launch due to accidental transitioning in state machine
Zhengdai Hu created STORM-3104:
------------------------------

Summary: Delayed launch due to accidental transitioning in state machine
Key: STORM-3104
URL: https://issues.apache.org/jira/browse/STORM-3104
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Fix For: 2.0.0


There is a comparison in
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This is because the current assignment in this case is highly likely to be null (I'm not sure if it's guaranteed), and it causes a delay in worker start/restart.

The symptom can be reproduced by launching an empty supervisor and submitting any topology. Here's the log sample:

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3104:
---
Summary: Delayed worker launch due to accidental transitioning in state machine  (was: Delayed launch due to accidental transitioning in state machine)

> Delayed worker launch due to accidental transitioning in state machine
> ----------------------------------------------------------------------
>
> Key: STORM-3104
> URL: https://issues.apache.org/jira/browse/STORM-3104
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Critical
> Fix For: 2.0.0
>
>
> There is a comparison in
> {code:java}
> handleWaitingForBlobUpdate()
> {code}
> between the dynamic state's current assignment and new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This is because the current assignment in this case is highly likely to be null (I'm not sure if it's guaranteed), and it causes a delay in worker start/restart.
> The symptom can be reproduced by launching an empty supervisor and submitting any topology. Here's the log sample:
> {code:sh}
> 2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
> 2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
> 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
> 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
> 2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
> 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
> 2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
> 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
> 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_en
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3104: --- Description: In Slot.java, there is a comparison in {code:java} handleWaitingForBlobUpdate() {code} between the dynamic state's current assignment and the new assignment, which accidentally routes a state machine that has just transitioned from WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION, because the current assignment at that point is most likely null and thus different from the new assignment (I'm not sure if that's guaranteed). This delays worker start/restart. The symptom can be reproduced by launching an empty Storm cluster and submitting any topology. Here's a log sample (the relevant transition starts at 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG]): {code:sh} 2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024 2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are [] 2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending... 
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending... 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container... 2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE 2
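A minimal sketch of the suspected logic, with hypothetical names (the real comparison lives in Slot.java's handleWaitingForBlobUpdate() and differs in detail): while the slot is still localizing its first assignment, the current assignment is null, so an equality check against the new assignment can never pass, and the state machine is routed back to WAITING_FOR_BLOB_LOCALIZATION even though nothing changed.
{code:java}
import java.util.Objects;

public class SlotTransitionSketch {
    enum MachineState { WAITING_FOR_BLOB_LOCALIZATION, WAITING_FOR_BLOB_UPDATE }

    // Hypothetical stand-in for the assignment comparison; the actual
    // DynamicState handling in Slot.java is more involved.
    static MachineState nextState(Object currentAssignment, Object newAssignment) {
        if (!Objects.equals(currentAssignment, newAssignment)) {
            // On a fresh slot currentAssignment is null, so this branch is
            // always taken and the worker launch is delayed by another pass.
            return MachineState.WAITING_FOR_BLOB_LOCALIZATION;
        }
        return MachineState.WAITING_FOR_BLOB_UPDATE;
    }

    public static void main(String[] args) {
        // Mirrors the "pendingLocalization ... == current null ? false" log line.
        System.out.println(nextState(null, "test-1-1528927024"));
        // -> WAITING_FOR_BLOB_LOCALIZATION
    }
}
{code}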
[jira] [Updated] (STORM-3101) Fix unexpected metrics registration in StormMetricsRegistry
[ https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3101: --- Summary: Fix unexpected metrics registration in StormMetricsRegistry (was: Select Registry metrics by running daemon) > Fix unexpected metrics registration in StormMetricsRegistry > --- > > Key: STORM-3101 > URL: https://issues.apache.org/jira/browse/STORM-3101 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Metrics registered through StormMetricsRegistry are all added via static > methods on the registry class and attached to a singleton MetricRegistry > object per process. Currently most metrics are bound to classes (static > fields), so the issue occurs when metrics from irrelevant components are > accidentally registered during the class initialization phase. > For example, a process running the supervisor daemon will incorrectly > register metrics from Nimbus when the BasicContainer class is initialized and > statically imports > "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", > which triggers initialization of the Nimbus class and all of its metric > registrations, even though no nimbus daemon functionality will be used and no > nimbus metrics will be updated. > This creates many garbage metrics and makes the metrics hard to read. > Therefore we should filter metric registration by the type of daemon that the > process actually runs. > For the implementation please see the pull request. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
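The class-initialization trigger here is plain Java semantics: statically importing (or otherwise referencing) a field that is not a compile-time constant forces the declaring class to initialize, which runs all of its static metric registrations. A self-contained sketch; the class and metric names below are made up for illustration, not Storm's actual ones:
{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class InitLeakDemo {
    static final MetricRegistry REGISTRY = new MetricRegistry();

    static class FakeNimbus {
        // Not a compile-time constant, so any reference to it forces
        // FakeNimbus class initialization...
        static final Object MIN_VERSION_SUPPORT_RPC_HEARTBEAT = new Object();
        // ...which registers every "nimbus" metric as a side effect.
        static final Meter SUBMIT_CALLS =
            REGISTRY.meter("nimbus:num-submitTopology-calls");
    }

    public static void main(String[] args) {
        // A supervisor-side class merely touching the field pulls in all
        // Nimbus metrics, even though no Nimbus code will ever run here.
        Object touched = FakeNimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT;
        System.out.println(REGISTRY.getMeters().keySet());
        // prints [nimbus:num-submitTopology-calls]
    }
}
{code}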
[jira] [Assigned] (STORM-3109) Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers
[ https://issues.apache.org/jira/browse/STORM-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu reassigned STORM-3109: -- Assignee: Zhengdai Hu > Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers > - > > Key: STORM-3109 > URL: https://issues.apache.org/jira/browse/STORM-3109 > Project: Apache Storm > Issue Type: Bug > Components: storm-core >Affects Versions: 2.0.0, 1.1.0, 1.0.3, 1.x, 1.0.4, 1.1.1, 1.2.0, 1.1.2, > 1.0.5, 1.0.6, 1.2.1, 1.1.3, 1.2.2 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > When `STORM_LOCAL_DIR` is set to a relative path, the original implementation > incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon > invocation of `storm kill_workers`. In this case `STORM_LOCAL_DIR` points to > the wrong location, so `storm kill_workers` can't actually kill any workers. > See pull request for implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3109) Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers
Zhengdai Hu created STORM-3109: -- Summary: Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers Key: STORM-3109 URL: https://issues.apache.org/jira/browse/STORM-3109 Project: Apache Storm Issue Type: Bug Components: storm-core Affects Versions: 1.2.2, 1.1.3, 1.2.1, 1.0.6, 1.0.5, 1.1.2, 1.2.0, 1.1.1, 1.0.4, 1.0.3, 1.1.0, 2.0.0, 1.x Reporter: Zhengdai Hu Fix For: 2.0.0 When `STORM_LOCAL_DIR` is set to a relative path, the original implementation incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon invocation of `storm kill_workers`. In this case `STORM_LOCAL_DIR` points to the wrong location, so `storm kill_workers` can't actually kill any workers. See pull request for implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3109) Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers
[ https://issues.apache.org/jira/browse/STORM-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3109: --- Description: When `STORM_LOCAL_DIR` is set to a relative path, the original implementation incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon invocation of `storm kill_workers`. If the current working directory is not the home directory for storm, `STORM_LOCAL_DIR` then points to the wrong location, so `storm kill_workers` can't actually kill any workers. See pull request for implementation. was: When `STORM_LOCAL_DIR` is set to a relative path, the original implementation incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon invocation of `storm kill_workers`. In this case `STORM_LOCAL_DIR` points to the wrong location, so `storm kill_workers` can't actually kill any workers. See pull request for implementation. > Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers > - > > Key: STORM-3109 > URL: https://issues.apache.org/jira/browse/STORM-3109 > Project: Apache Storm > Issue Type: Bug > Components: storm-core >Affects Versions: 2.0.0, 1.1.0, 1.0.3, 1.x, 1.0.4, 1.1.1, 1.2.0, 1.1.2, > 1.0.5, 1.0.6, 1.2.1, 1.1.3, 1.2.2 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Critical > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > When `STORM_LOCAL_DIR` is set to a relative path, the original implementation > incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon > invocation of `storm kill_workers`. If the current working directory is not > the home directory for storm, `STORM_LOCAL_DIR` then points to the wrong > location, so `storm kill_workers` can't actually kill any workers. > See pull request for implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
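The fix itself is in the `storm` launcher (see the pull request), but the path logic is easy to sketch. A hedged illustration in Java with made-up names (`resolveLocalDir` is not a real Storm API): a relative `STORM_LOCAL_DIR` should be resolved against the Storm home directory rather than whatever directory the command happens to be run from.
{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalDirResolution {
    static Path resolveLocalDir(String stormHome, String stormLocalDir) {
        Path configured = Paths.get(stormLocalDir);
        if (configured.isAbsolute()) {
            return configured.normalize();
        }
        // The buggy behavior amounts to configured.toAbsolutePath(), which
        // silently anchors the path at the current working directory.
        return Paths.get(stormHome).resolve(configured).normalize();
    }

    public static void main(String[] args) {
        System.out.println(resolveLocalDir("/opt/storm", "storm-local"));
        // -> /opt/storm/storm-local, regardless of where the CLI was invoked
    }
}
{code}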
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3104: --- Priority: Major (was: Critical) > Delayed worker launch due to accidental transitioning in state machine > -- > > Key: STORM-3104 > URL: https://issues.apache.org/jira/browse/STORM-3104 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > In Slot.java, there is a comparison in > {code:java} > handleWaitingForBlobUpdate() > {code} > between the dynamic state's current assignment and the new assignment, which > accidentally routes a state machine that has just transitioned from > WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION, > because the current assignment at that point is most likely null and thus > different from the new assignment (I'm not sure if that's guaranteed). This > delays worker start/restart. > The symptom can be reproduced by launching an empty Storm cluster and > submitting any topology. Here's a log sample (the relevant transition starts > at 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG]): > {code:sh} > 2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY > msInState: 6024 -> EMPTY msInState: 6024 > 2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY > 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from > EMPTY to WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY > msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0 > 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE > WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs > are [] > 2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE > WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> > WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 > 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE > WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs > [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_start:7, task_end:7), > ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, > task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, > cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, > resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, > cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING > LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_start:7, task_end:7), > ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, > task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, > cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, > resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, > cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to > pending... 
> 2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE > WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> > WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 > 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE > WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs > [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_start:7, task_end:7), > ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, > task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, > cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, > resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, > cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to > pending... > 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_sta
[jira] [Updated] (STORM-3099) Extend metrics on supervisor, workers, and DRPC
[ https://issues.apache.org/jira/browse/STORM-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3099: --- Description: This patch serves to extend metrics on supervisor and worker. Currently the following metrics are being implemented, including but not limited to: Worker: # Kill Count by Category - Assignment Change/HB too old/Heap Space # Time spent in each state # Time to Actually Kill worker (from identifying need by supervisor and actual change in the state of the worker) - per worker? # Time to start worker for topology from reading assignment for the first time. # Worker cleanup Time/Worker cleanup Retries # Worker Suicide Count - category: internal error or Assignment Change Supervisor: # Supervisor restart Count # Blobstore (Request to download time) - # Download time for an individual blob (inside localizer): from the localizer getting the request to the HDFS download actually finishing - # Download rate for an individual blob (inside localizer) - # Supervisor localizer thread blob download - how long (outside localizer) # Blobstore Update due to Version change Cnts # Blobstore Storage by users DRPC: # Avg/Max Time to respond to Http Request There might be more metrics added later. This patch will also refactor code in relevant files. Bugs found during the process will be reported in other issues and handled separately. was: This patch serves to extend metrics on supervisor and worker. Currently the following metrics are being implemented, including but not limited to: Worker: # Kill Count by Category - Assignment Change/HB too old/Heap Space # Time spent in each state # Time to Actually Kill worker (from identifying need by supervisor and actual change in the state of the worker) - per worker? # Time to start worker for topology from reading assignment for the first time. # Worker cleanup Time/Worker cleanup Retries # Worker Suicide Count - category: internal error or Assignment Change Supervisor: # Supervisor restart Count # Blobstore (Request to download time) - # Download time for an individual blob (inside localizer): from the localizer getting the request to the HDFS download actually finishing - # Download rate for an individual blob (inside localizer) - # Supervisor localizer thread blob download - how long (outside localizer) # Blobstore Update due to Version change Cnts # Blobstore Storage by users There might be more metrics added later. This patch will also refactor code in relevant files. Bugs found during the process will be reported in other issues and handled separately. > Extend metrics on supervisor, workers, and DRPC > --- > > Key: STORM-3099 > URL: https://issues.apache.org/jira/browse/STORM-3099 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > This patch serves to extend metrics on supervisor and worker. Currently the > following metrics are being implemented, including but not limited to: > Worker: > # Kill Count by Category - Assignment Change/HB too old/Heap Space > # Time spent in each state > # Time to Actually Kill worker (from identifying need by supervisor and > actual change in the state of the worker) - per worker? > # Time to start worker for topology from reading assignment for the first > time. 
> # Worker cleanup Time/Worker cleanup Retries > # Worker Suicide Count - category: internal error or Assignment Change > Supervisor: > # Supervisor restart Count > # Blobstore (Request to download time) > - # Download time for an individual blob (inside localizer): from the > localizer getting the request to the HDFS download actually finishing > - # Download rate for an individual blob (inside localizer) > - # Supervisor localizer thread blob download - how long (outside > localizer) > # Blobstore Update due to Version change Cnts > # Blobstore Storage by users > DRPC: > # Avg/Max Time to respond to Http Request > There might be more metrics added later. > This patch will also refactor code in relevant files. Bugs found during the > process will be reported in other issues and handled separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3099) Extend metrics on supervisor, workers, and DRPC
[ https://issues.apache.org/jira/browse/STORM-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3099: --- Summary: Extend metrics on supervisor, workers, and DRPC (was: Extend metrics on supervisor and workers) > Extend metrics on supervisor, workers, and DRPC > --- > > Key: STORM-3099 > URL: https://issues.apache.org/jira/browse/STORM-3099 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > This patch serves to extend metrics on supervisor and worker. Currently the > following metrics are being implemented, including but not limited to: > Worker: > # Kill Count by Category - Assignment Change/HB too old/Heap Space > # Time spent in each state > # Time to Actually Kill worker (from identifying need by supervisor and > actual change in the state of the worker) - per worker? > # Time to start worker for topology from reading assignment for the first > time. > # Worker cleanup Time/Worker cleanup Retries > # Worker Suicide Count - category: internal error or Assignment Change > Supervisor: > # Supervisor restart Count > # Blobstore (Request to download time) > - # Download time for an individual blob (inside localizer): from the > localizer getting the request to the HDFS download actually finishing > - # Download rate for an individual blob (inside localizer) > - # Supervisor localizer thread blob download - how long (outside > localizer) > # Blobstore Update due to Version change Cnts > # Blobstore Storage by users > There might be more metrics added later. > This patch will also refactor code in relevant files. Bugs found during the > process will be reported in other issues and handled separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3125) Refactoring methods in Supervisor's component
Zhengdai Hu created STORM-3125: -- Summary: Refactoring methods in Supervisor's component Key: STORM-3125 URL: https://issues.apache.org/jira/browse/STORM-3125 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 This is a supplemental issue to STORM-3099, separating the refactoring work out from the metrics additions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3125) Refactoring methods in Supervisor's components
[ https://issues.apache.org/jira/browse/STORM-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3125: --- Summary: Refactoring methods in Supervisor's components (was: Refactoring methods in Supervisor's component) > Refactoring methods in Supervisor's components > -- > > Key: STORM-3125 > URL: https://issues.apache.org/jira/browse/STORM-3125 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > This is a supplemental issue to STORM-3099, separating the refactoring > work out from the metrics additions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3125) Refactoring methods in components for Supervisor and DRPC
[ https://issues.apache.org/jira/browse/STORM-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3125: --- Summary: Refactoring methods in components for Supervisor and DRPC (was: Refactoring methods in Supervisor's components) > Refactoring methods in components for Supervisor and DRPC > - > > Key: STORM-3125 > URL: https://issues.apache.org/jira/browse/STORM-3125 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > This is a supplemental issue to STORM-3099, separating the refactoring > work out from the metrics additions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3126) Avoid unnecessary force kill when invoking storm kill_workers
Zhengdai Hu created STORM-3126: -- Summary: Avoid unnecessary force kill when invoking storm kill_workers Key: STORM-3126 URL: https://issues.apache.org/jira/browse/STORM-3126 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The supervisor tries to force kill a worker before checking whether it has already died, leading to unnecessary force kill calls. This is minor but does help clean up the logs a little bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
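A minimal sketch of the reordering, using a hypothetical container interface rather than the real supervisor Container API:
{code:java}
import java.io.IOException;

public class KillSequenceSketch {
    interface WorkerContainer {
        boolean areAllProcessesDead() throws IOException;
        void kill() throws IOException;       // graceful shutdown request
        void forceKill() throws IOException;  // hard kill escalation
    }

    static void shutdown(WorkerContainer c) throws IOException {
        c.kill();
        // Before: the escalation happened without checking for liveness.
        // After: skip the force kill (and its log noise) when the worker
        // has already exited from the graceful kill.
        if (!c.areAllProcessesDead()) {
            c.forceKill();
        }
    }
}
{code}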
[jira] [Created] (STORM-3127) Avoid potential race condition
Zhengdai Hu created STORM-3127: -- Summary: Avoid potential race condition Key: STORM-3127 URL: https://issues.apache.org/jira/browse/STORM-3127 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 PortAndAssignment and its call back is added after update to a blob is invoked asynchronously. It is not guaranteed that the new dependent worker will be registered before blob informs its update to listening workers. This can be fixed by moving addReference call up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
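A self-contained sketch of the race with illustrative names (not the actual AsyncLocalizer/PortAndAssignment API): if the asynchronous update is kicked off before the dependent worker registers its callback, the notification can fire against an empty listener list.
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;

public class BlobUpdateRaceSketch {
    static class LocalizedBlob {
        private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

        void addReference(Runnable onChange) { listeners.add(onChange); }

        void updateAsync(ExecutorService pool) {
            // Notifies whoever is registered at the time the task runs.
            pool.submit(() -> listeners.forEach(Runnable::run));
        }
    }

    static void localize(LocalizedBlob blob, ExecutorService pool, Runnable workerCallback) {
        // Buggy order: updateAsync(pool) first, addReference(...) second --
        // the async notification may run before the worker is registered.
        // Fixed order: register the dependent worker first.
        blob.addReference(workerCallback);
        blob.updateAsync(pool);
    }
}
{code}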
[jira] [Created] (STORM-3128) Connection refused error in AsyncLocalizerTest
Zhengdai Hu created STORM-3128: -- Summary: Connection refused error in AsyncLocalizerTest Key: STORM-3128 URL: https://issues.apache.org/jira/browse/STORM-3128 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created and tries but failed to connect to zookeeper due to connection error. I'm not sure if this compromises the test even though it is passed after connection retry timeout. But it's nice to keep in mind. {noformat} 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) 2018-06-27 13:05:28.032 [main] INFO org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_171] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_171] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525364#comment-16525364 ] Zhengdai Hu commented on STORM-3128: This issue is discovered when I tried to refactor the test > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > In AsyncLocalizerTest's testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to ZooKeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout, but it's worth keeping in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525364#comment-16525364 ] Zhengdai Hu edited comment on STORM-3128 at 6/27/18 6:17 PM: - I discovered the issue when trying to refactor the test was (Author: zhengdai): This issue is discovered when I tried to refactor the test > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > In AsyncLocalizerTest's testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to ZooKeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout, but it's worth keeping in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3125) Refactoring methods in components for Supervisor and DRPC
[ https://issues.apache.org/jira/browse/STORM-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3125: --- Description: This is a supplemental issue to STORM-3099, separating the refactoring work out from the metrics additions. A few misc bugs discovered during refactoring have been incorporated into this issue as well. See links for more information. was: This is a supplemental issue to STORM-3099, separating the refactoring work out from the metrics additions. > Refactoring methods in components for Supervisor and DRPC > - > > Key: STORM-3125 > URL: https://issues.apache.org/jira/browse/STORM-3125 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > This is a supplemental issue to STORM-3099, separating the refactoring > work out from the metrics additions. > A few misc bugs discovered during refactoring have been incorporated into > this issue as well. See links for more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3129) Worker state machine does not use correct time util to get start time
Zhengdai Hu created STORM-3129: -- Summary: Worker state machine does not use correct time util to get start time Key: STORM-3129 URL: https://issues.apache.org/jira/browse/STORM-3129 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The current implementation uses System.currentTimeMillis() instead of Time.currentTimeMillis() to get the state start time. This may create problems in unit tests, which use simulated time controlled by Storm's Time util. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
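A sketch of why this matters, based on the simulated-time mode of org.apache.storm.utils.Time (the exact test wiring may differ by version): under simulated time the Storm clock is frozen and advanced manually, so timestamps taken via System.currentTimeMillis() are invisible to simulated-time assertions.
{code:java}
import org.apache.storm.utils.Time;

public class SimulatedTimeDemo {
    public static void main(String[] args) throws Exception {
        try (Time.SimulatedTime ignored = new Time.SimulatedTime()) {
            long start = Time.currentTimeMillis(); // simulated clock
            Time.advanceTime(5_000);               // jump ahead 5 seconds
            long elapsed = Time.currentTimeMillis() - start;
            // elapsed == 5000 only because start came from Time, not System;
            // a System.currentTimeMillis() start would observe ~0 ms elapsed.
            System.out.println("elapsed (simulated): " + elapsed);
        }
    }
}
{code}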
[jira] [Created] (STORM-3130) Add Timer registration and Timed object wrapper to Storm metrics util.
Zhengdai Hu created STORM-3130: -- Summary: Add Timer registration and Timed object wrapper to Storm metrics util. Key: STORM-3130 URL: https://issues.apache.org/jira/browse/STORM-3130 Project: Apache Storm Issue Type: New Feature Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 This allows us to time method running duration or variable/resource lifespan. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
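A hypothetical sketch of what such a helper could look like on top of the Dropwizard metrics library that StormMetricsRegistry wraps; the names registerTimer and timed are assumptions, not the API actually added:
{code:java}
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.Callable;

public class TimedUtilSketch {
    static final MetricRegistry REGISTRY = new MetricRegistry();

    static Timer registerTimer(String name) {
        return REGISTRY.timer(name);
    }

    // Times one method invocation; the Timer records call rate and duration.
    static <T> T timed(Timer timer, Callable<T> body) throws Exception {
        try (Timer.Context ignored = timer.time()) {
            return body.call();
        }
    }

    public static void main(String[] args) throws Exception {
        Timer t = registerTimer("supervisor:blob-download-duration");
        String result = timed(t, () -> "downloaded");
        System.out.println(result + ", samples recorded: " + t.getCount());
    }
}
{code}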
[jira] [Created] (STORM-3133) Extend metrics on Nimbus and LogViewer
Zhengdai Hu created STORM-3133: -- Summary: Extend metrics on Nimbus and LogViewer Key: STORM-3133 URL: https://issues.apache.org/jira/browse/STORM-3133 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) Nimbus Additional: - Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. - Scheduler related metrics would be a long list generic and specific to different strategies. - Most if not all cluster summary can be pushed as Metrics. - Restart cnt - Nimbus loss of leadership(?) - UI not responding (https://jira.ouroath.com/browse/YSTORM-4838) - Negative resource scheduling events (https://jira.ouroath.com/browse/YSTORM-4940) - Excessive scheduling time (?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3140) Duplicated method?
Zhengdai Hu created STORM-3140: -- Summary: Duplicated method? Key: STORM-3140 URL: https://issues.apache.org/jira/browse/STORM-3140 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu {code:java} /** * Handles '/searchLogs' request. */ @GET @Path("/searchLogs") public Response searchLogs(@Context HttpServletRequest request) throws IOException { String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); } /** * Handles '/listLogs' request. */ @GET @Path("/listLogs") public Response listLogs(@Context HttpServletRequest request) throws IOException { meterListLogsHttpRequests.mark(); String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); }{code} These two methods are identical, although they seem to serve different functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3140) Duplicated method in Logviewer REST API?
[ https://issues.apache.org/jira/browse/STORM-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3140: --- Description: {code:java} /** * Handles '/searchLogs' request. */ @GET @Path("/searchLogs") public Response searchLogs(@Context HttpServletRequest request) throws IOException { String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); } /** * Handles '/listLogs' request. */ @GET @Path("/listLogs") public Response listLogs(@Context HttpServletRequest request) throws IOException { meterListLogsHttpRequests.mark(); String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); }{code} These two methods are identical although they seem to serve different functions. was: {code:java} /** * Handles '/searchLogs' request. */ @GET @Path("/searchLogs") public Response searchLogs(@Context HttpServletRequest request) throws IOException { String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); } /** * Handles '/listLogs' request. */ @GET @Path("/listLogs") public Response listLogs(@Context HttpServletRequest request) throws IOException { meterListLogsHttpRequests.mark(); String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); }{code} These two methods have identical although they seem to serve different functions. > Duplicated method in Logviewer REST API? > > > Key: STORM-3140 > URL: https://issues.apache.org/jira/browse/STORM-3140 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > {code:java} > /** > * Handles '/searchLogs' request. > */ > @GET > @Path("/searchLogs") > public Response searchLogs(@Context HttpServletRequest request) throws > IOException { > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > } > /** > * Handles '/listLogs' request. 
> */ > @GET > @Path("/listLogs") > public Response listLogs(@Context HttpServletRequest request) throws > IOException { > meterListLogsHttpRequests.mark(); > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > }{code} > These two methods are identical although they seem to serve different > functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3140) Duplicated method in Logviewer REST API?
[ https://issues.apache.org/jira/browse/STORM-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3140: --- Summary: Duplicated method in Logviewer REST API? (was: Duplicated method?) > Duplicated method in Logviewer REST API? > > > Key: STORM-3140 > URL: https://issues.apache.org/jira/browse/STORM-3140 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > {code:java} > /** > * Handles '/searchLogs' request. > */ > @GET > @Path("/searchLogs") > public Response searchLogs(@Context HttpServletRequest request) throws > IOException { > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > } > /** > * Handles '/listLogs' request. > */ > @GET > @Path("/listLogs") > public Response listLogs(@Context HttpServletRequest request) throws > IOException { > meterListLogsHttpRequests.mark(); > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > }{code} > These two methods are identical, although they seem to serve different > functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
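If the duplication turns out to be unintended, one hedged way to remove it is to have both endpoints delegate to a shared private helper until the correct searchLogs implementation is sorted out. This fragment reuses the identifiers quoted above; listLogFilesFor is a hypothetical name:
{code:java}
/**
 * Handles '/listLogs' request.
 */
@GET
@Path("/listLogs")
public Response listLogs(@Context HttpServletRequest request) throws IOException {
    meterListLogsHttpRequests.mark();
    return listLogFilesFor(request);
}

// Shared helper extracted from the two identical bodies above.
private Response listLogFilesFor(HttpServletRequest request) throws IOException {
    String user = httpCredsHandler.getUserName(request);
    String topologyId = request.getParameter("topoId");
    String portStr = request.getParameter("port");
    String callback = request.getParameter("callback");
    String origin = request.getHeader("Origin");
    return logviewer.listLogFiles(user,
        portStr != null ? Integer.parseInt(portStr) : null,
        topologyId, callback, origin);
}
{code}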
[jira] [Updated] (STORM-3143) Unnecessary inclusion of empty match result in Json
[ https://issues.apache.org/jira/browse/STORM-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3143: --- Summary: Unnecessary inclusion of empty match result in Json (was: Unnecessary inclusion of empty string match result in Json) > Unnecessary inclusion of empty match result in Json > --- > > Key: STORM-3143 > URL: https://issues.apache.org/jira/browse/STORM-3143 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > `FindNMatches()` didn't correctly filter out empty match results in > `substringSearch()` and hence sends back an empty map to the user. I don't > know if this is the desired behavior, but a fix to the current behavior will > make metrics for the logviewer easier to implement. > An example of the current behavior: > {code:json} > { > "fileOffset": 1, > "searchString": "sdf", > "matches": [ > { > "searchString": "sdf", > "fileName": "word-count-1-1530815972/6701/worker.log", > "matches": [], > "port": "6701", > "isDaemon": "no", > "startByteOffset": 0 > } > ] > } > {code} > Desired behavior: > {code:json} > { > "fileOffset": 1, > "searchString": "sdf", > "matches": [] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3143) Unnecessary inclusion of empty string match result in Json
Zhengdai Hu created STORM-3143: -- Summary: Unnecessary inclusion of empty string match result in Json Key: STORM-3143 URL: https://issues.apache.org/jira/browse/STORM-3143 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 `FindNMatches()` didn't correctly filter out empty match results in `substringSearch()` and hence sends back an empty map to the user. I don't know if this is the desired behavior, but a fix to the current behavior will make metrics for the logviewer easier to implement. An example of the current behavior: {code:json} { "fileOffset": 1, "searchString": "sdf", "matches": [ { "searchString": "sdf", "fileName": "word-count-1-1530815972/6701/worker.log", "matches": [], "port": "6701", "isDaemon": "no", "startByteOffset": 0 } ] } {code} Desired behavior: {code:json} { "fileOffset": 1, "searchString": "sdf", "matches": [] } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
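A minimal sketch of the proposed filtering, assuming a hypothetical per-file result map shaped like the JSON above (the real search code works on its own result types): drop per-file entries whose inner "matches" list is empty, so the top-level "matches" array only lists files with at least one hit.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class EmptyMatchFilterSketch {
    static List<Map<String, Object>> keepNonEmpty(List<Map<String, Object>> perFileResults) {
        List<Map<String, Object>> filtered = new ArrayList<>();
        for (Map<String, Object> fileResult : perFileResults) {
            List<?> hits = (List<?>) fileResult.get("matches");
            // Skip files that matched nothing instead of returning them
            // with an empty "matches" array.
            if (hits != null && !hits.isEmpty()) {
                filtered.add(fileResult);
            }
        }
        return filtered;
    }
}
{code}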
[jira] [Created] (STORM-3144) Extend metrics on Nimbus
Zhengdai Hu created STORM-3144: -- Summary: Extend metrics on Nimbus Key: STORM-3144 URL: https://issues.apache.org/jira/browse/STORM-3144 Project: Apache Storm Issue Type: Improvement Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 Metrics include: # File upload time # Nimbus restart count # Nimbus loss of leadership: meter marking when a nimbus node gains or loses leadership # Excessive scheduling time (both duration distribution and current longest) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3133) Extend metrics on Nimbus and LogViewer
[ https://issues.apache.org/jira/browse/STORM-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3133: --- Description: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) -Nimbus Additional:- - -Topology stormjar.ser/stormconf.ser/stormser.ser file upload time.- - -Scheduler related metrics would be a long list generic and specific to different strategies.- - -Most if not all cluster summary can be pushed as Metrics.- - -Restart cnt- - -Nimbus loss of leadership(?)- - -UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838])- - -Negative resource scheduling events ([https://jira.ouroath.com/browse/YSTORM-4940])- - -Excessive scheduling time (?)- Nimbus metrics have been moved to STORM-3144 was: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) Nimbus Additional: - Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. - Scheduler related metrics would be a long list generic and specific to different strategies. - Most if not all cluster summary can be pushed as Metrics. - Restart cnt - Nimbus loss of leadership(?) - UI not responding (https://jira.ouroath.com/browse/YSTORM-4838) - Negative resource scheduling events (https://jira.ouroath.com/browse/YSTORM-4940) - Excessive scheduling time (?) > Extend metrics on Nimbus and LogViewer > -- > > Key: STORM-3133 > URL: https://issues.apache.org/jira/browse/STORM-3133 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Including but not limited to: > Logviewer > 1. Clean-up time > 2. Time to complete one clean-up loop. > 3. Disk usage by logs before and after a cleanup loop (just like GC?). > 4. Failures/exceptions. > 5. Search request Cnt: By category - Archived/non-archived > 6. Search Request - Response time > 7. Search Request - 0 result Cnt > 8. Search Result - open files > 9. File partial read count > 10. File Download request Cnt and Size served > 11. Disk IO by logviewer > 12. 
CPU usage (for unzipping files) > -Nimbus Additional:- > - -Topology stormjar.ser/stormconf.ser/stormser.ser file upload time.- > - -Scheduler related metrics would be a long list generic and specific to > different strategies.- > - -Most if not all cluster summary can be pushed as Metrics.- > - -Restart cnt- > - -Nimbus loss of leadership(?)- > - -UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838])- > - -Negative resource scheduling events > ([https://jira.ouroath.com/browse/YSTORM-4940])- > - -Excessive scheduling time (?)- > > Nimbus metrics have been moved to STORM-3144 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3133) Extend metrics on Nimbus and LogViewer
[ https://issues.apache.org/jira/browse/STORM-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3133: --- Description: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) Nimbus Additional: * Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. * Scheduler related metrics would be a long list generic and specific to different strategies. * Most if not all cluster summary can be pushed as Metrics. * Restart cnt * Nimbus loss of leadership (?) * UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838]) * Negative resource scheduling events ([https://jira.ouroath.com/browse/YSTORM-4940]) * Excessive scheduling time (?) was: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) -Nimbus Additional:- - -Topology stormjar.ser/stormconf.ser/stormser.ser file upload time.- - -Scheduler related metrics would be a long list generic and specific to different strategies.- - -Most if not all cluster summary can be pushed as Metrics.- - -Restart cnt- - -Nimbus loss of leadership(?)- - -UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838])- - -Negative resource scheduling events ([https://jira.ouroath.com/browse/YSTORM-4940])- - -Excessive scheduling time (?)- Nimbus metrics have been moved to STORM-3144 > Extend metrics on Nimbus and LogViewer > -- > > Key: STORM-3133 > URL: https://issues.apache.org/jira/browse/STORM-3133 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Including but not limited to: > Logviewer > 1. Clean-up time > 2. Time to complete one clean-up loop. > 3. Disk usage by logs before and after a cleanup loop (just like GC?). > 4. Failures/exceptions. > 5. Search request Cnt: By category - Archived/non-archived > 6. Search Request - Response time > 7. Search Request - 0 result Cnt > 8. Search Result - open files > 9. File partial read count > 10. File Download request Cnt and Size served > 11. Disk IO by logviewer > 12. CPU usage (for unzipping files) > Nimbus Additional: > * Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. > * Scheduler related metrics would be a long list generic and specific to > different strategies. > * Most if not all cluster summary can be pushed as Metrics. 
> * Restart cnt > * Nimbus loss of leadership (?) > * UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838]) > * Negative resource scheduling events > ([https://jira.ouroath.com/browse/YSTORM-4940]) > * Excessive scheduling time (?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (STORM-3144) Extend metrics on Nimbus
[ https://issues.apache.org/jira/browse/STORM-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu closed STORM-3144. -- Resolution: Duplicate > Extend metrics on Nimbus > > > Key: STORM-3144 > URL: https://issues.apache.org/jira/browse/STORM-3144 > Project: Apache Storm > Issue Type: Improvement > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > Metrics include: > # File upload time > # Nimbus restart count > # Nimbus loss of leadership: meter marking when a nimbus node gains or loses > leadership > # Excessive scheduling time (both duration distribution and current longest) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3147) Port ClusterSummary as Metrics to StormMetricsRegistry
Zhengdai Hu created STORM-3147: -- Summary: Port ClusterSummary as Metrics to StormMetricsRegistry Key: STORM-3147 URL: https://issues.apache.org/jira/browse/STORM-3147 Project: Apache Storm Issue Type: New Feature Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3150) Improve Gauge Registration in StormMetricsRegistry
Zhengdai Hu created STORM-3150: -- Summary: Improve Gauge Registration in StormMetricsRegistry Key: STORM-3150 URL: https://issues.apache.org/jira/browse/STORM-3150 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 Make #registerGauge and #registerProvidedGauge generic and clean up other code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
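A hypothetical sketch of what a generic registerGauge might look like (signatures are assumptions, not the actual change): accept any Supplier<T> and adapt it to a Dropwizard Gauge<T>, so callers stop hand-writing anonymous Gauge classes.
{code:java}
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.function.Supplier;

public class GaugeUtilSketch {
    static final MetricRegistry REGISTRY = new MetricRegistry();

    // Generic registration: Gauge<T> has a single method, so a
    // Supplier<T> adapts to it with a method reference.
    static <T> Gauge<T> registerGauge(String name, Supplier<T> source) {
        Gauge<T> gauge = source::get;
        return REGISTRY.register(name, gauge);
    }

    public static void main(String[] args) {
        registerGauge("nimbus:num-supervisors", () -> 3);
        System.out.println(REGISTRY.getGauges().keySet());
        // prints [nimbus:num-supervisors]
    }
}
{code}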
[jira] [Created] (STORM-3151) Negative Scheduling Resource/Overscheduling issue
Zhengdai Hu created STORM-3151: -- Summary: Negative Scheduling Resource/Overscheduling issue Key: STORM-3151 URL: https://issues.apache.org/jira/browse/STORM-3151 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetricsRegistry. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3151) Negative Scheduling Resource/Overscheduling issue
[ https://issues.apache.org/jira/browse/STORM-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3151: --- Description: Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. was: Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. > Negative Scheduling Resource/Overscheduling issue > - > > Key: STORM-3151 > URL: https://issues.apache.org/jira/browse/STORM-3151 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > Possible overscheduling captured when the following steps are performed: > 1) launch nimbus & zookeeper > 2) launch supervisor 1 > 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) > 4) launch supervisor 2 > 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) > {noformat} > 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled > on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 > {noformat} > This indicates there may be issues inside the scheduler. > It was discovered when I ported ClusterSummary to StormMetrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3151) Negative Scheduling Resource/Overscheduling issue
[ https://issues.apache.org/jira/browse/STORM-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3151: --- Description: Possible overscheduling captured when the following steps are performed (logging is added in STORM-3147): 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. was: Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. > Negative Scheduling Resource/Overscheduling issue > - > > Key: STORM-3151 > URL: https://issues.apache.org/jira/browse/STORM-3151 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > Possible overscheduling captured when the following steps are performed > (logging is added in STORM-3147): > 1) launch nimbus & zookeeper > 2) launch supervisor 1 > 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) > 4) launch supervisor 2 > 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) > {noformat} > 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled > on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 > {noformat} > This indicates there may be issues inside the scheduler. > It was discovered when I ported ClusterSummary to StormMetrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3157) General improvement to StormMetricsRegistry
Zhengdai Hu created STORM-3157: -- Summary: General improvement to StormMetricsRegistry Key: STORM-3157 URL: https://issues.apache.org/jira/browse/STORM-3157 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The solution contains general improvements and cleanup to StormMetricsRegistry. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3157) General improvement to StormMetricsRegistry
[ https://issues.apache.org/jira/browse/STORM-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3157: --- Description: The solution contains general improvements and cleanup to StormMetricsRegistry. Therefore this may affect all current and future changes to server-side metrics. (was: The solution contains general improvements and cleanup to StormMetricsRegistry.) > General improvement to StormMetricsRegistry > --- > > Key: STORM-3157 > URL: https://issues.apache.org/jira/browse/STORM-3157 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > The solution contains general improvements and cleanup to > StormMetricsRegistry. Therefore this may affect all current and future > changes to server-side metrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3159) Fixed potential file resource leak
Zhengdai Hu created STORM-3159: -- Summary: Fixed potential file resource leak Key: STORM-3159 URL: https://issues.apache.org/jira/browse/STORM-3159 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 `zipFileSize()` in ServerUtils is not correctly wrapped in a try-with-resources block, which could lead to a resource leak. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
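The fix pattern is ordinary try-with-resources. A minimal sketch (the method body is a placeholder, not the actual ServerUtils implementation):
{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ZipFileSizeSketch {
    // Declaring the file handle in the try-with-resources header guarantees it
    // is closed even if an exception is thrown mid-read, so no handle leaks.
    public static long zipFileSize(File myFile) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(myFile, "r")) {
            return raf.length(); // placeholder for the real size computation
        }
    }
}
{code}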
[jira] [Created] (STORM-3162) Race condition at updateHeartbeatCache
Zhengdai Hu created STORM-3162: -- Summary: Race condition at updateHeartbeatCache Key: STORM-3162 URL: https://issues.apache.org/jira/browse/STORM-3162 Project: Apache Storm Issue Type: Bug Components: storm-client, storm-core, storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 This was discovered during testing for STORM-3133. The Travis-CI log can be found [here|https://travis-ci.org/apache/storm/jobs/408719153]. Specifically, updateHeartbeatCache can be invoked both by Nimbus (at `Nimbus#updateHeartBeats`) and by the Supervisor (at `Nimbus#updateCachedHeartbeatsFromWorker` and `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a ConcurrentModificationException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3162) Race condition at updateHeartbeatCache
[ https://issues.apache.org/jira/browse/STORM-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3162: --- Description: This was discovered during testing for STORM-3133. The Travis-CI log can be found [here|https://travis-ci.org/apache/storm/jobs/408719153#L1897]. Specifically, updateHeartbeatCache can be invoked both by Nimbus (at `Nimbus#updateHeartBeats`) and by the Supervisor (at `Nimbus#updateCachedHeartbeatsFromWorker` and `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a ConcurrentModificationException. was: This was discovered during testing for STORM-3133. The Travis-CI log can be found [here|https://travis-ci.org/apache/storm/jobs/408719153]. Specifically, updateHeartbeatCache can be invoked both by Nimbus (at `Nimbus#updateHeartBeats`) and by the Supervisor (at `Nimbus#updateCachedHeartbeatsFromWorker` and `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a ConcurrentModificationException. > Race condition at updateHeartbeatCache > -- > > Key: STORM-3162 > URL: https://issues.apache.org/jira/browse/STORM-3162 > Project: Apache Storm > Issue Type: Bug > Components: storm-client, storm-core, storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > This was discovered during testing for STORM-3133. The Travis-CI log can be found > [here|https://travis-ci.org/apache/storm/jobs/408719153#L1897]. > Specifically, updateHeartbeatCache can be invoked both by Nimbus (at > `Nimbus#updateHeartBeats`) and by the Supervisor (at > `Nimbus#updateCachedHeartbeatsFromWorker` and > `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a > ConcurrentModificationException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
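One way to make such a shared cache safe for both callers is a concurrent map, sketched below with hypothetical names (the real heartbeat cache's types and the eventual fix may differ):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatCacheSketch {
    // ConcurrentHashMap tolerates concurrent writers, and its iterators are
    // weakly consistent, so a sweep on one thread does not throw
    // ConcurrentModificationException while another thread is updating.
    private final Map<String, Long> lastBeatMs = new ConcurrentHashMap<>();

    public void updateFromWorker(String executorId, long beatMs) {
        lastBeatMs.put(executorId, beatMs);
    }

    public void expireStale(long nowMs, long timeoutMs) {
        lastBeatMs.entrySet().removeIf(e -> nowMs - e.getValue() > timeoutMs);
    }
}
{code}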
[jira] [Created] (STORM-3169) Misleading logviewer.cleanup.age.min
Zhengdai Hu created STORM-3169: -- Summary: Misleading logviewer.cleanup.age.min Key: STORM-3169 URL: https://issues.apache.org/jira/browse/STORM-3169 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The config logviewer.cleanup.age.min specifies the duration in minutes that must pass since a log file was last modified before we consider the log old. However, in actual use it is subtracted from nowMillis, the current time in milliseconds. We should convert the minutes to milliseconds for it to function correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
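A minimal sketch of the unit fix (names are illustrative, not the actual logviewer code):
{code:java}
import java.io.File;
import java.util.concurrent.TimeUnit;

public class CleanupAgeSketch {
    // The configured age is in minutes, so it must be converted to milliseconds
    // before being compared against file modification times, which are in
    // milliseconds since the epoch.
    public static boolean isOldEnoughToClean(File logFile, long cleanupAgeMins, long nowMillis) {
        long cutoffMillis = nowMillis - TimeUnit.MINUTES.toMillis(cleanupAgeMins);
        return logFile.lastModified() <= cutoffMillis;
    }
}
{code}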
[jira] [Created] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
Zhengdai Hu created STORM-3170: -- Summary: DirectoryCleaner may not correctly report correct number of deleted files Key: STORM-3170 URL: https://issues.apache.org/jira/browse/STORM-3170 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
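A minimal sketch of the fix pattern (illustrative, not the actual DirectoryCleaner code): only count a file as deleted when File#delete actually returns true.
{code:java}
import java.io.File;
import java.util.List;

public class DeleteCountSketch {
    // Returns the number of files that were actually removed, so any metric
    // built on the count reflects reality instead of assuming success.
    public static int deleteAll(List<File> files) {
        int deleted = 0;
        for (File f : files) {
            if (f.delete()) {
                deleted++;
            } else {
                System.err.println("Failed to delete " + f.getPath());
            }
        }
        return deleted;
    }
}
{code}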
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this. (was: The original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3169) Misleading logviewer.cleanup.age.min
[ https://issues.apache.org/jira/browse/STORM-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3169: --- Description: The config logviewer.cleanup.age.min specifies the duration in minutes that must pass since a log file was last modified before we consider the log old. However, in actual use it is subtracted from nowMillis, the current time in milliseconds. We should convert it to milliseconds. (was: The config logviewer.cleanup.age.min specifies the duration in minutes that must pass since a log file was last modified before we consider the log old. However, in actual use it is subtracted from nowMillis, the current time in milliseconds. We should convert the minutes to milliseconds for it to function correctly.) > Misleading logviewer.cleanup.age.min > > > Key: STORM-3169 > URL: https://issues.apache.org/jira/browse/STORM-3169 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The config logviewer.cleanup.age.min specifies the duration in minutes that > must pass since a log file was last modified before we consider the log old. > However, in actual use it is subtracted from nowMillis, the current time in > milliseconds. We should convert it to milliseconds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this. (was: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This prevents DirectoryCleaner from cleaning up > other files and invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this. (was: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This prevents DirectoryCleaner from cleaning up > other files and invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this. (was: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This prevents DirectoryCleaner from cleaning up > other files and invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3173) flush metrics to Yamas on shutdown
Zhengdai Hu created STORM-3173: -- Summary: flush metrics to Yamas on shutdown Key: STORM-3173 URL: https://issues.apache.org/jira/browse/STORM-3173 Project: Apache Storm Issue Type: Improvement Components: storm-server, storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 We lose shutdown-related metrics that we should alert on. We should flush metrics on shutdown. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3173) flush metrics to Yamas on shutdown
[ https://issues.apache.org/jira/browse/STORM-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3173: --- Description: We lose shutdown-related metrics that we should alert on. We should flush metrics on shutdown. https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4497 was: We lose shutdown-related metrics that we should alert on. We should flush metrics on shutdown. > flush metrics to Yamas on shutdown > -- > > Key: STORM-3173 > URL: https://issues.apache.org/jira/browse/STORM-3173 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server, storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > We lose shutdown-related metrics that we should alert on. We > should flush metrics on shutdown. > https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4497 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3173) flush metrics to ScheduledReporter on shutdown
[ https://issues.apache.org/jira/browse/STORM-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3173: --- Summary: flush metrics to ScheduledReporter on shutdown (was: flush metrics to Yamas on shutdown) > flush metrics to ScheduledReporter on shutdown > -- > > Key: STORM-3173 > URL: https://issues.apache.org/jira/browse/STORM-3173 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server, storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > We lose shutdown-related metrics that we should alert on. We > should flush metrics on shutdown. > https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4497 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
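A minimal sketch of the idea, using the Dropwizard ScheduledReporter API (the ConsoleReporter and the hook wiring are illustrative; Storm's actual reporters and shutdown path may differ):
{code:java}
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.ScheduledReporter;
import java.util.concurrent.TimeUnit;

public class FlushOnShutdownSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        ScheduledReporter reporter = ConsoleReporter.forRegistry(registry).build();
        reporter.start(10, TimeUnit.SECONDS);

        // Before the JVM exits, force one final synchronous report so metrics
        // recorded since the last scheduled flush are not lost.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            reporter.report();
            reporter.stop();
        }));
    }
}
{code}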
[jira] [Created] (STORM-3177) MockRemovableFile returns true on `#exists` even after `#delete` is called.
Zhengdai Hu created STORM-3177: -- Summary: MockRemovableFile returns true on `#exists` even after `#delete` is called. Key: STORM-3177 URL: https://issues.apache.org/jira/browse/STORM-3177 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 See conversation in https://github.com/apache/storm/pull/2788#pullrequestreview-142918985 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
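The expected behavior, as a hypothetical sketch (the real MockRemovableFile API may differ): once #delete has been called, #exists should return false.
{code:java}
// Hypothetical sketch of the corrected mock, not the actual test class.
public class MockRemovableFileSketch {
    private boolean deleted = false;

    public boolean delete() {
        deleted = true;
        return true;
    }

    public boolean exists() {
        return !deleted; // previously this returned true unconditionally
    }
}
{code}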
[jira] [Created] (STORM-3178) Decouple `ClientSupervisorUtils` and refactor metrics registration
Zhengdai Hu created STORM-3178: -- Summary: Decouple `ClientSupervisorUtils` and refactor metrics registration Key: STORM-3178 URL: https://issues.apache.org/jira/browse/STORM-3178 Project: Apache Storm Issue Type: Improvement Components: storm-client, storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu See conversation https://github.com/apache/storm/pull/2710#discussion_r207576736 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3177) MockRemovableFile returns true on `#exists` even after `#delete` is called.
[ https://issues.apache.org/jira/browse/STORM-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3177: --- Fix Version/s: (was: 2.0.0) > MockRemovableFile returns true on `#exists` even after `#delete` is called. > --- > > Key: STORM-3177 > URL: https://issues.apache.org/jira/browse/STORM-3177 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > > See conversation in > https://github.com/apache/storm/pull/2788#pullrequestreview-142918985 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3128: --- Fix Version/s: (was: 2.0.0) > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3128: --- Priority: Major (was: Minor) > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572180#comment-16572180 ] Zhengdai Hu commented on STORM-3128: It looks like not all Zookeeper calls are stubbed correctly. Exceptions are thrown when LocalFsBlobStore#prepare gets called, which calls BlobStoreUtils.createZKClient and ClusterUtils.mkStormClusterState, which then issue connections to zookeeper under the hood. The fact that the exception is suppressed suggests that this is a common issue, but it may be hard to fix. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572180#comment-16572180 ] Zhengdai Hu edited comment on STORM-3128 at 8/7/18 7:10 PM: It looks like not all Zookeeper calls are stubbed correctly. Exceptions are thrown when LocalFsBlobStore#prepare gets called, which calls BlobStoreUtils.createZKClient and ClusterUtils.mkStormClusterState, which then issue connections to zookeeper under the hood. The fact that the exception is suppressed suggests that this is a common issue, but it may be hard to fix. [~Srdo] was (Author: zhengdai): It looks like not all Zookeeper calls are stubbed correctly. Exceptions are thrown when LocalFsBlobStore#prepare gets called, which calls BlobStoreUtils.createZKClient and ClusterUtils.mkStormClusterState, which then issue connections to zookeeper under the hood. The fact that the exception is suppressed suggests that this is a common issue, but it may be hard to fix. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572193#comment-16572193 ] Zhengdai Hu commented on STORM-3128: It doesn't, but it's likely to cause a timeout error. If I remember correctly, StormTimer terminates the process in case of a timeout, with code 20, "Error when processing event", which was what I saw in the test failure message. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572193#comment-16572193 ] Zhengdai Hu edited comment on STORM-3128 at 8/7/18 7:25 PM: It doesn't, but it's likely to cause a timeout error. If I remember correctly, StormTimer terminates the process in case of a timeout, with code 20, "Error when processing event", which was what I saw in the test failure message. [~Srdo] was (Author: zhengdai): It doesn't, but it's likely to cause a timeout error. If I remember correctly, StormTimer terminates the process in case of a timeout, with code 20, "Error when processing event", which was what I saw in the test failure message. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572202#comment-16572202 ] Zhengdai Hu commented on STORM-3128: I might be wrong. Let me take a look at StormTimer again. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but failed to connect to zookeeper due to connection error. I'm not > sure if this compromises the test even though it is passed after connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3128: --- Description: In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created and tries but fails to connect to zookeeper due to a connection error. I'm not sure if this compromises the test even though it passes after the connection retry timeout. But it's nice to keep in mind. {noformat} 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) 2018-06-27 13:05:28.032 [main] INFO org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_171] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_171] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] {noformat} I managed to track down the source where the exception is thrown, but it's really strange that it is called by a StormTimer inside the Supervisor, which is not declared anywhere in this test. I'm completely lost by now.
{noformat} 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR org.apache.storm.zookeeper.ClientZookeeper - e: {} org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /supervisors at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:145) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:292) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.zookeeper.ClientZookeeper.mkdirs(ClientZookeeper.java:70) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.cluster.ZKStateStorage.set_ephemeral_node(ZKStateStorage.java:129) [storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.cluster.StormClusterStateImpl.supervisorHeartbeat(StormClusterStateImpl.java:522) [storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.daemon.supervisor.timer.SupervisorHeartbeat.run(SupervisorHeartbeat.java:96) [classes/:?] at org.apache.storm.StormTimer$1.run(StormTimer.java:110) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:226) [st
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573502#comment-16573502 ] Zhengdai Hu edited comment on STORM-3128 at 8/8/18 4:59 PM: I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. [~Srdo] was (Author: zhengdai): I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} > I managed to track down the source where the exception is thrown, but it's > really strange that it is called by a StormTimer inside the Supervisor, which > is not declared anywhere in this test. I'm completely lost by now.
> {noformat} > 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR > org.apache.storm.zookeeper.ClientZookeeper - e: {} > org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /supervisors > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] >
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573502#comment-16573502 ] Zhengdai Hu edited comment on STORM-3128 at 8/8/18 4:59 PM: I added the stack trace of the exception that crashed the test. It looks like other build failures are likely to be caused by the same error. It's really strange though. [~Srdo] was (Author: zhengdai): I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. [~Srdo] > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} > I managed to track down the source where the exception is thrown, but it's > really strange that it is called by a StormTimer inside the Supervisor, which > is not declared anywhere in this test. I'm completely lost by now.
> {noformat} > 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR > org.apache.storm.zookeeper.ClientZookeeper - e: {} > org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /supervisors > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNA
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573502#comment-16573502 ] Zhengdai Hu commented on STORM-3128: I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} > I managed to track down the source where the exception is thrown, but it's > really strange that it is called by a StormTimer inside the Supervisor, which > is not declared anywhere in this test. I'm completely lost by now.
> {noformat} > 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR > org.apache.storm.zookeeper.ClientZookeeper - e: {} > org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /supervisors > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:145) > [storm-client-2.0.0-SNAPSHOT.jar:?] > at > org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:292) > [storm-client-2.0.0-
[jira] [Created] (STORM-3186) Customizable configuration for metric reporting interval
Zhengdai Hu created STORM-3186: -- Summary: Customizable configuration for metric reporting interval Key: STORM-3186 URL: https://issues.apache.org/jira/browse/STORM-3186 Project: Apache Storm Issue Type: Improvement Components: storm-server, storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
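For illustration, a minimal sketch of what a configurable interval could look like, written against the Dropwizard Metrics ScheduledReporter API; the config key name and the 10-second fallback are assumptions for illustration, not the actual patch.
{code:java}
import java.util.Map;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;

public class ConfigurableReporterSketch {
    // Hypothetical config key; the real name would be settled in review.
    private static final String REPORT_INTERVAL_SECS = "storm.daemon.metrics.reporter.interval.secs";

    public static void startReporter(MetricRegistry registry, Map<String, Object> conf) {
        // Fall back to the previously hard-coded 10 seconds when the key is absent.
        Object raw = conf.get(REPORT_INTERVAL_SECS);
        long intervalSecs = (raw instanceof Number) ? ((Number) raw).longValue() : 10L;

        // ConsoleReporter stands in here for any ScheduledReporter subclass.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry).build();
        reporter.start(intervalSecs, TimeUnit.SECONDS);
    }
}
{code}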
[jira] [Updated] (STORM-3186) Customizable configuration for metric reporting interval
[ https://issues.apache.org/jira/browse/STORM-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3186: --- Description: In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs. See discussion https://github.com/apache/storm/pull/2764#discussion_r203726617

was: In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs.

> Customizable configuration for metric reporting interval
>
> Key: STORM-3186
> URL: https://issues.apache.org/jira/browse/STORM-3186
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server, storm-webapp
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs.
> See discussion https://github.com/apache/storm/pull/2764#discussion_r203726617
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3186) Customizable configuration for metric reporting interval
[ https://issues.apache.org/jira/browse/STORM-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575004#comment-16575004 ] Zhengdai Hu commented on STORM-3186: See discussion https://github.com/apache/storm/pull/2764#discussion_r203726617

> Customizable configuration for metric reporting interval
>
> Key: STORM-3186
> URL: https://issues.apache.org/jira/browse/STORM-3186
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server, storm-webapp
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3187) Nimbus code refactoring and cleanup
Zhengdai Hu created STORM-3187: -- Summary: Nimbus code refactoring and cleanup Key: STORM-3187 URL: https://issues.apache.org/jira/browse/STORM-3187 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu
Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion-diff-208707855R2230
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3187) Nimbus code refactoring and cleanup
[ https://issues.apache.org/jira/browse/STORM-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3187: --- Description:
Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion_r208707855
3. Other performance improvements: https://github.com/apache/storm/pull/2764#discussion_r208714561

was:
Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion-diff-208707855R2230

> Nimbus code refactoring and cleanup
> ---
>
> Key: STORM-3187
> URL: https://issues.apache.org/jira/browse/STORM-3187
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
> 1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
> 2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion_r208707855
> 3. Other performance improvements: https://github.com/apache/storm/pull/2764#discussion_r208714561
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
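To illustrate suggestion 1, a hedged sketch (hypothetical class and metric names; the actual call sites are in the linked review threads) of replacing a log line that exists only to count events with a meter:
{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class LoggingToMetricsSketch {
    private static final MetricRegistry REGISTRY = new MetricRegistry();
    // Hypothetical metric name, for illustration only.
    private static final Meter SUBMITTED = REGISTRY.meter("nimbus:num-topology-submissions");

    void onTopologySubmitted() {
        // Before: LOG.info("Received topology submission"), whose only purpose
        // was counting submissions. After: mark a meter instead, so the count
        // is reported and aggregated like any other metric.
        SUBMITTED.mark();
    }
}
{code}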
[jira] [Created] (STORM-3188) Removing try-catch block from getAndResetWorkerHeartbeats
Zhengdai Hu created STORM-3188: -- Summary: Removing try-catch block from getAndResetWorkerHeartbeats Key: STORM-3188 URL: https://issues.apache.org/jira/browse/STORM-3188 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu After refactoring, SupervisorUtils.readWorkerHeartbeats no longer throws checked exceptions. I'm wondering whether we still want to keep the try-catch block wrapping its invocation in getAndResetWorkerHeartbeats in ReportWorkerHeartbeats.java. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
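A toy illustration of the cleanup being proposed (hypothetical signatures; the real code lives in storm-server):
{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HeartbeatSketch {
    // After the refactor, the utility no longer declares any checked exception...
    static Map<String, Object> readWorkerHeartbeats(Map<String, Object> conf) {
        return Collections.emptyMap();
    }

    // ...so the caller can invoke it directly; the old try { ... } catch (Exception e)
    // wrapper around this call no longer guards anything the compiler requires.
    static Map<String, Object> getAndResetWorkerHeartbeats() {
        return readWorkerHeartbeats(new HashMap<>());
    }
}
{code}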
[jira] [Created] (STORM-3189) Remove unused data file LogViewer api
Zhengdai Hu created STORM-3189: -- Summary: Remove unused data file LogViewer api Key: STORM-3189 URL: https://issues.apache.org/jira/browse/STORM-3189 Project: Apache Storm Issue Type: Improvement Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu
Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
- `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
- If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
- If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3189) Remove unused data file LogViewer api
[ https://issues.apache.org/jira/browse/STORM-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3189: --- Description:
Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
- `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
- If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
- If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).
See
https://github.com/apache/storm/pull/2754#discussion_r208691016
https://github.com/apache/storm/pull/2754#discussion_r208726809

was:
Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
- `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
- If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
- If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).

> Remove unused data file LogViewer api
> -
>
> Key: STORM-3189
> URL: https://issues.apache.org/jira/browse/STORM-3189
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-webapp
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
> - `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
> - If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
> - If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).
> See
> https://github.com/apache/storm/pull/2754#discussion_r208691016
> https://github.com/apache/storm/pull/2754#discussion_r208726809
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
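To make the described semantics concrete, here is a toy model of the fileOffset bookkeeping (this is not the real LogviewerLogSearchHandler code; for simplicity it counts matching files rather than individual matches):
{code:java}
import java.util.Arrays;
import java.util.List;

public class FileOffsetSketch {
    /**
     * Scans files[skip..], counting files that contain {@code target}, and
     * returns the outgoing fileOffset under the semantics described above.
     */
    static int search(List<String> files, int skip, String target, int wanted) {
        int matches = 0;
        for (int i = skip; i < files.size(); i++) {
            if (files.get(i).contains(target) && ++matches == wanted) {
                return i;          // enough matches: index of the last scanned file
            }
        }
        return files.size();       // too few matches: one past the last file index
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList("err err", "ok", "err");
        System.out.println(search(logs, 0, "err", 1)); // 0: the first file satisfies the search
        System.out.println(search(logs, 1, "err", 5)); // 3: scanned to the end, too few matches
    }
}
{code}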
[jira] [Commented] (STORM-3190) Unnecessary null check of directory stream in LogCleaner
[ https://issues.apache.org/jira/browse/STORM-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576392#comment-16576392 ] Zhengdai Hu commented on STORM-3190: We can further simplify this.
{code:java}
private long lastModifiedTimeWorkerLogdir(File logDir) {
    long dirModified = logDir.lastModified();
    try (DirectoryStream<Path> dirStream = directoryCleaner.getStreamForDirectory(logDir)) {
        if (dirStream != null) {
            try {
                return StreamSupport.stream(dirStream.spliterator(), false)
                    .reduce(dirModified, (maximum, path) -> {
                        long curr = path.toFile().lastModified();
                        return curr > maximum ? curr : maximum;
                    }, BinaryOperator.maxBy(Long::compareTo));
            } catch (Exception ex) {
                LOG.error(ex.getMessage(), ex);
            }
        }
    } catch (IOException ignored) {}
    return dirModified;
}
{code}
> Unnecessary null check of directory stream in LogCleaner
>
> Key: STORM-3190
> URL: https://issues.apache.org/jira/browse/STORM-3190
> Project: Apache Storm
> Issue Type: Task
> Components: storm-webapp
> Affects Versions: 2.0.0
> Reporter: Stig Rohde Døssing
> Priority: Trivial
>
> This should be using try-with-resources
> https://github.com/apache/storm/blob/a1b3e02aab57b4e458b8b5763a0d467852906bb7/storm-webapp/src/main/java/org/apache/storm/daemon/logviewer/utils/LogCleaner.java#L263
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3190) Unnecessary null check of directory stream in LogCleaner
[ https://issues.apache.org/jira/browse/STORM-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576392#comment-16576392 ] Zhengdai Hu edited comment on STORM-3190 at 8/10/18 3:18 PM: - We can further simplify this.
{code:java}
private long lastModifiedTimeWorkerLogdir(File logDir) {
    long dirModified = logDir.lastModified();
    try (DirectoryStream<Path> dirStream = directoryCleaner.getStreamForDirectory(logDir)) {
        if (dirStream != null) {
            try {
                return StreamSupport.stream(dirStream.spliterator(), false)
                    .mapToLong(path -> path.toFile().lastModified())
                    .reduce(dirModified, Math::max);
            } catch (Exception ex) {
                LOG.error(ex.getMessage(), ex);
            }
        }
    } catch (IOException ignored) {}
    return dirModified;
}
{code}
was (Author: zhengdai): We can further simplify this.
{code:java}
private long lastModifiedTimeWorkerLogdir(File logDir) {
    long dirModified = logDir.lastModified();
    try (DirectoryStream<Path> dirStream = directoryCleaner.getStreamForDirectory(logDir)) {
        if (dirStream != null) {
            try {
                return StreamSupport.stream(dirStream.spliterator(), false)
                    .reduce(dirModified, (maximum, path) -> {
                        long curr = path.toFile().lastModified();
                        return curr > maximum ? curr : maximum;
                    }, BinaryOperator.maxBy(Long::compareTo));
            } catch (Exception ex) {
                LOG.error(ex.getMessage(), ex);
            }
        }
    } catch (IOException ignored) {}
    return dirModified;
}
{code}
> Unnecessary null check of directory stream in LogCleaner
>
> Key: STORM-3190
> URL: https://issues.apache.org/jira/browse/STORM-3190
> Project: Apache Storm
> Issue Type: Task
> Components: storm-webapp
> Affects Versions: 2.0.0
> Reporter: Stig Rohde Døssing
> Priority: Trivial
>
> This should be using try-with-resources
> https://github.com/apache/storm/blob/a1b3e02aab57b4e458b8b5763a0d467852906bb7/storm-webapp/src/main/java/org/apache/storm/daemon/logviewer/utils/LogCleaner.java#L263
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3191) Migrate more items from
Zhengdai Hu created STORM-3191: -- Summary: Migrate more items from Key: STORM-3191 URL: https://issues.apache.org/jira/browse/STORM-3191 Project: Apache Storm Issue Type: Improvement Reporter: Zhengdai Hu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Summary: Migrate more items from ClusterSummary to metrics (was: Migrate more items from ) > Migrate more items from ClusterSummary to metrics > - > > Key: STORM-3191 > URL: https://issues.apache.org/jira/browse/STORM-3191 > Project: Apache Storm > Issue Type: Improvement >Reporter: Zhengdai Hu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Priority: Minor (was: Major)

> Migrate more items from ClusterSummary to metrics
> -
>
> Key: STORM-3191
> URL: https://issues.apache.org/jira/browse/STORM-3191
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Zhengdai Hu
> Priority: Minor
>
> The following summary items haven't been ported as nimbus metrics yet.
> // Declared in StormConf. I don't see the value in reporting it.
> SUPERVISOR_TOTAL_RESOURCE,
> // May be able to aggregate based on status:
> TOPOLOGY_STATUS,
> TOPOLOGY_SCHED_STATUS,
> // May be aggregated, e.g., as distinct values:
> NUM_DISTINCT_NIMBUS_VERSION;
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Description:
The following summary items haven't been ported as nimbus metrics yet.
// Declared in StormConf. I don't see the value in reporting it.
SUPERVISOR_TOTAL_RESOURCE,
// May be able to aggregate based on status:
TOPOLOGY_STATUS,
TOPOLOGY_SCHED_STATUS,
// May be aggregated, e.g., as distinct values:
NUM_DISTINCT_NIMBUS_VERSION;

> Migrate more items from ClusterSummary to metrics
> -
>
> Key: STORM-3191
> URL: https://issues.apache.org/jira/browse/STORM-3191
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Zhengdai Hu
> Priority: Major
>
> The following summary items haven't been ported as nimbus metrics yet.
> // Declared in StormConf. I don't see the value in reporting it.
> SUPERVISOR_TOTAL_RESOURCE,
> // May be able to aggregate based on status:
> TOPOLOGY_STATUS,
> TOPOLOGY_SCHED_STATUS,
> // May be aggregated, e.g., as distinct values:
> NUM_DISTINCT_NIMBUS_VERSION;
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Description:
The following summary items haven't been ported as nimbus metrics yet.
// Declared in StormConf. I don't see the value in reporting it.
SUPERVISOR_TOTAL_RESOURCE,
// May be able to aggregate based on status:
TOPOLOGY_STATUS,
TOPOLOGY_SCHED_STATUS,
// May be aggregated, e.g., as distinct values:
NUM_DISTINCT_NIMBUS_VERSION;

was:
The following summary items haven't been ported as nimbus metrics yet.
// Declared in StormConf. I don't see the value in reporting it.
SUPERVISOR_TOTAL_RESOURCE,
// May be able to aggregate based on status:
TOPOLOGY_STATUS,
TOPOLOGY_SCHED_STATUS,
// May be aggregated, e.g., as distinct values:
NUM_DISTINCT_NIMBUS_VERSION;

> Migrate more items from ClusterSummary to metrics
> -
>
> Key: STORM-3191
> URL: https://issues.apache.org/jira/browse/STORM-3191
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Zhengdai Hu
> Priority: Minor
>
> The following summary items haven't been ported as nimbus metrics yet.
> // Declared in StormConf. I don't see the value in reporting it.
> SUPERVISOR_TOTAL_RESOURCE,
> // May be able to aggregate based on status:
> TOPOLOGY_STATUS,
> TOPOLOGY_SCHED_STATUS,
> // May be aggregated, e.g., as distinct values:
> NUM_DISTINCT_NIMBUS_VERSION;
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
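For the "distinct values" aggregation suggested for NUM_DISTINCT_NIMBUS_VERSION, a gauge sketch could look like this (the metric name and the versions supplier are assumptions for illustration):
{code:java}
import java.util.Set;

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class DistinctNimbusVersionsSketch {
    static void register(MetricRegistry registry, Set<String> liveNimbusVersions) {
        // Expose only the count of distinct versions; maintaining the set of
        // versions reported by live nimbuses is assumed to happen elsewhere.
        registry.register("nimbuses:num-distinct-versions",
                          (Gauge<Integer>) liveNimbusVersions::size);
    }
}
{code}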
[jira] [Created] (STORM-3193) Improve LogviewerLogSearchHandler
Zhengdai Hu created STORM-3193: -- Summary: Improve LogviewerLogSearchHandler Key: STORM-3193 URL: https://issues.apache.org/jira/browse/STORM-3193 Project: Apache Storm Issue Type: Improvement Reporter: Zhengdai Hu
One thing worth noticing: Storm UI currently interweaves different search APIs for its search functionality, which is kind of confusing. Specifically:
- For the search button on the homepage, it uses a single deep-search API call to search all ports (server-side processing), both archived and non-archived.
- For a non-archived search on a specific topology page, it invokes the search API on each port inside a loop (client-side processing).
- For an archived search on a specific topology page, it invokes the deep-search API (search-archived=on) on each port inside a loop (client-side processing).
As a result, metrics for these APIs may not accurately reflect how many searches are invoked from the client's perspective. Additionally, `findNMatches` can be simplified, along with STORM-3189.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
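A small sketch of why the per-call numbers can mislead (hypothetical metric name): when a topology page loops over worker ports on the client side, one user-initiated search marks the API meter once per port, so the meter over-counts searches as users perceive them.
{code:java}
import java.util.Arrays;
import java.util.List;

import com.codahale.metrics.MetricRegistry;

public class SearchMetricsSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        List<Integer> workerPorts = Arrays.asList(6700, 6701, 6702);

        // One search from the user's perspective...
        for (int port : workerPorts) {
            // ...but each per-port request marks the shared meter once.
            registry.meter("logviewer:num-search-requests").mark();
        }

        // Prints 3, not 1.
        System.out.println(registry.meter("logviewer:num-search-requests").getCount());
    }
}
{code}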