[jira] [Created] (STORM-3092) Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
Zhengdai Hu created STORM-3092:
------------------------------

Summary: Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
Key: STORM-3092
URL: https://issues.apache.org/jira/browse/STORM-3092
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Fix For: 2.0.0


The bug was introduced in commit 0dac58b0aa82133df242b3b2ebeb65bfea7d63cc, when the launchSupervisorThriftServer method was invoked in the launchDaemon method of the Supervisor class. launchSupervisorThriftServer() makes a blocking call to the Thrift server under the hood, which prevents Utils.addShutdownHookWithForceKillIn1Sec and StormMetricsRegistry.startMetricsReporters from ever being called.

The bug can be fixed by moving launchSupervisorThriftServer to the end of the code block.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
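For reference, a minimal sketch of the proposed reordering. The launchDaemon body below is simplified and only illustrates the ordering; the real method does more, and only the three method names cited in the description above are taken from the report:

{code:java}
private void launchDaemon() {
    try {
        // ... existing supervisor setup ...
        launch();
        // Register the shutdown hook and start the metrics reporters while the
        // calling thread can still make progress.
        Utils.addShutdownHookWithForceKillIn1Sec(this::close);
        StormMetricsRegistry.startMetricsReporters(conf);
        // This call blocks the calling thread on the Thrift server, so it has
        // to be the last statement: anything placed after it is never reached.
        launchSupervisorThriftServer(conf);
    } catch (Exception e) {
        LOG.error("Failed to start supervisor", e);
        throw new RuntimeException(e);
    }
}
{code}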
[jira] [Resolved] (STORM-3092) Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
[ https://issues.apache.org/jira/browse/STORM-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu resolved STORM-3092.
Resolution: Fixed

> Metrics Reporter and Shutdown Hook on Supervisor not properly set up at launchDaemon
> -------------------------------------------------------------------------------------
>
> Key: STORM-3092
> URL: https://issues.apache.org/jira/browse/STORM-3092
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Original Estimate: 24h
> Time Spent: 10m
> Remaining Estimate: 23h 50m
>
> The bug was introduced in commit 0dac58b0aa82133df242b3b2ebeb65bfea7d63cc, when the launchSupervisorThriftServer method was invoked in the launchDaemon method of the Supervisor class. launchSupervisorThriftServer() makes a blocking call to the Thrift server under the hood, which prevents Utils.addShutdownHookWithForceKillIn1Sec and StormMetricsRegistry.startMetricsReporters from ever being called.
>
> The bug can be fixed by moving launchSupervisorThriftServer to the end of the code block.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (STORM-3098) Fix bug in filterChangingBlobsFor() in Slot.java
Zhengdai Hu created STORM-3098:
------------------------------

Summary: Fix bug in filterChangingBlobsFor() in Slot.java
Key: STORM-3098
URL: https://issues.apache.org/jira/browse/STORM-3098
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Fix For: 2.0.0


The following method is not implemented correctly:

{code:java}
private static DynamicState filterChangingBlobsFor(DynamicState dynamicState, final LocalAssignment assignment) {
    if (!dynamicState.changingBlobs.isEmpty()) {
        return dynamicState;
    }

    HashSet<BlobChanging> savedBlobs = new HashSet<>(dynamicState.changingBlobs.size());
    for (BlobChanging rc : dynamicState.changingBlobs) {
        if (forSameTopology(assignment, rc.assignment)) {
            savedBlobs.add(rc);
        } else {
            rc.latch.countDown();
        }
    }
    return dynamicState.withChangingBlobs(savedBlobs);
}
{code}

It never modifies dynamicState in any way: the inverted guard returns early exactly when there are changing blobs to filter, and when the set is empty the loop has nothing to iterate over. The solution is to remove the negation in the first if statement.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
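For reference, a sketch of the corrected guard; the rest of the method is unchanged:

{code:java}
// Nothing is changing, so there is nothing to filter and no latch to count down.
if (dynamicState.changingBlobs.isEmpty()) {
    return dynamicState;
}
{code}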
[jira] [Assigned] (STORM-3098) Fix bug in filterChangingBlobsFor() in Slot.java
[ https://issues.apache.org/jira/browse/STORM-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu reassigned STORM-3098:
--
Assignee: Zhengdai Hu

> Fix bug in filterChangingBlobsFor() in Slot.java
> ------------------------------------------------
>
> Key: STORM-3098
> URL: https://issues.apache.org/jira/browse/STORM-3098
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
> Fix For: 2.0.0
>
>
> The following method is not implemented correctly:
> {code:java}
> private static DynamicState filterChangingBlobsFor(DynamicState dynamicState, final LocalAssignment assignment) {
>     if (!dynamicState.changingBlobs.isEmpty()) {
>         return dynamicState;
>     }
>
>     HashSet<BlobChanging> savedBlobs = new HashSet<>(dynamicState.changingBlobs.size());
>     for (BlobChanging rc : dynamicState.changingBlobs) {
>         if (forSameTopology(assignment, rc.assignment)) {
>             savedBlobs.add(rc);
>         } else {
>             rc.latch.countDown();
>         }
>     }
>     return dynamicState.withChangingBlobs(savedBlobs);
> }
> {code}
> It never modifies dynamicState in any way: the inverted guard returns early exactly when there are changing blobs to filter, and when the set is empty the loop has nothing to iterate over. The solution is to remove the negation in the first if statement.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (STORM-3099) Extend metrics on supervisor and workers
Zhengdai Hu created STORM-3099:
------------------------------

Summary: Extend metrics on supervisor and workers
Key: STORM-3099
URL: https://issues.apache.org/jira/browse/STORM-3099
Project: Apache Storm
Issue Type: Improvement
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Assignee: Zhengdai Hu


This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:

Worker:
# Kill Count by Category - Assignment Change / HB too old / Heap Space
# Time spent in each state
# Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
# Time to start a worker for a topology, from reading the assignment for the first time
# Worker cleanup Time / Worker cleanup Retries
# Worker Suicide Count - category: internal error or Assignment Change

Supervisor:
# Supervisor restart Count
# Blobstore (Request to download time)
# Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
# Download rate of an individual blob (inside the localizer)
# Supervisor localizer thread blob download - how long (outside the localizer)
# Blobstore Update Counts due to Version change
# Blobstore Storage by users

More metrics may be added later.

This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
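To make the shape of these concrete, an illustrative sketch of how a couple of the worker metrics listed above could be expressed against a plain Codahale MetricRegistry. The class, the metric names, and the registration wiring here are hypothetical; the actual names and registry API are part of the patch itself:

{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class SupervisorMetricsSketch {
    private static final MetricRegistry REGISTRY = new MetricRegistry();

    // "Kill Count by Category" becomes one meter per category.
    private static final Meter KILLED_ASSIGNMENT_CHANGE =
        REGISTRY.meter("supervisor:workers-killed-assignment-change");
    private static final Meter KILLED_HB_TIMEOUT =
        REGISTRY.meter("supervisor:workers-killed-hb-timeout");

    // "Time to start worker for topology" becomes a timer around the launch path.
    private static final Timer WORKER_LAUNCH_TIME =
        REGISTRY.timer("supervisor:worker-launch-duration");

    void launchWorker() {
        // Timer.Context is Closeable, so try-with-resources records the elapsed time.
        try (Timer.Context ignored = WORKER_LAUNCH_TIME.time()) {
            // ... read the assignment, localize blobs, start the worker ...
        }
    }

    void onWorkerKilledForReassignment() {
        KILLED_ASSIGNMENT_CHANGE.mark();
    }

    void onWorkerKilledForStaleHeartbeat() {
        KILLED_HB_TIMEOUT.mark();
    }
}
{code}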
[jira] [Updated] (STORM-3099) Extend metrics on supervisor and workers
[ https://issues.apache.org/jira/browse/STORM-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3099:
---
Description:
This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:

Worker:
# Kill Count by Category - Assignment Change / HB too old / Heap Space
# Time spent in each state
# Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
# Time to start a worker for a topology, from reading the assignment for the first time
# Worker cleanup Time / Worker cleanup Retries
# Worker Suicide Count - category: internal error or Assignment Change

Supervisor:
# Supervisor restart Count
# Blobstore (Request to download time)
- # Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
- # Download rate of an individual blob (inside the localizer)
- # Supervisor localizer thread blob download - how long (outside the localizer)
# Blobstore Update Counts due to Version change
# Blobstore Storage by users

More metrics may be added later.

This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

  was:
This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:

Worker:
# Kill Count by Category - Assignment Change / HB too old / Heap Space
# Time spent in each state
# Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
# Time to start a worker for a topology, from reading the assignment for the first time
# Worker cleanup Time / Worker cleanup Retries
# Worker Suicide Count - category: internal error or Assignment Change

Supervisor:
# Supervisor restart Count
# Blobstore (Request to download time)
# Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
# Download rate of an individual blob (inside the localizer)
# Supervisor localizer thread blob download - how long (outside the localizer)
# Blobstore Update Counts due to Version change
# Blobstore Storage by users

More metrics may be added later.

This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

> Extend metrics on supervisor and workers
> ----------------------------------------
>
> Key: STORM-3099
> URL: https://issues.apache.org/jira/browse/STORM-3099
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
>
> This patch extends the metrics on the supervisor and workers. The following metrics are currently being implemented, including but not limited to:
> Worker:
> # Kill Count by Category - Assignment Change / HB too old / Heap Space
> # Time spent in each state
> # Time to actually kill a worker (from the supervisor identifying the need to the actual change in the worker's state) - per worker?
> # Time to start a worker for a topology, from reading the assignment for the first time
> # Worker cleanup Time / Worker cleanup Retries
> # Worker Suicide Count - category: internal error or Assignment Change
> Supervisor:
> # Supervisor restart Count
> # Blobstore (Request to download time)
> - # Download time of an individual blob (inside the localizer) - from the localizer getting the request to the HDFS download finishing
> - # Download rate of an individual blob (inside the localizer)
> - # Supervisor localizer thread blob download - how long (outside the localizer)
> # Blobstore Update Counts due to Version change
> # Blobstore Storage by users
> More metrics may be added later.
> This patch will also refactor code in the relevant files. Bugs found during the process will be reported in other issues and handled separately.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (STORM-3101) Select Registry metrics by calling daemon
Zhengdai Hu created STORM-3101:
------------------------------

Summary: Select Registry metrics by calling daemon
Key: STORM-3101
URL: https://issues.apache.org/jira/browse/STORM-3101
Project: Apache Storm
Issue Type: Improvement
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Assignee: Zhengdai Hu
Fix For: 2.0.0


Metrics registered through StormMetricsRegistry are all added via static methods on the registry class and attached to a singleton MetricRegistry object per process. Currently most metrics are bound to classes (static), so the issue occurs when metrics from irrelevant components are accidentally registered during the class initialization phase.

For example, a process running the supervisor daemon will incorrectly register metrics from Nimbus when the BasicContainer class is initialized, because it statically imports "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", which triggers initialization of the Nimbus class and registration of all of its metrics, even though no Nimbus functionality will be used and no Nimbus metrics will ever be updated.

This creates many garbage metrics and makes the metrics hard to read. Therefore we should filter metric registration by the type of daemon that the process actually runs.

For the implementation, please see the pull request.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
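Illustratively, registration can be made a no-op unless the metric's owning daemon matches the daemon the process was started as. This is only a sketch of the idea; DaemonType, setDaemonType and registerMeter are hypothetical names here, and the actual implementation is in the pull request:

{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public final class DaemonFilteredRegistrySketch {
    public enum DaemonType { NIMBUS, SUPERVISOR, WORKER }

    private static final MetricRegistry REGISTRY = new MetricRegistry();
    private static volatile DaemonType runningDaemon;

    /** Called once from the daemon's main() before any metric-owning class is loaded. */
    public static void setDaemonType(DaemonType type) {
        runningDaemon = type;
    }

    /** Attaches the meter to the process registry only if its owner is actually running. */
    public static Meter registerMeter(DaemonType owner, String name) {
        if (owner != runningDaemon) {
            // Unattached meter: callers can still mark() it, but reporters never see it,
            // so class initialization of another daemon's code registers nothing.
            return new Meter();
        }
        return REGISTRY.meter(name);
    }
}
{code}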
[jira] [Updated] (STORM-3101) Select Registry metrics by running daemon
[ https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3101:
---
Summary: Select Registry metrics by running daemon  (was: Select Registry metrics by calling daemon)

> Select Registry metrics by running daemon
> -----------------------------------------
>
> Key: STORM-3101
> URL: https://issues.apache.org/jira/browse/STORM-3101
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Assignee: Zhengdai Hu
> Priority: Major
> Fix For: 2.0.0
>
>
> Metrics registered through StormMetricsRegistry are all added via static methods on the registry class and attached to a singleton MetricRegistry object per process. Currently most metrics are bound to classes (static), so the issue occurs when metrics from irrelevant components are accidentally registered during the class initialization phase.
> For example, a process running the supervisor daemon will incorrectly register metrics from Nimbus when the BasicContainer class is initialized, because it statically imports "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", which triggers initialization of the Nimbus class and registration of all of its metrics, even though no Nimbus functionality will be used and no Nimbus metrics will ever be updated.
> This creates many garbage metrics and makes the metrics hard to read. Therefore we should filter metric registration by the type of daemon that the process actually runs.
> For the implementation, please see the pull request.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (STORM-3104) Delayed launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3104:
---
Description:
There is a comparison in
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This is because the current assignment in this case is highly likely to be null (I'm not sure if it's guaranteed), and it causes a delay in worker start/restart.

The symptom can be reproduced by launching an empty supervisor and submitting any topology. Here's the log sample:

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2018 -> WAITING_FOR_BLOB_UPDA
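For clarity, the shape of the suspect check described above, paraphrased. This is an illustrative sketch, not the exact Slot.java source; equivalent(), withState() and MachineState follow Slot's naming conventions but the body is an assumption:

{code:java}
// Illustrative paraphrase of the problematic comparison, not the actual source.
static DynamicState handleWaitingForBlobUpdate(DynamicState dynamicState, StaticState staticState) {
    // On a slot that has never run an assignment, currentAssignment is still null,
    // so this inequality holds and the slot is routed straight back to
    // WAITING_FOR_BLOB_LOCALIZATION instead of finishing the blob update,
    // producing the loop visible in the log above.
    if (!equivalent(dynamicState.currentAssignment, dynamicState.newAssignment)) {
        return dynamicState.withState(MachineState.WAITING_FOR_BLOB_LOCALIZATION);
    }
    // ... otherwise wait for the changing blobs to finish and launch the container ...
    return dynamicState;
}
{code}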
[jira] [Created] (STORM-3104) Delayed launch due to accidental transitioning in state machine
Zhengdai Hu created STORM-3104:
------------------------------

Summary: Delayed launch due to accidental transitioning in state machine
Key: STORM-3104
URL: https://issues.apache.org/jira/browse/STORM-3104
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Fix For: 2.0.0


There is a comparison in
{code:java}
handleWaitingForBlobUpdate()
{code}
between the dynamic state's current assignment and new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This is because the current assignment in this case is highly likely to be null (I'm not sure if it's guaranteed), and it causes a delay in worker start/restart.

The symptom can be reproduced by launching an empty supervisor and submitting any topology. Here's the log sample:

{code:sh}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhengdai Hu updated STORM-3104:
---
Summary: Delayed worker launch due to accidental transitioning in state machine  (was: Delayed launch due to accidental transitioning in state machine)

> Delayed worker launch due to accidental transitioning in state machine
> ----------------------------------------------------------------------
>
> Key: STORM-3104
> URL: https://issues.apache.org/jira/browse/STORM-3104
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Critical
> Fix For: 2.0.0
>
>
> There is a comparison in
> {code:java}
> handleWaitingForBlobUpdate()
> {code}
> between the dynamic state's current assignment and new assignment, which accidentally routes a state machine that has just transitioned out of WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION. This is because the current assignment in this case is highly likely to be null (I'm not sure if it's guaranteed), and it causes a delay in worker start/restart.
> The symptom can be reproduced by launching an empty supervisor and submitting any topology. Here's the log sample:
> {code:sh}
> 2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024
> 2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
> 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
> 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are []
> 2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003
> 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
> 2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005
> 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION
> 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending...
> 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_en
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3104: --- Description: In Slot.java, there is a comparison in {code:java} handleWaitingForBlobUpdate() {code} between the dynamic state's current assignment and the new assignment, which accidentally routes a state machine that has just transitioned from WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION, because the current assignment at that point is most likely null and thus different from the new assignment (I'm not sure if that's guaranteed). This delays worker start/restart. The symptom can be reproduced by launching an empty Storm cluster and submitting any topology. Here's a log sample (the relevant transition starts at 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG]): {code:sh} 2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6024 -> EMPTY msInState: 6024 2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY to WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs are [] 2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending... 
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE WAITING_FOR_BLOB_LOCALIZATION 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to pending... 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization LocalAssignment(topology_id:test-1-1528927024, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ? false 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending changes, waiting for them to finish before launching container... 2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE 2
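A minimal sketch of the suspected logic, with hypothetical names (the real comparison lives in Slot.java's handleWaitingForBlobUpdate() and differs in detail): while the slot is still localizing its first assignment, the current assignment is null, so an equality check against the new assignment can never pass, and the state machine is routed back to WAITING_FOR_BLOB_LOCALIZATION even though nothing changed.
{code:java}
import java.util.Objects;

public class SlotTransitionSketch {
    enum MachineState { WAITING_FOR_BLOB_LOCALIZATION, WAITING_FOR_BLOB_UPDATE }

    // Hypothetical stand-in for the assignment comparison; the actual
    // DynamicState handling in Slot.java is more involved.
    static MachineState nextState(Object currentAssignment, Object newAssignment) {
        if (!Objects.equals(currentAssignment, newAssignment)) {
            // On a fresh slot currentAssignment is null, so this branch is
            // always taken and the worker launch is delayed by another pass.
            return MachineState.WAITING_FOR_BLOB_LOCALIZATION;
        }
        return MachineState.WAITING_FOR_BLOB_UPDATE;
    }

    public static void main(String[] args) {
        // Mirrors the "pendingLocalization ... == current null ? false" log line.
        System.out.println(nextState(null, "test-1-1528927024"));
        // -> WAITING_FOR_BLOB_LOCALIZATION
    }
}
{code}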
[jira] [Updated] (STORM-3101) Fix unexpected metrics registration in StormMetricsRegistry
[ https://issues.apache.org/jira/browse/STORM-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3101: --- Summary: Fix unexpected metrics registration in StormMetricsRegistry (was: Select Registry metrics by running daemon) > Fix unexpected metrics registration in StormMetricsRegistry > --- > > Key: STORM-3101 > URL: https://issues.apache.org/jira/browse/STORM-3101 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Metrics registered through StormMetricsRegistry are all added via static > methods on the registry class and attached to a singleton MetricRegistry > object per process. Currently most metrics are bound to classes (static > fields), so the issue occurs when metrics from irrelevant components are > accidentally registered during the class initialization phase. > For example, a process running the supervisor daemon will incorrectly > register metrics from Nimbus when the BasicContainer class is initialized and > statically imports > "org.apache.storm.daemon.nimbus.Nimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT", > which triggers initialization of the Nimbus class and all of its metric > registrations, even though no nimbus daemon functionality will be used and no > nimbus metrics will be updated. > This creates many garbage metrics and makes the metrics hard to read. > Therefore we should filter metric registration by the type of daemon that the > process actually runs. > For the implementation please see the pull request. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
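The class-initialization trigger here is plain Java semantics: statically importing (or otherwise referencing) a field that is not a compile-time constant forces the declaring class to initialize, which runs all of its static metric registrations. A self-contained sketch; the class and metric names below are made up for illustration, not Storm's actual ones:
{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class InitLeakDemo {
    static final MetricRegistry REGISTRY = new MetricRegistry();

    static class FakeNimbus {
        // Not a compile-time constant, so any reference to it forces
        // FakeNimbus class initialization...
        static final Object MIN_VERSION_SUPPORT_RPC_HEARTBEAT = new Object();
        // ...which registers every "nimbus" metric as a side effect.
        static final Meter SUBMIT_CALLS =
            REGISTRY.meter("nimbus:num-submitTopology-calls");
    }

    public static void main(String[] args) {
        // A supervisor-side class merely touching the field pulls in all
        // Nimbus metrics, even though no Nimbus code will ever run here.
        Object touched = FakeNimbus.MIN_VERSION_SUPPORT_RPC_HEARTBEAT;
        System.out.println(REGISTRY.getMeters().keySet());
        // prints [nimbus:num-submitTopology-calls]
    }
}
{code}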
[jira] [Assigned] (STORM-3109) Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers
[ https://issues.apache.org/jira/browse/STORM-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu reassigned STORM-3109: -- Assignee: Zhengdai Hu > Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers > - > > Key: STORM-3109 > URL: https://issues.apache.org/jira/browse/STORM-3109 > Project: Apache Storm > Issue Type: Bug > Components: storm-core >Affects Versions: 2.0.0, 1.1.0, 1.0.3, 1.x, 1.0.4, 1.1.1, 1.2.0, 1.1.2, > 1.0.5, 1.0.6, 1.2.1, 1.1.3, 1.2.2 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > When `STORM_LOCAL_DIR` is set to a relative path, the original implementation > incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon > invocation of `storm kill_workers`. In this case `STORM_LOCAL_DIR` points to > the wrong location, so `storm kill_workers` can't actually kill any workers. > See pull request for implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3109) Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers
Zhengdai Hu created STORM-3109: -- Summary: Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers Key: STORM-3109 URL: https://issues.apache.org/jira/browse/STORM-3109 Project: Apache Storm Issue Type: Bug Components: storm-core Affects Versions: 1.2.2, 1.1.3, 1.2.1, 1.0.6, 1.0.5, 1.1.2, 1.2.0, 1.1.1, 1.0.4, 1.0.3, 1.1.0, 2.0.0, 1.x Reporter: Zhengdai Hu Fix For: 2.0.0 When `STORM_LOCAL_DIR` is set to a relative path, the original implementation incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon invocation of `storm kill_workers`. In this case `STORM_LOCAL_DIR` points to the wrong location, so `storm kill_workers` can't actually kill any workers. See pull request for implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3109) Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers
[ https://issues.apache.org/jira/browse/STORM-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3109: --- Description: When `STORM_LOCAL_DIR` is set to a relative path, the original implementation incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon invocation of `storm kill_workers`. If the current working directory is not the home directory for storm, `STORM_LOCAL_DIR` then points to the wrong location, so `storm kill_workers` can't actually kill any workers. See pull request for implementation. was: When `STORM_LOCAL_DIR` is set to a relative path, the original implementation incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon invocation of `storm kill_workers`. In this case `STORM_LOCAL_DIR` points to the wrong location, so `storm kill_workers` can't actually kill any workers. See pull request for implementation. > Wrong canonical path set to STORM_LOCAL_DIR in storm kill_workers > - > > Key: STORM-3109 > URL: https://issues.apache.org/jira/browse/STORM-3109 > Project: Apache Storm > Issue Type: Bug > Components: storm-core >Affects Versions: 2.0.0, 1.1.0, 1.0.3, 1.x, 1.0.4, 1.1.1, 1.2.0, 1.1.2, > 1.0.5, 1.0.6, 1.2.1, 1.1.3, 1.2.2 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Critical > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > When `STORM_LOCAL_DIR` is set to a relative path, the original implementation > incorrectly appends `STORM_LOCAL_DIR` to the current working directory upon > invocation of `storm kill_workers`. If the current working directory is not > the home directory for storm, `STORM_LOCAL_DIR` then points to the wrong > location, so `storm kill_workers` can't actually kill any workers. > See pull request for implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
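The fix itself is in the `storm` launcher (see the pull request), but the path logic is easy to sketch. A hedged illustration in Java with made-up names (`resolveLocalDir` is not a real Storm API): a relative `STORM_LOCAL_DIR` should be resolved against the Storm home directory rather than whatever directory the command happens to be run from.
{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalDirResolution {
    static Path resolveLocalDir(String stormHome, String stormLocalDir) {
        Path configured = Paths.get(stormLocalDir);
        if (configured.isAbsolute()) {
            return configured.normalize();
        }
        // The buggy behavior amounts to configured.toAbsolutePath(), which
        // silently anchors the path at the current working directory.
        return Paths.get(stormHome).resolve(configured).normalize();
    }

    public static void main(String[] args) {
        System.out.println(resolveLocalDir("/opt/storm", "storm-local"));
        // -> /opt/storm/storm-local, regardless of where the CLI was invoked
    }
}
{code}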
[jira] [Updated] (STORM-3104) Delayed worker launch due to accidental transitioning in state machine
[ https://issues.apache.org/jira/browse/STORM-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3104: --- Priority: Major (was: Critical) > Delayed worker launch due to accidental transitioning in state machine > -- > > Key: STORM-3104 > URL: https://issues.apache.org/jira/browse/STORM-3104 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > In Slot.java, there is a comparison in > {code:java} > handleWaitingForBlobUpdate() > {code} > between the dynamic state's current assignment and the new assignment, which > accidentally routes a state machine that has just transitioned from > WAITING_FOR_BLOB_LOCALIZATION straight back to WAITING_FOR_BLOB_LOCALIZATION, > because the current assignment at that point is most likely null and thus > different from the new assignment (I'm not sure if that's guaranteed). This > delays worker start/restart. > The symptom can be reproduced by launching an empty Storm cluster and > submitting any topology. Here's a log sample (the relevant transition starts > at 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG]): > {code:sh} > 2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY > msInState: 6024 -> EMPTY msInState: 6024 > 2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY > 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from > EMPTY to WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY > msInState: 6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0 > 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE > WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs > are [] > 2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE > WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> > WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 > 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE > WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs > [BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024 > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_start:7, task_end:7), > ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, > task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, > cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, > resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, > cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING > LOCAL TOPO BLOB TOPO_CODE test-1-1528927024 > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_start:7, task_end:7), > ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, > task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, > cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, > resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, > cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to > pending... 
> 2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE > WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> > WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 > 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE > WAITING_FOR_BLOB_LOCALIZATION > 2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs > [BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024 > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_start:7, task_end:7), > ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:13, > task_end:13)], resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, > cpu:60.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, > resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0, > cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to > pending... > 2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization > LocalAssignment(topology_id:test-1-1528927024, > executors:[ExecutorInfo(task_start:10, task_end:10), > ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4, > task_end:4), ExecutorInfo(task_sta
[jira] [Updated] (STORM-3099) Extend metrics on supervisor, workers, and DRPC
[ https://issues.apache.org/jira/browse/STORM-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3099: --- Description: This patch serves to extend metrics on supervisor and worker. Currently the following metrics are being implemented, including but not limited to: Worker: # Kill Count by Category - Assignment Change/HB too old/Heap Space # Time spent in each state # Time to Actually Kill worker (from identifying need by supervisor and actual change in the state of the worker) - per worker? # Time to start worker for topology from reading assignment for the first time. # Worker cleanup Time/Worker cleanup Retries # Worker Suicide Count - category: internal error or Assignment Change Supervisor: # Supervisor restart Count # Blobstore (Request to download time) - # Download time for an individual blob (inside localizer): from the localizer getting the request to the HDFS download actually finishing - # Download rate for an individual blob (inside localizer) - # Supervisor localizer thread blob download - how long (outside localizer) # Blobstore Update due to Version change Cnts # Blobstore Storage by users DRPC: # Avg/Max Time to respond to Http Request There might be more metrics added later. This patch will also refactor code in relevant files. Bugs found during the process will be reported in other issues and handled separately. was: This patch serves to extend metrics on supervisor and worker. Currently the following metrics are being implemented, including but not limited to: Worker: # Kill Count by Category - Assignment Change/HB too old/Heap Space # Time spent in each state # Time to Actually Kill worker (from identifying need by supervisor and actual change in the state of the worker) - per worker? # Time to start worker for topology from reading assignment for the first time. # Worker cleanup Time/Worker cleanup Retries # Worker Suicide Count - category: internal error or Assignment Change Supervisor: # Supervisor restart Count # Blobstore (Request to download time) - # Download time for an individual blob (inside localizer): from the localizer getting the request to the HDFS download actually finishing - # Download rate for an individual blob (inside localizer) - # Supervisor localizer thread blob download - how long (outside localizer) # Blobstore Update due to Version change Cnts # Blobstore Storage by users There might be more metrics added later. This patch will also refactor code in relevant files. Bugs found during the process will be reported in other issues and handled separately. > Extend metrics on supervisor, workers, and DRPC > --- > > Key: STORM-3099 > URL: https://issues.apache.org/jira/browse/STORM-3099 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > This patch serves to extend metrics on supervisor and worker. Currently the > following metrics are being implemented, including but not limited to: > Worker: > # Kill Count by Category - Assignment Change/HB too old/Heap Space > # Time spent in each state > # Time to Actually Kill worker (from identifying need by supervisor and > actual change in the state of the worker) - per worker? > # Time to start worker for topology from reading assignment for the first > time. 
> # Worker cleanup Time/Worker cleanup Retries > # Worker Suicide Count - category: internal error or Assignment Change > Supervisor: > # Supervisor restart Count > # Blobstore (Request to download time) > - # Download time for an individual blob (inside localizer): from the > localizer getting the request to the HDFS download actually finishing > - # Download rate for an individual blob (inside localizer) > - # Supervisor localizer thread blob download - how long (outside > localizer) > # Blobstore Update due to Version change Cnts > # Blobstore Storage by users > DRPC: > # Avg/Max Time to respond to Http Request > There might be more metrics added later. > This patch will also refactor code in relevant files. Bugs found during the > process will be reported in other issues and handled separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3099) Extend metrics on supervisor, workers, and DRPC
[ https://issues.apache.org/jira/browse/STORM-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3099: --- Summary: Extend metrics on supervisor, workers, and DRPC (was: Extend metrics on supervisor and workers) > Extend metrics on supervisor, workers, and DRPC > --- > > Key: STORM-3099 > URL: https://issues.apache.org/jira/browse/STORM-3099 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > This patch serves to extend metrics on supervisor and worker. Currently the > following metrics are being implemented, including but not limited to: > Worker: > # Kill Count by Category - Assignment Change/HB too old/Heap Space > # Time spent in each state > # Time to Actually Kill worker (from identifying need by supervisor and > actual change in the state of the worker) - per worker? > # Time to start worker for topology from reading assignment for the first > time. > # Worker cleanup Time/Worker cleanup Retries > # Worker Suicide Count - category: internal error or Assignment Change > Supervisor: > # Supervisor restart Count > # Blobstore (Request to download time) > - # Download time for an individual blob (inside localizer): from the > localizer getting the request to the HDFS download actually finishing > - # Download rate for an individual blob (inside localizer) > - # Supervisor localizer thread blob download - how long (outside > localizer) > # Blobstore Update due to Version change Cnts > # Blobstore Storage by users > There might be more metrics added later. > This patch will also refactor code in relevant files. Bugs found during the > process will be reported in other issues and handled separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3125) Refactoring methods in Supervisor's component
Zhengdai Hu created STORM-3125: -- Summary: Refactoring methods in Supervisor's component Key: STORM-3125 URL: https://issues.apache.org/jira/browse/STORM-3125 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 This is a supplemental issue to STORM-3099, separating the refactoring work out from the metrics additions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3125) Refactoring methods in Supervisor's components
[ https://issues.apache.org/jira/browse/STORM-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3125: --- Summary: Refactoring methods in Supervisor's components (was: Refactoring methods in Supervisor's component) > Refactoring methods in Supervisor's components > -- > > Key: STORM-3125 > URL: https://issues.apache.org/jira/browse/STORM-3125 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > This is a supplemental issue to STORM-3099, separating the refactoring > work out from the metrics additions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3125) Refactoring methods in components for Supervisor and DRPC
[ https://issues.apache.org/jira/browse/STORM-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3125: --- Summary: Refactoring methods in components for Supervisor and DRPC (was: Refactoring methods in Supervisor's components) > Refactoring methods in components for Supervisor and DRPC > - > > Key: STORM-3125 > URL: https://issues.apache.org/jira/browse/STORM-3125 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > This is a supplemental issue to STORM-3099, separating the refactoring > work out from the metrics additions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3126) Avoid unnecessary force kill when invoking storm kill_workers
Zhengdai Hu created STORM-3126: -- Summary: Avoid unnecessary force kill when invoking storm kill_workers Key: STORM-3126 URL: https://issues.apache.org/jira/browse/STORM-3126 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The supervisor tries to force kill a worker before checking whether it has already died, leading to unnecessary force kill calls. This is minor but does help clean up the logs a little bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
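A minimal sketch of the reordering, using a hypothetical container interface rather than the real supervisor Container API:
{code:java}
import java.io.IOException;

public class KillSequenceSketch {
    interface WorkerContainer {
        boolean areAllProcessesDead() throws IOException;
        void kill() throws IOException;       // graceful shutdown request
        void forceKill() throws IOException;  // hard kill escalation
    }

    static void shutdown(WorkerContainer c) throws IOException {
        c.kill();
        // Before: the escalation happened without checking for liveness.
        // After: skip the force kill (and its log noise) when the worker
        // has already exited from the graceful kill.
        if (!c.areAllProcessesDead()) {
            c.forceKill();
        }
    }
}
{code}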
[jira] [Created] (STORM-3127) Avoid potential race condition
Zhengdai Hu created STORM-3127: -- Summary: Avoid potential race condition Key: STORM-3127 URL: https://issues.apache.org/jira/browse/STORM-3127 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 PortAndAssignment and its call back is added after update to a blob is invoked asynchronously. It is not guaranteed that the new dependent worker will be registered before blob informs its update to listening workers. This can be fixed by moving addReference call up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
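A self-contained sketch of the race with illustrative names (not the actual AsyncLocalizer/PortAndAssignment API): if the asynchronous update is kicked off before the dependent worker registers its callback, the notification can fire against an empty listener list.
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;

public class BlobUpdateRaceSketch {
    static class LocalizedBlob {
        private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

        void addReference(Runnable onChange) { listeners.add(onChange); }

        void updateAsync(ExecutorService pool) {
            // Notifies whoever is registered at the time the task runs.
            pool.submit(() -> listeners.forEach(Runnable::run));
        }
    }

    static void localize(LocalizedBlob blob, ExecutorService pool, Runnable workerCallback) {
        // Buggy order: updateAsync(pool) first, addReference(...) second --
        // the async notification may run before the worker is registered.
        // Fixed order: register the dependent worker first.
        blob.addReference(workerCallback);
        blob.updateAsync(pool);
    }
}
{code}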
[jira] [Created] (STORM-3128) Connection refused error in AsyncLocalizerTest
Zhengdai Hu created STORM-3128: -- Summary: Connection refused error in AsyncLocalizerTest Key: STORM-3128 URL: https://issues.apache.org/jira/browse/STORM-3128 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created and tries but failed to connect to zookeeper due to connection error. I'm not sure if this compromises the test even though it is passed after connection retry timeout. But it's nice to keep in mind. {noformat} 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) 2018-06-27 13:05:28.032 [main] INFO org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_171] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_171] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525364#comment-16525364 ] Zhengdai Hu commented on STORM-3128: This issue is discovered when I tried to refactor the test > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > In AsyncLocalizerTest's testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to ZooKeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout, but it's worth keeping in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525364#comment-16525364 ] Zhengdai Hu edited comment on STORM-3128 at 6/27/18 6:17 PM: - I discovered the issue when trying to refactor the test was (Author: zhengdai): This issue is discovered when I tried to refactor the test > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > In AsyncLocalizerTest's testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to ZooKeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout, but it's worth keeping in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3125) Refactoring methods in components for Supervisor and DRPC
[ https://issues.apache.org/jira/browse/STORM-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3125: --- Description: This is a supplemental issue to STORM-3099, separating the refactoring work out from the metrics additions. A few misc bugs discovered during refactoring have been incorporated into this issue as well. See links for more information. was: This is a supplemental issue to STORM-3099, separating the refactoring work out from the metrics additions. > Refactoring methods in components for Supervisor and DRPC > - > > Key: STORM-3125 > URL: https://issues.apache.org/jira/browse/STORM-3125 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > This is a supplemental issue to STORM-3099, separating the refactoring > work out from the metrics additions. > A few misc bugs discovered during refactoring have been incorporated into > this issue as well. See links for more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3129) Worker state machine does not use correct time util to get start time
Zhengdai Hu created STORM-3129: -- Summary: Worker state machine does not use correct time util to get start time Key: STORM-3129 URL: https://issues.apache.org/jira/browse/STORM-3129 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The current implementation uses System.currentTimeMillis() instead of Time.currentTimeMillis() to get the state start time. This may create problems in unit tests, which use simulated time controlled by Storm's Time util. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
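A sketch of why this matters, based on the simulated-time mode of org.apache.storm.utils.Time (the exact test wiring may differ by version): under simulated time the Storm clock is frozen and advanced manually, so timestamps taken via System.currentTimeMillis() are invisible to simulated-time assertions.
{code:java}
import org.apache.storm.utils.Time;

public class SimulatedTimeDemo {
    public static void main(String[] args) throws Exception {
        try (Time.SimulatedTime ignored = new Time.SimulatedTime()) {
            long start = Time.currentTimeMillis(); // simulated clock
            Time.advanceTime(5_000);               // jump ahead 5 seconds
            long elapsed = Time.currentTimeMillis() - start;
            // elapsed == 5000 only because start came from Time, not System;
            // a System.currentTimeMillis() start would observe ~0 ms elapsed.
            System.out.println("elapsed (simulated): " + elapsed);
        }
    }
}
{code}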
[jira] [Created] (STORM-3130) Add Timer registration and Timed object wrapper to Storm metrics util.
Zhengdai Hu created STORM-3130: -- Summary: Add Timer registration and Timed object wrapper to Storm metrics util. Key: STORM-3130 URL: https://issues.apache.org/jira/browse/STORM-3130 Project: Apache Storm Issue Type: New Feature Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 This allows us to time method running duration or variable/resource lifespan. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
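A hypothetical sketch of what such a helper could look like on top of the Dropwizard metrics library that StormMetricsRegistry wraps; the names registerTimer and timed are assumptions, not the API actually added:
{code:java}
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.Callable;

public class TimedUtilSketch {
    static final MetricRegistry REGISTRY = new MetricRegistry();

    static Timer registerTimer(String name) {
        return REGISTRY.timer(name);
    }

    // Times one method invocation; the Timer records call rate and duration.
    static <T> T timed(Timer timer, Callable<T> body) throws Exception {
        try (Timer.Context ignored = timer.time()) {
            return body.call();
        }
    }

    public static void main(String[] args) throws Exception {
        Timer t = registerTimer("supervisor:blob-download-duration");
        String result = timed(t, () -> "downloaded");
        System.out.println(result + ", samples recorded: " + t.getCount());
    }
}
{code}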
[jira] [Created] (STORM-3133) Extend metrics on Nimbus and LogViewer
Zhengdai Hu created STORM-3133: -- Summary: Extend metrics on Nimbus and LogViewer Key: STORM-3133 URL: https://issues.apache.org/jira/browse/STORM-3133 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) Nimbus Additional: - Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. - Scheduler related metrics would be a long list generic and specific to different strategies. - Most if not all cluster summary can be pushed as Metrics. - Restart cnt - Nimbus loss of leadership(?) - UI not responding (https://jira.ouroath.com/browse/YSTORM-4838) - Negative resource scheduling events (https://jira.ouroath.com/browse/YSTORM-4940) - Excessive scheduling time (?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3140) Duplicated method?
Zhengdai Hu created STORM-3140: -- Summary: Duplicated method? Key: STORM-3140 URL: https://issues.apache.org/jira/browse/STORM-3140 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu {code:java} /** * Handles '/searchLogs' request. */ @GET @Path("/searchLogs") public Response searchLogs(@Context HttpServletRequest request) throws IOException { String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); } /** * Handles '/listLogs' request. */ @GET @Path("/listLogs") public Response listLogs(@Context HttpServletRequest request) throws IOException { meterListLogsHttpRequests.mark(); String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); }{code} These two methods are identical, although they seem to serve different functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3140) Duplicated method in Logviewer REST API?
[ https://issues.apache.org/jira/browse/STORM-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3140: --- Description: {code:java} /** * Handles '/searchLogs' request. */ @GET @Path("/searchLogs") public Response searchLogs(@Context HttpServletRequest request) throws IOException { String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); } /** * Handles '/listLogs' request. */ @GET @Path("/listLogs") public Response listLogs(@Context HttpServletRequest request) throws IOException { meterListLogsHttpRequests.mark(); String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); }{code} These two methods are identical although they seem to serve different functions. was: {code:java} /** * Handles '/searchLogs' request. */ @GET @Path("/searchLogs") public Response searchLogs(@Context HttpServletRequest request) throws IOException { String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); } /** * Handles '/listLogs' request. */ @GET @Path("/listLogs") public Response listLogs(@Context HttpServletRequest request) throws IOException { meterListLogsHttpRequests.mark(); String user = httpCredsHandler.getUserName(request); String topologyId = request.getParameter("topoId"); String portStr = request.getParameter("port"); String callback = request.getParameter("callback"); String origin = request.getHeader("Origin"); return logviewer.listLogFiles(user, portStr != null ? Integer.parseInt(portStr) : null, topologyId, callback, origin); }{code} These two methods have identical although they seem to serve different functions. > Duplicated method in Logviewer REST API? > > > Key: STORM-3140 > URL: https://issues.apache.org/jira/browse/STORM-3140 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > {code:java} > /** > * Handles '/searchLogs' request. > */ > @GET > @Path("/searchLogs") > public Response searchLogs(@Context HttpServletRequest request) throws > IOException { > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > } > /** > * Handles '/listLogs' request. 
> */ > @GET > @Path("/listLogs") > public Response listLogs(@Context HttpServletRequest request) throws > IOException { > meterListLogsHttpRequests.mark(); > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > }{code} > These two methods are identical although they seem to serve different > functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3140) Duplicated method in Logviewer REST API?
[ https://issues.apache.org/jira/browse/STORM-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3140: --- Summary: Duplicated method in Logviewer REST API? (was: Duplicated method?) > Duplicated method in Logviewer REST API? > > > Key: STORM-3140 > URL: https://issues.apache.org/jira/browse/STORM-3140 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > {code:java} > /** > * Handles '/searchLogs' request. > */ > @GET > @Path("/searchLogs") > public Response searchLogs(@Context HttpServletRequest request) throws > IOException { > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > } > /** > * Handles '/listLogs' request. > */ > @GET > @Path("/listLogs") > public Response listLogs(@Context HttpServletRequest request) throws > IOException { > meterListLogsHttpRequests.mark(); > String user = httpCredsHandler.getUserName(request); > String topologyId = request.getParameter("topoId"); > String portStr = request.getParameter("port"); > String callback = request.getParameter("callback"); > String origin = request.getHeader("Origin"); > return logviewer.listLogFiles(user, portStr != null ? > Integer.parseInt(portStr) : null, topologyId, callback, origin); > }{code} > These two methods are identical, although they seem to serve different > functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
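If the duplication turns out to be unintended, one hedged way to remove it is to have both endpoints delegate to a shared private helper until the correct searchLogs implementation is sorted out. This fragment reuses the identifiers quoted above; listLogFilesFor is a hypothetical name:
{code:java}
/**
 * Handles '/listLogs' request.
 */
@GET
@Path("/listLogs")
public Response listLogs(@Context HttpServletRequest request) throws IOException {
    meterListLogsHttpRequests.mark();
    return listLogFilesFor(request);
}

// Shared helper extracted from the two identical bodies above.
private Response listLogFilesFor(HttpServletRequest request) throws IOException {
    String user = httpCredsHandler.getUserName(request);
    String topologyId = request.getParameter("topoId");
    String portStr = request.getParameter("port");
    String callback = request.getParameter("callback");
    String origin = request.getHeader("Origin");
    return logviewer.listLogFiles(user,
        portStr != null ? Integer.parseInt(portStr) : null,
        topologyId, callback, origin);
}
{code}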
[jira] [Updated] (STORM-3143) Unnecessary inclusion of empty match result in Json
[ https://issues.apache.org/jira/browse/STORM-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3143: --- Summary: Unnecessary inclusion of empty match result in Json (was: Unnecessary inclusion of empty string match result in Json) > Unnecessary inclusion of empty match result in Json > --- > > Key: STORM-3143 > URL: https://issues.apache.org/jira/browse/STORM-3143 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > `FindNMatches()` didn't correctly filter out empty match results in > `substringSearch()` and hence sends back an empty map to the user. I don't > know if this is the desired behavior, but a fix to the current behavior will > make metrics for the logviewer easier to implement. > An example of the current behavior: > {code:json} > { > "fileOffset": 1, > "searchString": "sdf", > "matches": [ > { > "searchString": "sdf", > "fileName": "word-count-1-1530815972/6701/worker.log", > "matches": [], > "port": "6701", > "isDaemon": "no", > "startByteOffset": 0 > } > ] > } > {code} > Desired behavior: > {code:json} > { > "fileOffset": 1, > "searchString": "sdf", > "matches": [] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3143) Unnecessary inclusion of empty string match result in Json
Zhengdai Hu created STORM-3143: -- Summary: Unnecessary inclusion of empty string match result in Json Key: STORM-3143 URL: https://issues.apache.org/jira/browse/STORM-3143 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 `FindNMatches()` didn't correctly filter out empty match results in `substringSearch()` and hence sends back an empty map to the user. I don't know if this is the desired behavior, but a fix to the current behavior will make metrics for the logviewer easier to implement. An example of the current behavior: {code:json} { "fileOffset": 1, "searchString": "sdf", "matches": [ { "searchString": "sdf", "fileName": "word-count-1-1530815972/6701/worker.log", "matches": [], "port": "6701", "isDaemon": "no", "startByteOffset": 0 } ] } {code} Desired behavior: {code:json} { "fileOffset": 1, "searchString": "sdf", "matches": [] } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
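A minimal sketch of the proposed filtering, assuming a hypothetical per-file result map shaped like the JSON above (the real search code works on its own result types): drop per-file entries whose inner "matches" list is empty, so the top-level "matches" array only lists files with at least one hit.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class EmptyMatchFilterSketch {
    static List<Map<String, Object>> keepNonEmpty(List<Map<String, Object>> perFileResults) {
        List<Map<String, Object>> filtered = new ArrayList<>();
        for (Map<String, Object> fileResult : perFileResults) {
            List<?> hits = (List<?>) fileResult.get("matches");
            // Skip files that matched nothing instead of returning them
            // with an empty "matches" array.
            if (hits != null && !hits.isEmpty()) {
                filtered.add(fileResult);
            }
        }
        return filtered;
    }
}
{code}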
[jira] [Created] (STORM-3144) Extend metrics on Nimbus
Zhengdai Hu created STORM-3144: -- Summary: Extend metrics on Nimbus Key: STORM-3144 URL: https://issues.apache.org/jira/browse/STORM-3144 Project: Apache Storm Issue Type: Improvement Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 Metrics include: # File upload time # Nimbus restart count # Nimbus loss of leadership: meter marking when a nimbus node gains or loses leadership # Excessive scheduling time (both duration distribution and current longest) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3133) Extend metrics on Nimbus and LogViewer
[ https://issues.apache.org/jira/browse/STORM-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3133: --- Description: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) -Nimbus Additional:- - -Topology stormjar.ser/stormconf.ser/stormser.ser file upload time.- - -Scheduler related metrics would be a long list generic and specific to different strategies.- - -Most if not all cluster summary can be pushed as Metrics.- - -Restart cnt- - -Nimbus loss of leadership(?)- - -UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838])- - -Negative resource scheduling events ([https://jira.ouroath.com/browse/YSTORM-4940])- - -Excessive scheduling time (?)- Nimbus metrics have been moved to STORM-3144 was: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) Nimbus Additional: - Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. - Scheduler related metrics would be a long list generic and specific to different strategies. - Most if not all cluster summary can be pushed as Metrics. - Restart cnt - Nimbus loss of leadership(?) - UI not responding (https://jira.ouroath.com/browse/YSTORM-4838) - Negative resource scheduling events (https://jira.ouroath.com/browse/YSTORM-4940) - Excessive scheduling time (?) > Extend metrics on Nimbus and LogViewer > -- > > Key: STORM-3133 > URL: https://issues.apache.org/jira/browse/STORM-3133 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Including but not limited to: > Logviewer > 1. Clean-up time > 2. Time to complete one clean-up loop. > 3. Disk usage by logs before and after a cleanup loop (just like GC?). > 4. Failures/exceptions. > 5. Search request Cnt: By category - Archived/non-archived > 6. Search Request - Response time > 7. Search Request - 0 result Cnt > 8. Search Result - open files > 9. File partial read count > 10. File Download request Cnt and Size served > 11. Disk IO by logviewer > 12. 
CPU usage (for unzipping files) > -Nimbus Additional:- > - -Topology stormjar.ser/stormconf.ser/stormser.ser file upload time.- > - -Scheduler related metrics would be a long list generic and specific to > different strategies.- > - -Most if not all cluster summary can be pushed as Metrics.- > - -Restart cnt- > - -Nimbus loss of leadership(?)- > - -UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838])- > - -Negative resource scheduling events > ([https://jira.ouroath.com/browse/YSTORM-4940])- > - -Excessive scheduling time (?)- > > Nimbus metrics have been moved to STORM-3144 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3133) Extend metrics on Nimbus and LogViewer
[ https://issues.apache.org/jira/browse/STORM-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3133: --- Description: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) Nimbus Additional: * Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. * Scheduler related metrics would be a long list generic and specific to different strategies. * Most if not all cluster summary can be pushed as Metrics. * Restart cnt * Nimbus loss of leadership (?) * UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838]) * Negative resource scheduling events ([https://jira.ouroath.com/browse/YSTORM-4940]) * Excessive scheduling time (?) was: Including but not limited to: Logviewer 1. Clean-up time 2. Time to complete one clean-up loop. 3. Disk usage by logs before and after a cleanup loop (just like GC?). 4. Failures/exceptions. 5. Search request Cnt: By category - Archived/non-archived 6. Search Request - Response time 7. Search Request - 0 result Cnt 8. Search Result - open files 9. File partial read count 10. File Download request Cnt and Size served 11. Disk IO by logviewer 12. CPU usage (for unzipping files) -Nimbus Additional:- - -Topology stormjar.ser/stormconf.ser/stormser.ser file upload time.- - -Scheduler related metrics would be a long list generic and specific to different strategies.- - -Most if not all cluster summary can be pushed as Metrics.- - -Restart cnt- - -Nimbus loss of leadership(?)- - -UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838])- - -Negative resource scheduling events ([https://jira.ouroath.com/browse/YSTORM-4940])- - -Excessive scheduling time (?)- Nimbus metrics have been moved to STORM-3144 > Extend metrics on Nimbus and LogViewer > -- > > Key: STORM-3133 > URL: https://issues.apache.org/jira/browse/STORM-3133 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Including but not limited to: > Logviewer > 1. Clean-up time > 2. Time to complete one clean-up loop. > 3. Disk usage by logs before and after a cleanup loop (just like GC?). > 4. Failures/exceptions. > 5. Search request Cnt: By category - Archived/non-archived > 6. Search Request - Response time > 7. Search Request - 0 result Cnt > 8. Search Result - open files > 9. File partial read count > 10. File Download request Cnt and Size served > 11. Disk IO by logviewer > 12. CPU usage (for unzipping files) > Nimbus Additional: > * Topology stormjar.ser/stormconf.ser/stormser.ser file upload time. > * Scheduler related metrics would be a long list generic and specific to > different strategies. > * Most if not all cluster summary can be pushed as Metrics. 
> * Restart cnt > * Nimbus loss of leadership (?) > * UI not responding ([https://jira.ouroath.com/browse/YSTORM-4838]) > * Negative resource scheduling events > ([https://jira.ouroath.com/browse/YSTORM-4940]) > * Excessive scheduling time (?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (STORM-3144) Extend metrics on Nimbus
[ https://issues.apache.org/jira/browse/STORM-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu closed STORM-3144. -- Resolution: Duplicate > Extend metrics on Nimbus > > > Key: STORM-3144 > URL: https://issues.apache.org/jira/browse/STORM-3144 > Project: Apache Storm > Issue Type: Improvement > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > Metrics include: > # File upload time > # Nimbus restart count > # Nimbus loss of leadership: meter marking when a nimbus node gains or loses > leadership > # Excessive scheduling time (both duration distribution and current longest) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3147) Port ClusterSummary as Metrics to StormMetricsRegistry
Zhengdai Hu created STORM-3147: -- Summary: Port ClusterSummary as Metrics to StormMetricsRegistry Key: STORM-3147 URL: https://issues.apache.org/jira/browse/STORM-3147 Project: Apache Storm Issue Type: New Feature Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3150) Improve Gauge Registration in StormMetricsRegistry
Zhengdai Hu created STORM-3150: -- Summary: Improve Gauge Registration in StormMetricsRegistry Key: STORM-3150 URL: https://issues.apache.org/jira/browse/STORM-3150 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 Make #registerGauge and #registerProvidedGauge generic and clean up other code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
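A hypothetical sketch of what a generic registerGauge might look like (signatures are assumptions, not the actual change): accept any Supplier<T> and adapt it to a Dropwizard Gauge<T>, so callers stop hand-writing anonymous Gauge classes.
{code:java}
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.function.Supplier;

public class GaugeUtilSketch {
    static final MetricRegistry REGISTRY = new MetricRegistry();

    // Generic registration: Gauge<T> has a single method, so a
    // Supplier<T> adapts to it with a method reference.
    static <T> Gauge<T> registerGauge(String name, Supplier<T> source) {
        Gauge<T> gauge = source::get;
        return REGISTRY.register(name, gauge);
    }

    public static void main(String[] args) {
        registerGauge("nimbus:num-supervisors", () -> 3);
        System.out.println(REGISTRY.getGauges().keySet());
        // prints [nimbus:num-supervisors]
    }
}
{code}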
[jira] [Created] (STORM-3151) Negative Scheduling Resource/Overscheduling issue
Zhengdai Hu created STORM-3151: -- Summary: Negative Scheduling Resource/Overscheduling issue Key: STORM-3151 URL: https://issues.apache.org/jira/browse/STORM-3151 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetricsRegistry. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3151) Negative Scheduling Resource/Overscheduling issue
[ https://issues.apache.org/jira/browse/STORM-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3151: --- Description: Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. was: Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. > Negative Scheduling Resource/Overscheduling issue > - > > Key: STORM-3151 > URL: https://issues.apache.org/jira/browse/STORM-3151 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > Possible overscheduling captured when the following steps are performed: > 1) launch nimbus & zookeeper > 2) launch supervisor 1 > 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) > 4) launch supervisor 2 > 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) > {noformat} > 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled > on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 > {noformat} > This indicates there may be issues inside the scheduler. > It was discovered when I ported ClusterSummary to StormMetrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3151) Negative Scheduling Resource/Overscheduling issue
[ https://issues.apache.org/jira/browse/STORM-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3151: --- Description: Possible overscheduling captured when the following steps are performed (logging is added in STORM-3147): 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. was: Possible overscheduling captured when the following steps are performed: 1) launch nimbus & zookeeper 2) launch supervisor 1 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) 4) launch supervisor 2 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) {noformat} 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 {noformat} This indicates there may be issues inside the scheduler. It was discovered when I ported ClusterSummary to StormMetrics. > Negative Scheduling Resource/Overscheduling issue > - > > Key: STORM-3151 > URL: https://issues.apache.org/jira/browse/STORM-3151 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > Possible overscheduling captured when the following steps are performed > (logging is added in STORM-3147): > 1) launch nimbus & zookeeper > 2) launch supervisor 1 > 3) launch topology 1 (I used org.apache.storm.starter.WordCountTopology) > 4) launch supervisor 2 > 5) launch topology 2 (I used org.apache.storm.starter.ExclamationTopology) > {noformat} > 2018-07-13 12:58:43.196 o.a.s.d.n.Nimbus timer [WARN] Memory over-scheduled > on 176ec6d4-2df3-40ca-95ca-c84a81dbcc22-172.130.97.212 > {noformat} > This indicates there may be issues inside the scheduler. > It was discovered when I ported ClusterSummary to StormMetrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3157) General improvement to StormMetricsRegistry
Zhengdai Hu created STORM-3157: -- Summary: General improvement to StormMetricsRegistry Key: STORM-3157 URL: https://issues.apache.org/jira/browse/STORM-3157 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The solution contains general improvements and cleanup to StormMetricsRegistry. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3157) General improvement to StormMetricsRegistry
[ https://issues.apache.org/jira/browse/STORM-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3157: --- Description: The solution contains general improvements and cleanup to StormMetricsRegistry. Therefore this may affect all current and future changes to server-side metrics. (was: The solution contains general improvements and cleanup to StormMetricsRegistry.) > General improvement to StormMetricsRegistry > --- > > Key: STORM-3157 > URL: https://issues.apache.org/jira/browse/STORM-3157 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > The solution contains general improvements and cleanup to > StormMetricsRegistry. Therefore this may affect all current and future > changes to server-side metrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3159) Fixed potential file resource leak
Zhengdai Hu created STORM-3159: -- Summary: Fixed potential file resource leak Key: STORM-3159 URL: https://issues.apache.org/jira/browse/STORM-3159 Project: Apache Storm Issue Type: Bug Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 `zipFileSize()` in ServerUtils is not correctly wrapped in a try-with-resources block, which could lead to a resource leak. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
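The fix pattern is ordinary try-with-resources. A minimal sketch (the method body is a placeholder, not the actual ServerUtils implementation):
{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ZipFileSizeSketch {
    // Declaring the file handle in the try-with-resources header guarantees it
    // is closed even if an exception is thrown mid-read, so no handle leaks.
    public static long zipFileSize(File myFile) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(myFile, "r")) {
            return raf.length(); // placeholder for the real size computation
        }
    }
}
{code}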
[jira] [Created] (STORM-3162) Race condition at updateHeartbeatCache
Zhengdai Hu created STORM-3162: -- Summary: Race condition at updateHeartbeatCache Key: STORM-3162 URL: https://issues.apache.org/jira/browse/STORM-3162 Project: Apache Storm Issue Type: Bug Components: storm-client, storm-core, storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 This was discovered during testing for STORM-3133. The Travis-CI log can be found [here|https://travis-ci.org/apache/storm/jobs/408719153]. Specifically, updateHeartbeatCache can be invoked both by Nimbus (at `Nimbus#updateHeartBeats`) and by the Supervisor (at `Nimbus#updateCachedHeartbeatsFromWorker` and `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a ConcurrentModificationException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3162) Race condition at updateHeartbeatCache
[ https://issues.apache.org/jira/browse/STORM-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3162: --- Description: This was discovered during testing for STORM-3133. The Travis-CI log can be found [here|https://travis-ci.org/apache/storm/jobs/408719153#L1897]. Specifically, updateHeartbeatCache can be invoked both by Nimbus (at `Nimbus#updateHeartBeats`) and by the Supervisor (at `Nimbus#updateCachedHeartbeatsFromWorker` and `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a ConcurrentModificationException. was: This was discovered during testing for STORM-3133. The Travis-CI log can be found [here|https://travis-ci.org/apache/storm/jobs/408719153]. Specifically, updateHeartbeatCache can be invoked both by Nimbus (at `Nimbus#updateHeartBeats`) and by the Supervisor (at `Nimbus#updateCachedHeartbeatsFromWorker` and `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a ConcurrentModificationException. > Race condition at updateHeartbeatCache > -- > > Key: STORM-3162 > URL: https://issues.apache.org/jira/browse/STORM-3162 > Project: Apache Storm > Issue Type: Bug > Components: storm-client, storm-core, storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Critical > Fix For: 2.0.0 > > > This was discovered during testing for STORM-3133. The Travis-CI log can be found > [here|https://travis-ci.org/apache/storm/jobs/408719153#L1897]. > Specifically, updateHeartbeatCache can be invoked both by Nimbus (at > `Nimbus#updateHeartBeats`) and by the Supervisor (at > `Nimbus#updateCachedHeartbeatsFromWorker` and > `Nimbus#updateCachedHeartbeatsFromSupervisor`), causing a > ConcurrentModificationException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
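One way to make such a shared cache safe for both callers is a concurrent map, sketched below with hypothetical names (the real heartbeat cache's types and the eventual fix may differ):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatCacheSketch {
    // ConcurrentHashMap tolerates concurrent writers, and its iterators are
    // weakly consistent, so a sweep on one thread does not throw
    // ConcurrentModificationException while another thread is updating.
    private final Map<String, Long> lastBeatMs = new ConcurrentHashMap<>();

    public void updateFromWorker(String executorId, long beatMs) {
        lastBeatMs.put(executorId, beatMs);
    }

    public void expireStale(long nowMs, long timeoutMs) {
        lastBeatMs.entrySet().removeIf(e -> nowMs - e.getValue() > timeoutMs);
    }
}
{code}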
[jira] [Created] (STORM-3169) Misleading logviewer.cleanup.age.min
Zhengdai Hu created STORM-3169: -- Summary: Misleading logviewer.cleanup.age.min Key: STORM-3169 URL: https://issues.apache.org/jira/browse/STORM-3169 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The config logviewer.cleanup.age.min specifies the duration in minutes that must pass since a log file was last modified before we consider the log old. However, in actual use it is subtracted from nowMillis, the current time in milliseconds. We should convert the minutes to milliseconds for it to function correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
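A minimal sketch of the unit fix (names are illustrative, not the actual logviewer code):
{code:java}
import java.io.File;
import java.util.concurrent.TimeUnit;

public class CleanupAgeSketch {
    // The configured age is in minutes, so it must be converted to milliseconds
    // before being compared against file modification times, which are in
    // milliseconds since the epoch.
    public static boolean isOldEnoughToClean(File logFile, long cleanupAgeMins, long nowMillis) {
        long cutoffMillis = nowMillis - TimeUnit.MINUTES.toMillis(cleanupAgeMins);
        return logFile.lastModified() <= cutoffMillis;
    }
}
{code}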
[jira] [Created] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
Zhengdai Hu created STORM-3170: -- Summary: DirectoryCleaner may not correctly report correct number of deleted files Key: STORM-3170 URL: https://issues.apache.org/jira/browse/STORM-3170 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 The original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
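A minimal sketch of the fix pattern (illustrative, not the actual DirectoryCleaner code): only count a file as deleted when File#delete actually returns true.
{code:java}
import java.io.File;
import java.util.List;

public class DeleteCountSketch {
    // Returns the number of files that were actually removed, so any metric
    // built on the count reflects reality instead of assuming success.
    public static int deleteAll(List<File> files) {
        int deleted = 0;
        for (File f : files) {
            if (f.delete()) {
                deleted++;
            } else {
                System.err.println("Failed to delete " + f.getPath());
            }
        }
        return deleted;
    }
}
{code}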
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this. (was: The original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3169) Misleading logviewer.cleanup.age.min
[ https://issues.apache.org/jira/browse/STORM-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3169: --- Description: The config logviewer.cleanup.age.min specifies the duration in minutes that must pass since a log file was last modified before we consider the log old. However, in actual use it is subtracted from nowMillis, the current time in milliseconds. We should convert it to milliseconds. (was: The config logviewer.cleanup.age.min specifies the duration in minutes that must pass since a log file was last modified before we consider the log old. However, in actual use it is subtracted from nowMillis, the current time in milliseconds. We should convert the minutes to milliseconds for it to function correctly.) > Misleading logviewer.cleanup.age.min > > > Key: STORM-3169 > URL: https://issues.apache.org/jira/browse/STORM-3169 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The config logviewer.cleanup.age.min specifies the duration in minutes that > must pass since a log file was last modified before we consider the log old. > However, in actual use it is subtracted from nowMillis, the current time in > milliseconds. We should convert it to milliseconds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this. (was: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Fix For: 2.0.0 > > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This prevents DirectoryCleaner from cleaning up > other files and invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this. (was: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This prevents DirectoryCleaner from cleaning up > other files and invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3170) DirectoryCleaner may not correctly report correct number of deleted files
[ https://issues.apache.org/jira/browse/STORM-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3170: --- Description: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this. (was: In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation calls file#delete without checking whether it succeeds, yet the files are always reported as deleted. This prevents DirectoryCleaner from cleaning up other files and invalidates any metrics built on top of this.) > DirectoryCleaner may not correctly report correct number of deleted files > - > > Key: STORM-3170 > URL: https://issues.apache.org/jira/browse/STORM-3170 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In DirectoryCleaner#deleteOldestWhileTooLarge, the original implementation > calls file#delete without checking whether it succeeds, yet the files are > always reported as deleted. This prevents DirectoryCleaner from cleaning up > other files and invalidates any metrics built on top of this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3173) flush metrics to Yamas on shutdown
Zhengdai Hu created STORM-3173: -- Summary: flush metrics to Yamas on shutdown Key: STORM-3173 URL: https://issues.apache.org/jira/browse/STORM-3173 Project: Apache Storm Issue Type: Improvement Components: storm-server, storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Assignee: Zhengdai Hu Fix For: 2.0.0 We lose shutdown-related metrics that we should alert on. We should flush metrics on shutdown. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3173) flush metrics to Yamas on shutdown
[ https://issues.apache.org/jira/browse/STORM-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3173: --- Description: We lose shutdown-related metrics that we should alert on. We should flush metrics on shutdown. https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4497 was: We lose shutdown-related metrics that we should alert on. We should flush metrics on shutdown. > flush metrics to Yamas on shutdown > -- > > Key: STORM-3173 > URL: https://issues.apache.org/jira/browse/STORM-3173 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server, storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > We lose shutdown-related metrics that we should alert on. We > should flush metrics on shutdown. > https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4497 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3173) flush metrics to ScheduledReporter on shutdown
[ https://issues.apache.org/jira/browse/STORM-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3173: --- Summary: flush metrics to ScheduledReporter on shutdown (was: flush metrics to Yamas on shutdown) > flush metrics to ScheduledReporter on shutdown > -- > > Key: STORM-3173 > URL: https://issues.apache.org/jira/browse/STORM-3173 > Project: Apache Storm > Issue Type: Improvement > Components: storm-server, storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Assignee: Zhengdai Hu >Priority: Minor > Fix For: 2.0.0 > > > We lose shutdown-related metrics that we should alert on. We > should flush metrics on shutdown. > https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4497 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
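A minimal sketch of the idea, using the Dropwizard ScheduledReporter API (the ConsoleReporter and the hook wiring are illustrative; Storm's actual reporters and shutdown path may differ):
{code:java}
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.ScheduledReporter;
import java.util.concurrent.TimeUnit;

public class FlushOnShutdownSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        ScheduledReporter reporter = ConsoleReporter.forRegistry(registry).build();
        reporter.start(10, TimeUnit.SECONDS);

        // Before the JVM exits, force one final synchronous report so metrics
        // recorded since the last scheduled flush are not lost.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            reporter.report();
            reporter.stop();
        }));
    }
}
{code}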
[jira] [Created] (STORM-3177) MockRemovableFile returns true on `#exists` even after `#delete` is called.
Zhengdai Hu created STORM-3177: -- Summary: MockRemovableFile returns true on `#exists` even after `#delete` is called. Key: STORM-3177 URL: https://issues.apache.org/jira/browse/STORM-3177 Project: Apache Storm Issue Type: Bug Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu Fix For: 2.0.0 See conversation in https://github.com/apache/storm/pull/2788#pullrequestreview-142918985 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
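The expected behavior, as a hypothetical sketch (the real MockRemovableFile API may differ): once #delete has been called, #exists should return false.
{code:java}
// Hypothetical sketch of the corrected mock, not the actual test class.
public class MockRemovableFileSketch {
    private boolean deleted = false;

    public boolean delete() {
        deleted = true;
        return true;
    }

    public boolean exists() {
        return !deleted; // previously this returned true unconditionally
    }
}
{code}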
[jira] [Created] (STORM-3178) Decouple `ClientSupervisorUtils` and refactor metrics registration
Zhengdai Hu created STORM-3178: -- Summary: Decouple `ClientSupervisorUtils` and refactor metrics registration Key: STORM-3178 URL: https://issues.apache.org/jira/browse/STORM-3178 Project: Apache Storm Issue Type: Improvement Components: storm-client, storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu See conversation https://github.com/apache/storm/pull/2710#discussion_r207576736 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3177) MockRemovableFile returns true on `#exists` even after `#delete` is called.
[ https://issues.apache.org/jira/browse/STORM-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3177: --- Fix Version/s: (was: 2.0.0) > MockRemovableFile returns true on `#exists` even after `#delete` is called. > --- > > Key: STORM-3177 > URL: https://issues.apache.org/jira/browse/STORM-3177 > Project: Apache Storm > Issue Type: Bug > Components: storm-webapp >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > > See conversation in > https://github.com/apache/storm/pull/2788#pullrequestreview-142918985 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3128: --- Fix Version/s: (was: 2.0.0) > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Minor > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3128: --- Priority: Major (was: Minor) > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572180#comment-16572180 ] Zhengdai Hu commented on STORM-3128: It looks like not all Zookeeper calls are stubbed correctly. Exceptions are thrown when LocalFsBlobStore#prepare gets called, which calls BlobStoreUtils.createZKClient and ClusterUtils.mkStormClusterState, which then issue connections to zookeeper under the hood. The fact that the exception is suppressed suggests that this is a common issue, but it may be hard to fix. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572180#comment-16572180 ] Zhengdai Hu edited comment on STORM-3128 at 8/7/18 7:10 PM: It looks like not all Zookeeper calls are stubbed correctly. Exceptions are thrown when LocalFsBlobStore#prepare gets called, which calls BlobStoreUtils.createZKClient and ClusterUtils.mkStormClusterState, which then issue connections to zookeeper under the hood. The fact that the exception is suppressed suggests that this is a common issue, but it may be hard to fix. [~Srdo] was (Author: zhengdai): It looks like not all Zookeeper calls are stubbed correctly. Exceptions are thrown when LocalFsBlobStore#prepare gets called, which calls BlobStoreUtils.createZKClient and ClusterUtils.mkStormClusterState, which then issue connections to zookeeper under the hood. The fact that the exception is suppressed suggests that this is a common issue, but it may be hard to fix. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572193#comment-16572193 ] Zhengdai Hu commented on STORM-3128: It doesn't, but it's likely to cause a timeout error. If I remember correctly, StormTimer terminates the process in case of a timeout, with code 20, "Error when processing event", which was what I saw in the test failure message. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572193#comment-16572193 ] Zhengdai Hu edited comment on STORM-3128 at 8/7/18 7:25 PM: It doesn't, but it's likely to cause a timeout error. If I remember correctly, StormTimer terminates the process in case of a timeout, with code 20, "Error when processing event", which was what I saw in the test failure message. [~Srdo] was (Author: zhengdai): It doesn't, but it's likely to cause a timeout error. If I remember correctly, StormTimer terminates the process in case of a timeout, with code 20, "Error when processing event", which was what I saw in the test failure message. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572202#comment-16572202 ] Zhengdai Hu commented on STORM-3128: I might be wrong. Let me take a look at StormTimer again. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but failed to connect to zookeeper due to connection error. I'm not > sure if this compromises the test even though it is passed after connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3128: --- Description: In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created and tries but fails to connect to zookeeper due to a connection error. I'm not sure if this compromises the test even though it passes after the connection retry timeout. But it's nice to keep in mind. {noformat} 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) 2018-06-27 13:05:28.032 [main] INFO org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_171] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_171] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] {noformat} I managed to track down the source where the exception is thrown, but it's really strange that it is called by a StormTimer inside the Supervisor, which is not declared anywhere in this test. I'm completely lost by now.
{noformat} 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR org.apache.storm.zookeeper.ClientZookeeper - e: {} org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /supervisors at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:145) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:292) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.zookeeper.ClientZookeeper.mkdirs(ClientZookeeper.java:70) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.cluster.ZKStateStorage.set_ephemeral_node(ZKStateStorage.java:129) [storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.cluster.StormClusterStateImpl.supervisorHeartbeat(StormClusterStateImpl.java:522) [storm-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] at org.apache.storm.daemon.supervisor.timer.SupervisorHeartbeat.run(SupervisorHeartbeat.java:96) [classes/:?] at org.apache.storm.StormTimer$1.run(StormTimer.java:110) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:226) [st
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573502#comment-16573502 ] Zhengdai Hu edited comment on STORM-3128 at 8/8/18 4:59 PM: I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. [~Srdo] was (Author: zhengdai): I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} > I managed to track down the source where the exception is thrown, but it's > really strange that it is called by a StormTimer inside the Supervisor, which > is not declared anywhere in this test. I'm completely lost by now.
> {noformat} > 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR > org.apache.storm.zookeeper.ClientZookeeper - e: {} > org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /supervisors > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] >
[jira] [Comment Edited] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573502#comment-16573502 ] Zhengdai Hu edited comment on STORM-3128 at 8/8/18 4:59 PM: I added the stack trace of the exception that crashed the test. It looks like other build failures are likely to be caused by the same error. It's really strange though. [~Srdo] was (Author: zhengdai): I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. [~Srdo] > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} > I managed to track down the source where the exception is thrown, but it's > really strange that it is called by a StormTimer inside the Supervisor, which > is not declared anywhere in this test. I'm completely lost by now.
> {noformat} > 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR > org.apache.storm.zookeeper.ClientZookeeper - e: {} > org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /supervisors > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNA
[jira] [Commented] (STORM-3128) Connection refused error in AsyncLocalizerTest
[ https://issues.apache.org/jira/browse/STORM-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573502#comment-16573502 ] Zhengdai Hu commented on STORM-3128: I added the stack trace of the exception that crashed the test. It looks like other build failure are likely to be caused by the same error. It's really strange though. > Connection refused error in AsyncLocalizerTest > -- > > Key: STORM-3128 > URL: https://issues.apache.org/jira/browse/STORM-3128 > Project: Apache Storm > Issue Type: Bug > Components: storm-server >Affects Versions: 2.0.0 >Reporter: Zhengdai Hu >Priority: Major > > In AsyncLocalizerTest testKeyNotFoundException, a localBlobStore is created > and tries but fails to connect to zookeeper due to a connection error. I'm not > sure if this compromises the test even though it passes after the connection > retry timeout. But it's nice to keep in mind. > {noformat} > 2018-06-27 13:05:28.005 [main-SendThread(localhost:2181)] INFO > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Opening socket > connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to > authenticate using SASL (unknown error) > 2018-06-27 13:05:28.032 [main] INFO > org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl > - Default schema > 2018-06-27 13:05:28.035 [main-SendThread(localhost:2181)] WARN > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn - Session 0x0 for > server null, unexpected error, closing socket connection and attempting > reconnect > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > ~[?:1.8.0_171] > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > ~[?:1.8.0_171] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) > [shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > {noformat} > I managed to track down the source where the exception is thrown, but it's > really strange that it is called by a StormTimer inside the Supervisor, which > is not declared anywhere in this test. I'm completely lost by now.
> {noformat} > 2018-08-08 11:45:30.217 [heartbeatTimer] ERROR > org.apache.storm.zookeeper.ClientZookeeper - e: {} > org.apache.storm.shade.org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /supervisors > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:268) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl$3.call(ExistsBuilderImpl.java:257) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForegroundStandard(ExistsBuilderImpl.java:254) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:247) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:206) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.shade.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35) > ~[shaded-deps-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] > at > org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:145) > [storm-client-2.0.0-SNAPSHOT.jar:?] > at > org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:292) > [storm-client-2.0.0-
[jira] [Created] (STORM-3186) Customizable configuration for metric reporting interval
Zhengdai Hu created STORM-3186: -- Summary: Customizable configuration for metric reporting interval Key: STORM-3186 URL: https://issues.apache.org/jira/browse/STORM-3186 Project: Apache Storm Issue Type: Improvement Components: storm-server, storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
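For illustration, a minimal sketch of what a configurable interval could look like, written against the Dropwizard Metrics ScheduledReporter API; the config key name and the 10-second fallback are assumptions for illustration, not the actual patch.
{code:java}
import java.util.Map;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;

public class ConfigurableReporterSketch {
    // Hypothetical config key; the real name would be settled in review.
    private static final String REPORT_INTERVAL_SECS = "storm.daemon.metrics.reporter.interval.secs";

    public static void startReporter(MetricRegistry registry, Map<String, Object> conf) {
        // Fall back to the previously hard-coded 10 seconds when the key is absent.
        Object raw = conf.get(REPORT_INTERVAL_SECS);
        long intervalSecs = (raw instanceof Number) ? ((Number) raw).longValue() : 10L;

        // ConsoleReporter stands in here for any ScheduledReporter subclass.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry).build();
        reporter.start(intervalSecs, TimeUnit.SECONDS);
    }
}
{code}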
[jira] [Updated] (STORM-3186) Customizable configuration for metric reporting interval
[ https://issues.apache.org/jira/browse/STORM-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3186: --- Description: In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs. See discussion https://github.com/apache/storm/pull/2764#discussion_r203726617

was: In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs.

> Customizable configuration for metric reporting interval
>
> Key: STORM-3186
> URL: https://issues.apache.org/jira/browse/STORM-3186
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server, storm-webapp
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs.
> See discussion https://github.com/apache/storm/pull/2764#discussion_r203726617
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (STORM-3186) Customizable configuration for metric reporting interval
[ https://issues.apache.org/jira/browse/STORM-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575004#comment-16575004 ] Zhengdai Hu commented on STORM-3186: See discussion https://github.com/apache/storm/pull/2764#discussion_r203726617

> Customizable configuration for metric reporting interval
>
> Key: STORM-3186
> URL: https://issues.apache.org/jira/browse/STORM-3186
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server, storm-webapp
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> In the current implementation, all subclasses of ScheduledReporter are hard-coded to a report interval of 10 seconds. However, I think it would make sense to make this a configuration item so users can change the reporting frequency to fit their needs.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3187) Nimbus code refactoring and cleanup
Zhengdai Hu created STORM-3187: -- Summary: Nimbus code refactoring and cleanup Key: STORM-3187 URL: https://issues.apache.org/jira/browse/STORM-3187 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu
Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion-diff-208707855R2230
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3187) Nimbus code refactoring and cleanup
[ https://issues.apache.org/jira/browse/STORM-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3187: --- Description:
Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion_r208707855
3. Other performance improvements: https://github.com/apache/storm/pull/2764#discussion_r208714561

was:
Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion-diff-208707855R2230

> Nimbus code refactoring and cleanup
> ---
>
> Key: STORM-3187
> URL: https://issues.apache.org/jira/browse/STORM-3187
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-server
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> Nimbus.java is bloated with legacy code that is convoluted and inefficient. It would be nice if we could clean up the code a bit, especially now that we're moving away from Clojure. Several suggestions were made in STORM-3133, including:
> 1. Remove logging that serves the same purpose as some metrics: https://github.com/apache/storm/pull/2764#discussion_r203727117
> 2. Refactor the data types of return values/parameters to improve readability: https://github.com/apache/storm/pull/2764#discussion_r208699933 https://github.com/apache/storm/pull/2764#discussion_r208721202 https://github.com/apache/storm/pull/2764#discussion_r208707855
> 3. Other performance improvements: https://github.com/apache/storm/pull/2764#discussion_r208714561
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
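To illustrate suggestion 1, a hedged sketch (hypothetical class and metric names; the actual call sites are in the linked review threads) of replacing a log line that exists only to count events with a meter:
{code:java}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class LoggingToMetricsSketch {
    private static final MetricRegistry REGISTRY = new MetricRegistry();
    // Hypothetical metric name, for illustration only.
    private static final Meter SUBMITTED = REGISTRY.meter("nimbus:num-topology-submissions");

    void onTopologySubmitted() {
        // Before: LOG.info("Received topology submission"), whose only purpose
        // was counting submissions. After: mark a meter instead, so the count
        // is reported and aggregated like any other metric.
        SUBMITTED.mark();
    }
}
{code}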
[jira] [Created] (STORM-3188) Removing try-catch block from getAndResetWorkerHeartbeats
Zhengdai Hu created STORM-3188: -- Summary: Removing try-catch block from getAndResetWorkerHeartbeats Key: STORM-3188 URL: https://issues.apache.org/jira/browse/STORM-3188 Project: Apache Storm Issue Type: Improvement Components: storm-server Affects Versions: 2.0.0 Reporter: Zhengdai Hu After refactoring, SupervisorUtils.readWorkerHeartbeats no longer throws checked exceptions. I'm wondering whether we still want to keep the try-catch block wrapping its invocation in getAndResetWorkerHeartbeats in ReportWorkerHeartbeats.java. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
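A toy illustration of the cleanup being proposed (hypothetical signatures; the real code lives in storm-server):
{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HeartbeatSketch {
    // After the refactor, the utility no longer declares any checked exception...
    static Map<String, Object> readWorkerHeartbeats(Map<String, Object> conf) {
        return Collections.emptyMap();
    }

    // ...so the caller can invoke it directly; the old try { ... } catch (Exception e)
    // wrapper around this call no longer guards anything the compiler requires.
    static Map<String, Object> getAndResetWorkerHeartbeats() {
        return readWorkerHeartbeats(new HashMap<>());
    }
}
{code}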
[jira] [Created] (STORM-3189) Remove unused data file LogViewer api
Zhengdai Hu created STORM-3189: -- Summary: Remove unused data file LogViewer api Key: STORM-3189 URL: https://issues.apache.org/jira/browse/STORM-3189 Project: Apache Storm Issue Type: Improvement Components: storm-webapp Affects Versions: 2.0.0 Reporter: Zhengdai Hu
Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
- `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
- If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
- If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3189) Remove unused data file LogViewer api
[ https://issues.apache.org/jira/browse/STORM-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3189: --- Description:
Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
- `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
- If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
- If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).
See
https://github.com/apache/storm/pull/2754#discussion_r208691016
https://github.com/apache/storm/pull/2754#discussion_r208726809

was:
Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
- `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
- If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
- If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).

> Remove unused data file LogViewer api
> -
>
> Key: STORM-3189
> URL: https://issues.apache.org/jira/browse/STORM-3189
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-webapp
> Affects Versions: 2.0.0
> Reporter: Zhengdai Hu
> Priority: Major
>
> Discovered in STORM-3133. `findNMatches` in LogviewerLogSearchHandler returns a `Matched` object which contains a field `fileOffset`. However, in the current implementation, `fileOffset` behaves a bit oddly and is not used anywhere in the app. I'm wondering if we should remove this field altogether. Specifically, the behavior is as follows:
> - `fileOffset` is passed in as the desired number of files to skip in the search (equivalent to the index of the first file to search).
> - If the desired number of matches is found, `fileOffset` will be the index of the last scanned file (starting from 0).
> - If not enough matches are found in all logs, `fileOffset` will be the number of all logs (equivalent to one past the index of the last file).
> See
> https://github.com/apache/storm/pull/2754#discussion_r208691016
> https://github.com/apache/storm/pull/2754#discussion_r208726809
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
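To make the described semantics concrete, here is a toy model of the fileOffset bookkeeping (this is not the real LogviewerLogSearchHandler code; for simplicity it counts matching files rather than individual matches):
{code:java}
import java.util.Arrays;
import java.util.List;

public class FileOffsetSketch {
    /**
     * Scans files[skip..], counting files that contain {@code target}, and
     * returns the outgoing fileOffset under the semantics described above.
     */
    static int search(List<String> files, int skip, String target, int wanted) {
        int matches = 0;
        for (int i = skip; i < files.size(); i++) {
            if (files.get(i).contains(target) && ++matches == wanted) {
                return i;          // enough matches: index of the last scanned file
            }
        }
        return files.size();       // too few matches: one past the last file index
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList("err err", "ok", "err");
        System.out.println(search(logs, 0, "err", 1)); // 0: the first file satisfies the search
        System.out.println(search(logs, 1, "err", 5)); // 3: scanned to the end, too few matches
    }
}
{code}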
[jira] [Commented] (STORM-3190) Unnecessary null check of directory stream in LogCleaner
[ https://issues.apache.org/jira/browse/STORM-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576392#comment-16576392 ] Zhengdai Hu commented on STORM-3190: We can further simplify this.
{code:java}
private long lastModifiedTimeWorkerLogdir(File logDir) {
    long dirModified = logDir.lastModified();
    try (DirectoryStream<Path> dirStream = directoryCleaner.getStreamForDirectory(logDir)) {
        if (dirStream != null) {
            try {
                return StreamSupport.stream(dirStream.spliterator(), false)
                    .reduce(dirModified, (maximum, path) -> {
                        long curr = path.toFile().lastModified();
                        return curr > maximum ? curr : maximum;
                    }, BinaryOperator.maxBy(Long::compareTo));
            } catch (Exception ex) {
                LOG.error(ex.getMessage(), ex);
            }
        }
    } catch (IOException ignored) {}
    return dirModified;
}
{code}
> Unnecessary null check of directory stream in LogCleaner
>
> Key: STORM-3190
> URL: https://issues.apache.org/jira/browse/STORM-3190
> Project: Apache Storm
> Issue Type: Task
> Components: storm-webapp
> Affects Versions: 2.0.0
> Reporter: Stig Rohde Døssing
> Priority: Trivial
>
> This should be using try-with-resources
> https://github.com/apache/storm/blob/a1b3e02aab57b4e458b8b5763a0d467852906bb7/storm-webapp/src/main/java/org/apache/storm/daemon/logviewer/utils/LogCleaner.java#L263
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (STORM-3190) Unnecessary null check of directory stream in LogCleaner
[ https://issues.apache.org/jira/browse/STORM-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576392#comment-16576392 ] Zhengdai Hu edited comment on STORM-3190 at 8/10/18 3:18 PM: - We can further simplify this.
{code:java}
private long lastModifiedTimeWorkerLogdir(File logDir) {
    long dirModified = logDir.lastModified();
    try (DirectoryStream<Path> dirStream = directoryCleaner.getStreamForDirectory(logDir)) {
        if (dirStream != null) {
            try {
                return StreamSupport.stream(dirStream.spliterator(), false)
                    .mapToLong(path -> path.toFile().lastModified())
                    .reduce(dirModified, Math::max);
            } catch (Exception ex) {
                LOG.error(ex.getMessage(), ex);
            }
        }
    } catch (IOException ignored) {}
    return dirModified;
}
{code}
was (Author: zhengdai): We can further simplify this.
{code:java}
private long lastModifiedTimeWorkerLogdir(File logDir) {
    long dirModified = logDir.lastModified();
    try (DirectoryStream<Path> dirStream = directoryCleaner.getStreamForDirectory(logDir)) {
        if (dirStream != null) {
            try {
                return StreamSupport.stream(dirStream.spliterator(), false)
                    .reduce(dirModified, (maximum, path) -> {
                        long curr = path.toFile().lastModified();
                        return curr > maximum ? curr : maximum;
                    }, BinaryOperator.maxBy(Long::compareTo));
            } catch (Exception ex) {
                LOG.error(ex.getMessage(), ex);
            }
        }
    } catch (IOException ignored) {}
    return dirModified;
}
{code}
> Unnecessary null check of directory stream in LogCleaner
>
> Key: STORM-3190
> URL: https://issues.apache.org/jira/browse/STORM-3190
> Project: Apache Storm
> Issue Type: Task
> Components: storm-webapp
> Affects Versions: 2.0.0
> Reporter: Stig Rohde Døssing
> Priority: Trivial
>
> This should be using try-with-resources
> https://github.com/apache/storm/blob/a1b3e02aab57b4e458b8b5763a0d467852906bb7/storm-webapp/src/main/java/org/apache/storm/daemon/logviewer/utils/LogCleaner.java#L263
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (STORM-3191) Migrate more items from
Zhengdai Hu created STORM-3191: -- Summary: Migrate more items from Key: STORM-3191 URL: https://issues.apache.org/jira/browse/STORM-3191 Project: Apache Storm Issue Type: Improvement Reporter: Zhengdai Hu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Summary: Migrate more items from ClusterSummary to metrics (was: Migrate more items from ) > Migrate more items from ClusterSummary to metrics > - > > Key: STORM-3191 > URL: https://issues.apache.org/jira/browse/STORM-3191 > Project: Apache Storm > Issue Type: Improvement >Reporter: Zhengdai Hu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Priority: Minor (was: Major)

> Migrate more items from ClusterSummary to metrics
> -
>
> Key: STORM-3191
> URL: https://issues.apache.org/jira/browse/STORM-3191
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Zhengdai Hu
> Priority: Minor
>
> The following summary items haven't been ported as nimbus metrics yet.
> // Declared in StormConf. I don't see the value in reporting it.
> SUPERVISOR_TOTAL_RESOURCE,
> // May be able to aggregate based on status:
> TOPOLOGY_STATUS,
> TOPOLOGY_SCHED_STATUS,
> // May be aggregated, e.g., as distinct values:
> NUM_DISTINCT_NIMBUS_VERSION;
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Description:
The following summary items haven't been ported as nimbus metrics yet.
// Declared in StormConf. I don't see the value in reporting it.
SUPERVISOR_TOTAL_RESOURCE,
// May be able to aggregate based on status:
TOPOLOGY_STATUS,
TOPOLOGY_SCHED_STATUS,
// May be aggregated, e.g., as distinct values:
NUM_DISTINCT_NIMBUS_VERSION;

> Migrate more items from ClusterSummary to metrics
> -
>
> Key: STORM-3191
> URL: https://issues.apache.org/jira/browse/STORM-3191
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Zhengdai Hu
> Priority: Major
>
> The following summary items haven't been ported as nimbus metrics yet.
> // Declared in StormConf. I don't see the value in reporting it.
> SUPERVISOR_TOTAL_RESOURCE,
> // May be able to aggregate based on status:
> TOPOLOGY_STATUS,
> TOPOLOGY_SCHED_STATUS,
> // May be aggregated, e.g., as distinct values:
> NUM_DISTINCT_NIMBUS_VERSION;
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (STORM-3191) Migrate more items from ClusterSummary to metrics
[ https://issues.apache.org/jira/browse/STORM-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhengdai Hu updated STORM-3191: --- Description:
The following summary items haven't been ported as nimbus metrics yet.
// Declared in StormConf. I don't see the value in reporting it.
SUPERVISOR_TOTAL_RESOURCE,
// May be able to aggregate based on status:
TOPOLOGY_STATUS,
TOPOLOGY_SCHED_STATUS,
// May be aggregated, e.g., as distinct values:
NUM_DISTINCT_NIMBUS_VERSION;

was:
The following summary items haven't been ported as nimbus metrics yet.
// Declared in StormConf. I don't see the value in reporting it.
SUPERVISOR_TOTAL_RESOURCE,
// May be able to aggregate based on status:
TOPOLOGY_STATUS,
TOPOLOGY_SCHED_STATUS,
// May be aggregated, e.g., as distinct values:
NUM_DISTINCT_NIMBUS_VERSION;

> Migrate more items from ClusterSummary to metrics
> -
>
> Key: STORM-3191
> URL: https://issues.apache.org/jira/browse/STORM-3191
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Zhengdai Hu
> Priority: Minor
>
> The following summary items haven't been ported as nimbus metrics yet.
> // Declared in StormConf. I don't see the value in reporting it.
> SUPERVISOR_TOTAL_RESOURCE,
> // May be able to aggregate based on status:
> TOPOLOGY_STATUS,
> TOPOLOGY_SCHED_STATUS,
> // May be aggregated, e.g., as distinct values:
> NUM_DISTINCT_NIMBUS_VERSION;
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
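For the "distinct values" aggregation suggested for NUM_DISTINCT_NIMBUS_VERSION, a gauge sketch could look like this (the metric name and the versions supplier are assumptions for illustration):
{code:java}
import java.util.Set;

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class DistinctNimbusVersionsSketch {
    static void register(MetricRegistry registry, Set<String> liveNimbusVersions) {
        // Expose only the count of distinct versions; maintaining the set of
        // versions reported by live nimbuses is assumed to happen elsewhere.
        registry.register("nimbuses:num-distinct-versions",
                          (Gauge<Integer>) liveNimbusVersions::size);
    }
}
{code}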
[jira] [Created] (STORM-3193) Improve LogviewerLogSearchHandler
Zhengdai Hu created STORM-3193: -- Summary: Improve LogviewerLogSearchHandler Key: STORM-3193 URL: https://issues.apache.org/jira/browse/STORM-3193 Project: Apache Storm Issue Type: Improvement Reporter: Zhengdai Hu
One thing worth noticing: Storm UI currently interweaves different search APIs for its search functionality, which is kind of confusing. Specifically:
- For the search button on the homepage, it uses a single deep-search API call to search all ports (server-side processing), both archived and non-archived.
- For a non-archived search on a specific topology page, it invokes the search API on each port inside a loop (client-side processing).
- For an archived search on a specific topology page, it invokes the deep-search API (search-archived=on) on each port inside a loop (client-side processing).
As a result, metrics for these APIs may not accurately reflect how many searches are invoked from the client's perspective. Additionally, `findNMatches` can be simplified, along with STORM-3189.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
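A small sketch of why the per-call numbers can mislead (hypothetical metric name): when a topology page loops over worker ports on the client side, one user-initiated search marks the API meter once per port, so the meter over-counts searches as users perceive them.
{code:java}
import java.util.Arrays;
import java.util.List;

import com.codahale.metrics.MetricRegistry;

public class SearchMetricsSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        List<Integer> workerPorts = Arrays.asList(6700, 6701, 6702);

        // One search from the user's perspective...
        for (int port : workerPorts) {
            // ...but each per-port request marks the shared meter once.
            registry.meter("logviewer:num-search-requests").mark();
        }

        // Prints 3, not 1.
        System.out.println(registry.meter("logviewer:num-search-requests").getCount());
    }
}
{code}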