[jira] [Commented] (HELIX-785) Report helix latency instead of user latency during top state handoff

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673659#comment-16673659
 ] 

ASF GitHub Bot commented on HELIX-785:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/292


> Report helix latency instead of user latency during top state handoff
> -
>
> Key: HELIX-785
> URL: https://issues.apache.org/jira/browse/HELIX-785
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Currently we report user latency for top state handoff, but we should 
> report Helix latency instead. Users should have their own way of monitoring 
> their own state transitions.
> AC:
> 1. Implement reporting Helix latency for top state handoff and test it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HELIX-785) Report helix latency instead of user latency during top state handoff

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673635#comment-16673635
 ] 

ASF GitHub Bot commented on HELIX-785:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/292

[HELIX-785] Record helix latency instead of user latency in top state 
handoff metrics

- top state handoff reports helix latency instead of user latency
- modified test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/top-state-handoff-metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/292.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #292


commit 37a58cfff91fb5f6608a4a06d1922bb5a5eb9ca1
Author: Harry Zhang 
Date:   2018-11-02T18:30:15Z

[HELIX-785] Record helix latency instead of user latency in top state 
handoff metrics






[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672463#comment-16672463
 ] 

ASF GitHub Bot commented on HELIX-780:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/289


> Support get/add rest api for workflow/job/task user content
> ---
>
> Key: HELIX-780
> URL: https://issues.apache.org/jira/browse/HELIX-780
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Need to support get/add rest api for workflow/job/task user content
> AC:
>  * finish implementation
>  * test code





[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672412#comment-16672412
 ] 

ASF GitHub Bot commented on HELIX-780:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/289

[HELIX-780] add task user content related api and added more tests

- added get/add task user content rest api
- consolidated rest api behavior: when getting/adding user content, if 
job/workflow does not exist, throw 404
- added more test cases
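
The consolidated 404 behavior described above can be illustrated with a minimal sketch. It does not use the Helix REST classes: `UserContentEndpointSketch`, `Response`, and `registerWorkflow` are hypothetical names, and plain integer status codes stand in for real HTTP responses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a REST accessor; not the Helix REST implementation.
final class UserContentEndpointSketch {
  static final int OK = 200;
  static final int NOT_FOUND = 404;

  // Minimal response holder: a status code plus an optional body.
  static final class Response {
    final int status;
    final Map<String, String> body;

    Response(int status, Map<String, String> body) {
      this.status = status;
      this.body = body;
    }
  }

  private final Map<String, Map<String, String>> contentByWorkflow = new HashMap<>();

  void registerWorkflow(String workflowName) {
    contentByWorkflow.putIfAbsent(workflowName, new HashMap<>());
  }

  // GET: an unknown workflow yields 404 rather than an empty map.
  Response getUserContent(String workflowName) {
    Map<String, String> content = contentByWorkflow.get(workflowName);
    if (content == null) {
      return new Response(NOT_FOUND, null);
    }
    return new Response(OK, new HashMap<>(content));
  }

  // ADD: the same existence check applies before any write happens.
  Response addUserContent(String workflowName, Map<String, String> toAdd) {
    Map<String, String> content = contentByWorkflow.get(workflowName);
    if (content == null) {
      return new Response(NOT_FOUND, null);
    }
    content.putAll(toAdd);
    return new Response(OK, new HashMap<>(content));
  }
}
```

Both paths share the same existence check, which is the point of consolidating the behavior: a caller sees 404 for a missing workflow regardless of whether it reads or writes.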

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/tf-rest-api

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #289


commit 18aa67b6d5c703e5b938b2f915f52a6ca856e889
Author: Harry Zhang 
Date:   2018-10-09T21:31:00Z

[HELIX-780] add task user content related api and added more tests






[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672227#comment-16672227
 ] 

ASF GitHub Bot commented on HELIX-780:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/287




[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672053#comment-16672053
 ] 

ASF GitHub Bot commented on HELIX-780:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/287

[HELIX-780] add get/add job user content rest api

added apis and tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/tf-rest-api

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #287


commit a09a18ac55464c3e399800b4474ccb6e64d168ec
Author: Harry Zhang 
Date:   2018-10-08T22:36:53Z

[HELIX-780] add get/add job user content rest api






[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672032#comment-16672032
 ] 

ASF GitHub Bot commented on HELIX-780:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/286




[jira] [Commented] (HELIX-779) Maintenance rebalancer should not clear preference list in ideal state

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671989#comment-16671989
 ] 

ASF GitHub Bot commented on HELIX-779:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/285


> Maintenance rebalancer should not clear preference list in ideal state
> --
>
> Key: HELIX-779
> URL: https://issues.apache.org/jira/browse/HELIX-779
> Project: Apache Helix
>  Issue Type: Bug
>  Components: helix-core
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Setting list fields to an empty map will prevent resources that were newly 
> added and initially rebalanced during maintenance mode from getting 
> rebalanced after the cluster exits maintenance mode.
> The right thing to do is to clear every preference list.





[jira] [Commented] (HELIX-779) Maintenance rebalancer should not clear preference list in ideal state

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671943#comment-16671943
 ] 

ASF GitHub Bot commented on HELIX-779:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/285

[HELIX-779] do not clean list field in maintenance rebalancer for new 
resources

Setting list fields to an empty map will prevent resources that were newly 
added and initially rebalanced during maintenance mode from getting 
rebalanced after the cluster exits maintenance mode.
The right thing to do is to clear every preference list.


Also added test case to verify
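
One reading of the description above, sketched with plain collections rather than the Helix IdealState class: replacing the whole list-field map with an empty map drops the partition keys, while clearing each preference list keeps every partition present with an empty list. `PreferenceListSketch` and both method names are hypothetical and only illustrate the difference.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; partition name -> preference list of instance names.
final class PreferenceListSketch {
  // Problematic variant: the partition keys themselves disappear.
  static Map<String, List<String>> replaceWithEmptyMap(Map<String, List<String>> listFields) {
    return new LinkedHashMap<>();
  }

  // Variant described as the right thing: keep each partition, empty its list.
  static Map<String, List<String>> clearEachPreferenceList(Map<String, List<String>> listFields) {
    Map<String, List<String>> cleared = new LinkedHashMap<>();
    for (String partition : listFields.keySet()) {
      cleared.put(partition, new ArrayList<>());
    }
    return cleared;
  }
}
```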

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/maintenance-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #285


commit bfaa8399529b6e63b307c1fbe60903c3ca08fbb1
Author: Harry Zhang 
Date:   2018-10-04T22:50:16Z

[HELIX-779] do not clean list field in maintenance rebalancer for new 
resources






[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670734#comment-16670734
 ] 

ASF GitHub Bot commented on HELIX-775:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/283


> Task driver should support add/get task framework user content
> --
>
> Key: HELIX-775
> URL: https://issues.apache.org/jira/browse/HELIX-775
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Task driver should support add/get task framework user content at 
> workflow/job/task levels
>  
> AC:
>  * finish implementation
>  * add tests





[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670730#comment-16670730
 ] 

ASF GitHub Bot commented on HELIX-775:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/283

[HELIX-775] consolidate user content related apis for task driver

HELIX-1315: consolidate user content related apis for task driver


To consolidate the task driver's user content APIs and the corresponding 
REST APIs, I'm deprecating the general getUserContent() API. Instead, we 
now have the following APIs to get / add / update user content.

```java
public void addOrUpdateWorkflowUserContentMap(String workflowName,
    final Map contentToAddOrUpdate);

public void addOrUpdateJobUserContentMap(String workflowName, String jobName,
    final Map contentToAddOrUpdate);

public void addOrUpdateTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId, final Map contentToAddOrUpdate);

public Map getWorkflowUserContentMap(String workflowName);

public Map getJobUserContentMap(String workflowName, String jobName);

public Map getTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId);
```

The delete user content API is TBD, but it can use the same convention.
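
To show how the three scopes fit together, here is a minimal in-memory sketch that mirrors the method names listed above. It is not the TaskDriver implementation: `InMemoryUserContentStore` is a hypothetical class, the `Map<String, String>` type parameters are an assumption (the signatures above show only a raw `Map`), and the `/`-joined scope key assumes names do not themselves contain `/`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in; the real task driver persists user content elsewhere.
final class InMemoryUserContentStore {
  private final Map<String, Map<String, String>> contentByScope = new ConcurrentHashMap<>();

  public void addOrUpdateWorkflowUserContentMap(String workflowName,
      final Map<String, String> contentToAddOrUpdate) {
    merge(scope(workflowName), contentToAddOrUpdate);
  }

  public void addOrUpdateJobUserContentMap(String workflowName, String jobName,
      final Map<String, String> contentToAddOrUpdate) {
    merge(scope(workflowName, jobName), contentToAddOrUpdate);
  }

  public void addOrUpdateTaskUserContentMap(String workflowName, String jobName,
      String taskPartitionId, final Map<String, String> contentToAddOrUpdate) {
    merge(scope(workflowName, jobName, taskPartitionId), contentToAddOrUpdate);
  }

  public Map<String, String> getWorkflowUserContentMap(String workflowName) {
    return read(scope(workflowName));
  }

  public Map<String, String> getJobUserContentMap(String workflowName, String jobName) {
    return read(scope(workflowName, jobName));
  }

  public Map<String, String> getTaskUserContentMap(String workflowName, String jobName,
      String taskPartitionId) {
    return read(scope(workflowName, jobName, taskPartitionId));
  }

  // Assumption: names contain no '/', so joined scopes cannot collide.
  private static String scope(String... parts) {
    return String.join("/", parts);
  }

  // Existing keys are overwritten, new keys are added; nothing is removed.
  private void merge(String scope, Map<String, String> content) {
    contentByScope.computeIfAbsent(scope, k -> new ConcurrentHashMap<>()).putAll(content);
  }

  // Returns a copy so callers cannot mutate stored content.
  private Map<String, String> read(String scope) {
    Map<String, String> stored = contentByScope.get(scope);
    return stored == null ? new HashMap<>() : new HashMap<>(stored);
  }
}
```

Each scope is stored independently in this sketch, so content written at the workflow level is not visible through the job or task getters.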

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-user-content

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/283.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #283


commit b235c4ee5a82c5970d29e839317ea242813a58bc
Author: Harry Zhang 
Date:   2018-10-04T18:25:08Z

[HELIX-775] consolidate user content related apis for task driver






[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670714#comment-16670714
 ] 

ASF GitHub Bot commented on HELIX-775:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/282




[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670711#comment-16670711
 ] 

ASF GitHub Bot commented on HELIX-775:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/282

[HELIX-775] add task driver support for helix rest to add/get task 
framework user content


consolidate user content related apis for task driver


To consolidate the task driver's user content APIs and the corresponding 
REST APIs, I'm deprecating the general getUserContent() API. Instead, we 
now have the following APIs to get / add / update user content.

```java
public void addOrUpdateWorkflowUserContentMap(String workflowName,
    final Map contentToAddOrUpdate);

public void addOrUpdateJobUserContentMap(String workflowName, String jobName,
    final Map contentToAddOrUpdate);

public void addOrUpdateTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId, final Map contentToAddOrUpdate);

public Map getWorkflowUserContentMap(String workflowName);

public Map getJobUserContentMap(String workflowName, String jobName);

public Map getTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId);
```

The API for deleting user content is TBD, but it can use the same convention.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-user-content

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/282.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #282


commit 7ec5313bccb679014d6a0605ee5d7184063e555e
Author: Harry Zhang 
Date:   2018-10-31T20:55:44Z

[HELIX-775] add task driver support for helix rest to add/get task 
framework user content






[jira] [Commented] (HELIX-773) Support getLastScheduledTaskTimestamp information in workflow rest api

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670702#comment-16670702
 ] 

ASF GitHub Bot commented on HELIX-773:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/281


> Support getLastScheduledTaskTimestamp information in workflow rest api
> --
>
> Key: HELIX-773
> URL: https://issues.apache.org/jira/browse/HELIX-773
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Support getLastScheduledTaskTimestamp information in workflow rest api





[jira] [Commented] (HELIX-772) Support TaskDriver.addUserContent() api

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670698#comment-16670698
 ] 

ASF GitHub Bot commented on HELIX-772:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/280


> Support TaskDriver.addUserContent() api
> ---
>
> Key: HELIX-772
> URL: https://issues.apache.org/jira/browse/HELIX-772
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Need to support add user content in task driver
>  
> AC:
>  * implement API
>  * add test
>  





[jira] [Commented] (HELIX-773) Support getLastScheduledTaskTimestamp information in workflow rest api

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669438#comment-16669438
 ] 

ASF GitHub Bot commented on HELIX-773:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/281

[HELIX-773] add getLastScheduledTaskTimestamp information in workflow rest 
API

- Added TaskExecutionInfo object to wrap task execution information
- added TaskExecutionInfo to last scheduled task in workflow property in 
workflow rest API
- Modified related tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/workflow-rest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/281.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #281


commit 917f6b7ee1b2b44b10eea7e5de7f07aa7f184618
Author: Harry Zhang 
Date:   2018-10-30T23:43:25Z

[HELIX-773] add getLastScheduledTaskTimestamp information in workflow rest 
api






[jira] [Commented] (HELIX-772) Support TaskDriver.addUserContent() api

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669436#comment-16669436
 ] 

ASF GitHub Bot commented on HELIX-772:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/280

[HELIX-772] add TaskDriver.addUserContent() api and related tests


Implemented TaskDriver.addUserContent()
Added test (TestGetSetUserContentStore) for testing all getter/setter for 
user content
Modified unstable TestIndependentTaskRebalancer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/add-user-content

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #280


commit df24f5975bd517626490f14e6e038f8370ddd815
Author: Harry Zhang 
Date:   2018-10-30T23:25:12Z

[HELIX-772] add TaskDriver.addUserContent() api and related tests






[jira] [Commented] (HELIX-771) More detailed top state handoff metrics

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669418#comment-16669418
 ] 

ASF GitHub Bot commented on HELIX-771:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/278


> More detailed top state handoff metrics
> ---
>
> Key: HELIX-771
> URL: https://issues.apache.org/jira/browse/HELIX-771
> Project: Apache Helix
>  Issue Type: Bug
>  Components: helix-core
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> To define a top state handoff SLA, we need some more detailed data:
>  * graceful top state handoff (e.g. disabling an instance or resource; both 
> Helix and e2e latency)
>  * abrupt top state handoff (e.g. node crash)
> AC:
>  - prepare metrics, test, code complete





[jira] [Commented] (HELIX-771) More detailed top state handoff metrics

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669413#comment-16669413
 ] 

ASF GitHub Bot commented on HELIX-771:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/278

[HELIX-771] More detailed top state handoff metrics


Added more detail to the top state handoff metrics to distinguish Helix 
latency from user latency.


We define 2 types of handoff:
- Graceful (controlled top state handoff, e.g. disabling an instance, 
load balancing, etc.)
- Non-graceful (uncontrolled top state handoff, e.g. node crash, etc.)


For graceful handoff, we record total handoff latency and user latency.
For non-graceful handoff, we record total handoff latency only.


Moved top state handoff metrics to an independent stage to make the logic 
cleaner.
Refactored TestTopStateHandoffmetrics to make it cleaner and to use JSON more 
natively.
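
As a rough illustration of the recording rule in this description (total and user latency for graceful handoffs, total latency only for non-graceful ones), here is a minimal sketch. `TopStateHandoffRecorder`, `HandoffType`, and the list-backed storage are hypothetical and are not the metrics classes touched by this pull request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical recorder; real metrics would go to a monitoring bean, not lists.
final class TopStateHandoffRecorder {
  enum HandoffType { GRACEFUL, NON_GRACEFUL }

  private final List<Long> gracefulTotalMs = new ArrayList<>();
  private final List<Long> gracefulUserMs = new ArrayList<>();
  private final List<Long> nonGracefulTotalMs = new ArrayList<>();

  void record(HandoffType type, long totalLatencyMs, long userLatencyMs) {
    if (type == HandoffType.GRACEFUL) {
      // Graceful: keep both the total handoff latency and the user part.
      gracefulTotalMs.add(totalLatencyMs);
      gracefulUserMs.add(userLatencyMs);
    } else {
      // Non-graceful: only the total handoff latency is recorded.
      nonGracefulTotalMs.add(totalLatencyMs);
    }
  }

  List<Long> gracefulTotalMs() {
    return Collections.unmodifiableList(gracefulTotalMs);
  }

  List<Long> gracefulUserMs() {
    return Collections.unmodifiableList(gracefulUserMs);
  }

  List<Long> nonGracefulTotalMs() {
    return Collections.unmodifiableList(nonGracefulTotalMs);
  }
}
```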

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/topstate-metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #278


commit 7e49f995e29ea200fcc42ce6af148ed521979f5c
Author: Harry Zhang 
Date:   2018-10-30T22:55:20Z

[HELIX-771] More detailed top state handoff metrics






[jira] [Commented] (HELIX-770) HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667948#comment-16667948
 ] 

ASF GitHub Bot commented on HELIX-770:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/277


> HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage
> --
>
> Key: HELIX-770
> URL: https://issues.apache.org/jira/browse/HELIX-770
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Assignee: Hunter L
>Priority: Major
>
> In isLoadBalanceDownwardForAllReplicas() in IntermediateStateCalcStage, 
> statePriorityMap was throwing an NPE because the partition contained a replica 
> in ERROR state, and the map did not have an entry for it. To amend the issue, 
> Venice added the ERROR state in the state model with a priority, and Helix 
> added checks to prevent NPEs.
> Changelist:
> 1. Add containsKey checks in isLoadBalanceDownwardForAllReplicas()
> 2. Make the Controller correctly log all partitions with ERROR state replicas
> 3. Add HelixDefinedStates in statePriorityList if not already added





[jira] [Commented] (HELIX-770) HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667902#comment-16667902
 ] 

ASF GitHub Bot commented on HELIX-770:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/277

[HELIX-770] HELIX: Fix a possible NPE in loadBalance in 
IntermediateStateCalcStage

In isLoadBalanceDownwardForAllReplicas() in IntermediateStateCalcStage, 
statePriorityMap was throwing an NPE because the partition contained a replica 
in ERROR state, and the map did not have an entry for it. To amend the issue, 
Venice added the ERROR state in the state model with a priority, and Helix 
added checks to prevent NPEs.
Changelist:
1. Add containsKey checks in isLoadBalanceDownwardForAllReplicas()
2. Make the Controller correctly log all partitions with ERROR state 
replicas
3. Add HelixDefinedStates in statePriorityList if not already added
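
Items 1 and 3 of the changelist can be sketched as follows. This is a simplified illustration, not the code in IntermediateStateCalcStage: `StatePrioritySketch`, `isDownward`, and `withDefinedStates` are hypothetical names, and the convention that a smaller number means a higher priority is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the guard; not the Helix stage itself.
final class StatePrioritySketch {
  // Without the containsKey checks, prio.get(state) returns null for an
  // unmapped state such as ERROR, and unboxing that null throws an NPE.
  // Assumption: smaller number = higher priority, so "downward" means the
  // target state has a larger number than the current state.
  static boolean isDownward(String currentState, String targetState,
      Map<String, Integer> prio) {
    if (!prio.containsKey(currentState) || !prio.containsKey(targetState)) {
      return false;
    }
    return prio.get(currentState) < prio.get(targetState);
  }

  // Appends any defined state that the priority list does not already contain.
  static List<String> withDefinedStates(List<String> statePriorityList,
      List<String> definedStates) {
    List<String> result = new ArrayList<>(statePriorityList);
    for (String state : definedStates) {
      if (!result.contains(state)) {
        result.add(state);
      }
    }
    return result;
  }
}
```

Returning `false` for an unmapped state is one possible policy chosen for this sketch; the description above does not say how the actual check treats such replicas beyond avoiding the NPE.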

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/277.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #277


commit 7bc70e24abd89611580098670ed02b2736ccfac0
Author: Hunter Lee 
Date:   2018-10-29T23:50:41Z

[HELIX-770] HELIX: Fix a possible NPE in loadBalance in 
IntermediateStateCalcStage

In isLoadBalanceDownwardForAllReplicas() in IntermediateStateCalcStage, 
statePriorityMap was throwing an NPE because the partition contained a replica 
in ERROR state, and the map did not have an entry for it. To amend the issue, 
Venice added the ERROR state in the state model with a priority, and Helix 
added checks to prevent NPEs.
Changelist:
1. Add containsKey checks in isLoadBalanceDownwardForAllReplicas()
2. Make the Controller correctly log all partitions with ERROR state 
replicas
3. Add HelixDefinedStates in statePriorityList if not already added






[jira] [Commented] (HELIX-756) TASK: Change LOG mode from info to debug

2018-10-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665820#comment-16665820
 ] 

ASF GitHub Bot commented on HELIX-756:
--

Github user narendly closed the pull request at:

https://github.com/apache/helix/pull/271


> TASK: Change LOG mode from info to debug
> 
>
> Key: HELIX-756
> URL: https://issues.apache.org/jira/browse/HELIX-756
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Assignee: Hunter L
>Priority: Major
>
> In production, some users were observed running thousands of 
> tasks. Because AssignableInstance logs one line for each task 
> assigned or released, the volume of log output was excessive 
> and too verbose.
> Changelist:
> 1. Change the log level from info to debug in AssignableInstance and 
> AssignableInstanceManager





[jira] [Commented] (HELIX-753) Record top state handoff finished in single cluster data cache refresh

2018-10-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664398#comment-16664398
 ] 

ASF GitHub Bot commented on HELIX-753:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/270


> Record top state handoff finished in single cluster data cache refresh
> --
>
> Key: HELIX-753
> URL: https://issues.apache.org/jira/browse/HELIX-753
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Currently we are calculating top state handoff duration by doing the 
> following:
>  - record missing top state when we see a top state missing
>  - record top state come back when we see it come back
>  - report top state handoff duration
> This is perfectly fine for non-P2P state transitions, as the entire top state 
> handoff process always takes >= 2 pipeline runs. However, for P2P 
> enabled clusters, top state handoffs are quick, and if a handoff is quicker than 
> the cluster data refresh stage latency, we miss it. We lose a lot of short top 
> state handoffs this way, which makes the numbers look miserable on ingraph.
> We need to revise the top state handoff metrics implementation so we don't lose 
> data points statistically (i.e. we are losing all short handoffs now).
> AC:
>  - revise impl so we catch those short top state hand-offs
>  - write new tests to catch the fix if needed





[jira] [Commented] (HELIX-756) TASK: Change LOG mode from info to debug

2018-09-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627683#comment-16627683
 ] 

ASF GitHub Bot commented on HELIX-756:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/271

[HELIX-756] TASK: Change LOG mode from info to debug



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/271.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #271


commit 5140db0c50439c115d0c7d2637f7ad723f6f147a
Author: Hunter Lee 
Date:   2018-09-25T17:19:31Z

[HELIX-754] TASK: Fix LiveInstanceCurrentState change flag

Previously, existsLiveInstanceOrCurrentStateChange was getting reset in 
ClusterDataCache when its getter was called. This was problematic because if 
there were multiple jobs or multiple workflows, whoever called this getter first 
would get the correct flag value, and the ensuing callers would get false because 
the flag would have been reset. This RB fixes that bug by resetting the flag 
right at the beginning of the refresh() call in ClusterDataCache, which allows all 
callers during that pipeline to get the same, correct value.
Changelist:
1. Change the getter so that it does not reset the flag; instead, reset the 
flag in the beginning of refresh()
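
The change described above can be sketched as follows. This is a simplified stand-in, not the real ClusterDataCache: the class name, notifyChange(), and the pendingChange field are hypothetical. What it shows is that the getter has no side effect, so every caller within one pipeline run sees the same value.

```java
// Hypothetical stand-in for the flag handling; not the real ClusterDataCache.
final class ChangeFlagCache {
  private boolean pendingChange;  // set by change listeners between refreshes
  private boolean existsLiveInstanceOrCurrentStateChange;

  // Assumed hook: called when a live instance or current state change arrives.
  void notifyChange() {
    pendingChange = true;
  }

  // The flag is reset at the beginning of refresh(), then recomputed.
  void refresh() {
    existsLiveInstanceOrCurrentStateChange = false;
    if (pendingChange) {
      existsLiveInstanceOrCurrentStateChange = true;
      pendingChange = false;
    }
  }

  // Getter does not reset the flag, so repeated calls agree.
  boolean getExistsLiveInstanceOrCurrentStateChange() {
    return existsLiveInstanceOrCurrentStateChange;
  }
}
```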

commit e9f6c98dc58cb7bb842ff1f41063174003277823
Author: Hunter Lee 
Date:   2018-09-25T17:22:36Z

[HELIX-755] TASK: Build quota profile from scratch every rebalance

It has been reported that instances have a full quota despite no tasks 
existing in their CURRENTSTATES. The cause of this is not clear, so making 
ClusterDataCache trigger a refresh of all AssignableInstances will ensure that 
there aren't situations where it looks like there has been a thread leak. 
Optimizations will be implemented if necessary.
Changelist:
1. Make AssignableInstanceManager build all AssignableInstances from scratch 
every rebalance

commit 4aaa00727fe34b9cdfde2978eeb8a892dcf29add
Author: Hunter Lee 
Date:   2018-09-25T17:25:39Z

[HELIX-756] TASK: Change LOG mode from info to debug

In production, it was observed that some users were running thousands of 
tasks, and since AssignableInstance writes a log line for each task assigned 
or released, the volume of log output was excessive.
Changelist:
1. Change the logging mode from info to debug in AssignableInstance and 
AssignableInstanceManager




> TASK: Change LOG mode from info to debug
> 
>
> Key: HELIX-756
> URL: https://issues.apache.org/jira/browse/HELIX-756
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Assignee: Hunter L
>Priority: Major
>
> In production, it was observed that some users were running thousands of 
> tasks, and since AssignableInstance writes a log line for each task 
> assigned or released, the volume of log output was excessive.
> Changelist:
> 1. Change the logging mode from info to debug in AssignableInstance and 
> AssignableInstanceManager





[jira] [Commented] (HELIX-753) Record top state handoff finished in single cluster data cache refresh

2018-09-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624211#comment-16624211
 ] 

ASF GitHub Bot commented on HELIX-753:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/270

[HELIX-753] Record top state handoff finished in single cluster data cache 
refresh

This PR adds top state handoff reporting when a single pipeline refresh 
catches the entire handoff process, which we missed before. Here is the rough 
procedure:


- retrieve cached last top state instance for a partition
- retrieve current top state instance for a partition
- if there is no missing top state record of that partition, and top state 
instance changed, we record the number

The current top state's end time is easy to find from the current state in the 
cluster data cache. For the handoff start time, if we cannot find it, we use the 
last pipeline run's end time as a best guess. The detailed reasoning is explained 
in a code comment.


Added a test case to verify such top state handoffs, and consolidated the common 
parts of TestTopStateHandoffMetrics to avoid code duplication
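
The procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: the class, method names, and parameters are hypothetical, timestamps are plain millisecond values, and a negative start time stands for "start time could not be found".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and signatures are illustrative, not Helix APIs.
final class SingleRefreshHandoffTracker {
  private final Map<String, String> lastTopStateInstance = new HashMap<>();
  private long lastPipelineEndTime = -1L;

  // Call at the end of every pipeline run.
  void finishPipeline(long nowMs) {
    lastPipelineEndTime = nowMs;
  }

  // Returns the handoff duration in ms, or -1 if nothing should be reported.
  // handoffStartMs < 0 means the start time could not be found.
  long onRefresh(String partition, String currentTopInstance, long topStateEndMs,
      long handoffStartMs, boolean hasMissingTopStateRecord) {
    // Retrieve the cached last top state instance and cache the current one.
    String previous = lastTopStateInstance.put(partition, currentTopInstance);
    if (hasMissingTopStateRecord || previous == null || previous.equals(currentTopInstance)) {
      return -1L; // covered by the multi-run path, first sighting, or no change
    }
    // Best guess for the start time: the end of the last pipeline run.
    long start = handoffStartMs >= 0 ? handoffStartMs : lastPipelineEndTime;
    return start < 0 ? -1L : topStateEndMs - start;
  }
}
```

With the fallback, a handoff that starts and finishes between two refreshes is reported with a duration that can only overestimate, never drop, the data point.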

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/topstate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/270.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #270


commit d501e8fa30596d9cd98078f0d1ce7c1ecf20c595
Author: Harry Zhang 
Date:   2018-09-21T21:32:15Z

[HELIX-753] Record top state handoff finished in single cluster data cache 
refresh




> Record top state handoff finished in single cluster data cache refresh
> --
>
> Key: HELIX-753
> URL: https://issues.apache.org/jira/browse/HELIX-753
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Harry Zhang
>Assignee: Harry Zhang
>Priority: Major
>
> Currently we are calculating top state handoff duration by doing the 
> following:
>  - record missing top state when we see a top state missing
>  - record top state come back when we see it come back
>  - report top state handoff duration
> This is perfectly fine for non-P2P state transitions, as the entire top state 
> handoff process always takes >= 2 pipeline runs. However, for P2P 
> enabled clusters, top state handoffs are quick, and if a handoff is quicker than 
> the cluster data refresh stage latency, we miss it. We lose a lot of short top 
> state handoffs this way, which makes the numbers look miserable on ingraph.
> We need to revise the top state handoff metrics implementation so we don't lose 
> data points statistically (i.e. we are losing all short handoffs now).
> AC:
>  - revise impl so we catch those short top state hand-offs
>  - write new tests to catch the fix if needed





[jira] [Commented] (HELIX-751) TASK: Fix AssignableInstanceComparator so that it sorts unsupported quota types

2018-09-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624157#comment-16624157
 ] 

ASF GitHub Bot commented on HELIX-751:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/268


> TASK: Fix AssignableInstanceComparator so that it sorts unsupported quota 
> types
> ---
>
> Key: HELIX-751
> URL: https://issues.apache.org/jira/browse/HELIX-751
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Assignee: Hunter L
>Priority: Major
>
> Currently, if the quota type does not exist, the comparator does not sort 
> AssignableInstances based on availability. This does not cause immediate 
> problems, but it would be nice to have them sorted because we now allow 
> unsupported quota types to run as the DEFAULT type.
> Changelist:
> 1. Comparator sorts AssignableInstances in a PriorityQueue by DEFAULT type's 
> availability when the quota type given is unsupported
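
A comparator with this fallback could be sketched as below. The Instance class here is a minimal hypothetical stand-in for AssignableInstance (a name plus free threads per quota type), and checking support per instance is an assumption; the real comparator may decide this differently.

```java
import java.util.Comparator;
import java.util.Map;

// Hypothetical sketch; not the real AssignableInstanceComparator.
final class QuotaAwareOrdering {
  static final String DEFAULT_TYPE = "DEFAULT";

  // Minimal stand-in for AssignableInstance: free threads per quota type.
  static final class Instance {
    final String name;
    final Map<String, Integer> freeThreads;

    Instance(String name, Map<String, Integer> freeThreads) {
      this.name = name;
      this.freeThreads = freeThreads;
    }

    // Falls back to the DEFAULT type when the quota type is unsupported.
    int available(String quotaType) {
      String effective = freeThreads.containsKey(quotaType) ? quotaType : DEFAULT_TYPE;
      Integer free = freeThreads.get(effective);
      return free == null ? 0 : free;
    }
  }

  // Orders instances with the most available threads first; ties by name.
  static Comparator<Instance> byAvailability(final String quotaType) {
    return (a, b) -> {
      int byFree = Integer.compare(b.available(quotaType), a.available(quotaType));
      return byFree != 0 ? byFree : a.name.compareTo(b.name);
    };
  }
}
```

Passing this comparator to a java.util.PriorityQueue makes poll() return the instance with the most DEFAULT capacity when the requested type is unknown.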





[jira] [Commented] (HELIX-752) Add missing shutdown for RoutingTableProvider

2018-09-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620953#comment-16620953
 ] 

ASF GitHub Bot commented on HELIX-752:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/269

[HELIX-752] Add missing shutdown for RoutingTableProvider

Changelist:
1. Add a missing shutdown() call to avoid having a background thread keep 
printing out error messages

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix e8e5770

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/269.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #269


commit 355e4d4e1e678d13793f39d044a51e7be76abed8
Author: Hunter Lee 
Date:   2018-09-19T17:50:48Z

[HELIX-752] Add missing shutdown for RoutingTableProvider

Changelist:
1. Add a missing shutdown() call to avoid having a background thread keep 
printing out error messages




> Add missing shutdown for RoutingTableProvider
> -
>
> Key: HELIX-752
> URL: https://issues.apache.org/jira/browse/HELIX-752
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Assignee: Hunter L
>Priority: Major
>
> Changelist:
> 1. Add a missing shutdown() call to avoid having a background thread keep 
> printing out error messages





[jira] [Commented] (HELIX-741) Revise unreliable behavior in swapInstance

2018-07-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553439#comment-16553439
 ] 

ASF GitHub Bot commented on HELIX-741:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/258


> Revise unreliable behavior in swapInstance
> --
>
> Key: HELIX-741
> URL: https://issues.apache.org/jira/browse/HELIX-741
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> swapInstance call did not work properly when we were trying to fix a 
> production issue.
>  
> The API was old and not actively maintained. It used a deprecated underlying 
> data accessor API and hit a partial ZK read problem. Thus our CLI was 
> unable to update all IdealStates as expected.
> We have seen this problem before, especially when the cluster is big and 
> there is a lot of data to read back.
> This ticket is created to refactor the implementation of swapInstance() to 
> make it more robust. A separate ticket will be created to revise those old 
> API calls that are neither frequently used nor actively maintained.
> AC:
>  - make this api call reliable and idempotent





[jira] [Commented] (HELIX-741) Revise unreliable behavior in swapInstance

2018-07-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547270#comment-16547270
 ] 

ASF GitHub Bot commented on HELIX-741:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/258

[HELIX-741] make swap instance more robust and idempotent

Made swap instance more robust:
1. List ideal state names and read ideal state individually to avoid 
partial read
2. remove redundant logic that tests old instance status
3. make it idempotent
4. added test cases
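
Items 1 and 3 can be illustrated with the sketch below. It uses a hypothetical in-memory map in place of ZooKeeper, and none of the names are Helix APIs. Each ideal state is read individually by name, and the swap is a pure replace, so running it a second time changes nothing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; an in-memory map stands in for the ZK-backed store.
final class SwapInstanceSketch {
  // ideal state name -> (partition -> ordered list of instance names)
  private final Map<String, Map<String, List<String>>> idealStates = new TreeMap<>();

  void put(String resource, Map<String, List<String>> partitions) {
    Map<String, List<String>> copy = new TreeMap<>();
    for (Map.Entry<String, List<String>> e : partitions.entrySet()) {
      copy.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
    idealStates.put(resource, copy);
  }

  List<String> listIdealStateNames() {
    return new ArrayList<>(idealStates.keySet());
  }

  Map<String, List<String>> readIdealState(String name) {
    return idealStates.get(name);
  }

  // Returns the number of ideal states that were changed.
  int swapInstance(String oldInstance, String newInstance) {
    int changed = 0;
    for (String name : listIdealStateNames()) {
      // One read per ideal state instead of one bulk read.
      Map<String, List<String>> idealState = readIdealState(name);
      if (idealState == null) {
        continue; // removed between listing and reading
      }
      boolean touched = false;
      for (List<String> replicas : idealState.values()) {
        if (Collections.replaceAll(replicas, oldInstance, newInstance)) {
          touched = true;
        }
      }
      if (touched) {
        changed++;
      }
    }
    return changed;
  }
}
```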

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-admin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #258


commit 24c52394dfff91c045367260c969f76560ebeb62
Author: Harry Zhang 
Date:   2018-07-18T01:21:48Z

[HELIX-741] make swap instance more robust and idempotent




> Revise unreliable behavior in swapInstance
> --
>
> Key: HELIX-741
> URL: https://issues.apache.org/jira/browse/HELIX-741
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> swapInstance call did not work properly when we were trying to fix a 
> production issue.
>  
> The API was old and not actively maintained. It used a deprecated underlying 
> data accessor API and hit a partial ZK read problem. Thus our CLI was 
> unable to update all IdealStates as expected.
> We have seen this problem before, especially when the cluster is big and 
> there is a lot of data to read back.
> This ticket is created to refactor the implementation of swapInstance() to 
> make it more robust. A separate ticket will be created to revise those old 
> API calls that are neither frequently used nor actively maintained.
> AC:
>  - make this api call reliable and idempotent





[jira] [Commented] (HELIX-740) ZkHelixAdmin:NPE

2018-07-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547268#comment-16547268
 ] 

ASF GitHub Bot commented on HELIX-740:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/257


> ZkHelixAdmin:NPE
> 
>
> Key: HELIX-740
> URL: https://issues.apache.org/jira/browse/HELIX-740
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> The NPE occurs in this line: 
> [https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java#L669]
> Basically, we ended up in a situation where we had an instance whose config 
> was deleted. The line above should handle this more gracefully; we need more 
> meaningful error information.





[jira] [Commented] (HELIX-740) ZkHelixAdmin:NPE

2018-07-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547184#comment-16547184
 ] 

ASF GitHub Bot commented on HELIX-740:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/257

[HELIX-740] check NPE in getInstancesInClusterWithTag and throw more 
meaningful exception

Added a cluster config check in `getInstancesInClusterWithTag()`, which now 
throws an IllegalStateException when an instance config is missing
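
A minimal sketch of this kind of check, assuming a plain map from instance name to tags in place of the real InstanceConfig lookup (the class, method, and message text are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the real ZKHelixAdmin implementation.
final class TagLookupSketch {
  // instanceConfigs maps instance name -> tags; a missing entry models a deleted config.
  static List<String> getInstancesWithTag(List<String> instances,
      Map<String, Set<String>> instanceConfigs, String tag) {
    List<String> result = new ArrayList<>();
    for (String name : instances) {
      Set<String> tags = instanceConfigs.get(name);
      if (tags == null) {
        // Fail with a descriptive message instead of an NPE further down.
        throw new IllegalStateException("Instance config is missing for instance: " + name);
      }
      if (tags.contains(tag)) {
        result.add(name);
      }
    }
    return result;
  }
}
```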

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-admin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #257


commit f4bb7d60782150c7d713c907211cc9d41f002c48
Author: Harry Zhang 
Date:   2018-07-17T22:50:02Z

[HELIX-740] check NPE in getInstancesInClusterWithTag and throw more 
meaningful exception




> ZkHelixAdmin:NPE
> 
>
> Key: HELIX-740
> URL: https://issues.apache.org/jira/browse/HELIX-740
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> The NPE occurs in this line: 
> [https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java#L669]
> Basically, we ended up in a situation where we had an instance whose config 
> was deleted. The line above should handle this more gracefully; we need more 
> meaningful error information.





[jira] [Commented] (HELIX-732) [TASK] Expose UserContentStore in TaskDriver

2018-07-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545781#comment-16545781
 ] 

ASF GitHub Bot commented on HELIX-732:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/246


> [TASK] Expose UserContentStore in TaskDriver
> 
>
> Key: HELIX-732
> URL: https://issues.apache.org/jira/browse/HELIX-732
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> There was a user request for this feature. The intended use is to allow for 
> aggregation work reading from temporary data written by tasks, by allowing a 
> get() of UserContentStore at the TaskDriver level. UserContentStore is a 
> potentially useful feature that is currently under-utilized - this will 
> enable Gobblin and other users of Task Framework to better utilize 
> UserContentStore.
> Changelist:
> 1. Add getUserContentStore() in TaskDriver
> 2. Add TestUserContentStore, an integration test for this feature
> 3. Add descriptive JavaDoc warning the user that get() and put() methods for 
> UserContentStore are not thread-safe





[jira] [Commented] (HELIX-732) [TASK] Expose UserContentStore in TaskDriver

2018-07-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545641#comment-16545641
 ] 

ASF GitHub Bot commented on HELIX-732:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/246

[HELIX-732] Expose UserContentStore in TaskDriver

There was a user request for this feature. The intended use is to allow for 
aggregation work reading from temporary data written by tasks, by allowing a 
get() of UserContentStore at the TaskDriver level. UserContentStore is a 
potentially useful feature that is currently under-utilized - this will enable 
Gobblin and other users of Task Framework to better utilize UserContentStore.

Changelist:
1. Add getUserContentStore() in TaskDriver
2. Add TestUserContentStore, an integration test for this feature
3. Add descriptive JavaDoc warning the user that get() and put() methods 
for UserContentStore are not thread-safe

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #246


commit cdf91ecdc63b80254b6775f7f12f98440081473a
Author: Hunter Lee 
Date:   2018-07-16T19:18:03Z

[HELIX-732] Expose UserContentStore in TaskDriver

There was a user request for this feature. The intended use is to allow for 
aggregation work reading from temporary data written by tasks, by allowing a 
get() of UserContentStore at the TaskDriver level. UserContentStore is a 
potentially useful feature that is currently under-utilized - this will enable 
Gobblin and other users of Task Framework to better utilize UserContentStore.

Changelist:
1. Add getUserContentStore() in TaskDriver
2. Add TestUserContentStore, an integration test for this feature
3. Add descriptive JavaDoc warning the user that get() and put() methods 
for UserContentStore are not thread-safe




> [TASK] Expose UserContentStore in TaskDriver
> 
>
> Key: HELIX-732
> URL: https://issues.apache.org/jira/browse/HELIX-732
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> There was a user request for this feature. The intended use is to allow for 
> aggregation work reading from temporary data written by tasks, by allowing a 
> get() of UserContentStore at the TaskDriver level. UserContentStore is a 
> potentially useful feature that is currently under-utilized - this will 
> enable Gobblin and other users of Task Framework to better utilize 
> UserContentStore.
> Changelist:
> 1. Add getUserContentStore() in TaskDriver
> 2. Add TestUserContentStore, an integration test for this feature
> 3. Add descriptive JavaDoc warning the user that get() and put() methods for 
> UserContentStore are not thread-safe





[jira] [Commented] (HELIX-719) [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539236#comment-16539236
 ] 

ASF GitHub Bot commented on HELIX-719:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/226


> [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle
> --
>
> Key: HELIX-719
> URL: https://issues.apache.org/jira/browse/HELIX-719
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> TestPartitionMovementThrottle was failing after the improvement was made in 
> IntermediateCalcStage so that downward load balance will take place while 
> recovery balance is happening. In the process of fixing the test,
> 1. It was verified by hand that downward load balance is being correctly 
> throttled as defined by the user in StateTransitionThrottleConfig.
> 2. An appropriate parameter adjustment was made to account for both 
> recovery and load balance happening in the same pipeline iteration.





[jira] [Commented] (HELIX-726) Add new monitor metrics for state transitions.

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539070#comment-16539070
 ] 

ASF GitHub Bot commented on HELIX-726:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/233


> Add new monitor metrics for state transitions.
> --
>
> Key: HELIX-726
> URL: https://issues.apache.org/jira/browse/HELIX-726
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Jiajun Wang
>Assignee: Jiajun Wang
>Priority: Major
>
> ClusterStatus: MissingMinActiveReplicaPartitionGauge
> ClusterStatus: TotalResourceGauge
> ClusterStatus/ResourceStatus: PendingStateTransitionsGauge
> ClusterStatus: StateTransitionsCounter





[jira] [Commented] (HELIX-726) Add new monitor metrics for state transitions.

2018-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539064#comment-16539064
 ] 

ASF GitHub Bot commented on HELIX-726:
--

GitHub user jiajunwang opened a pull request:

https://github.com/apache/helix/pull/233

[HELIX-726][HELIX-727] Helix monitor metrics improvement

Contains 2 changes that improve Helix monitoring.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jiajunwang/helix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #233


commit 505524cf0484dfe83433862e8f59345899d2
Author: Jiajun Wang 
Date:   2018-05-26T06:30:11Z

Add new monitor metrics for state transitions.

ClusterStatus: MissingMinActiveReplicaPartitionGauge
ClusterStatus: TotalResourceGauge
ClusterStatus/ResourceStatus: PendingStateTransitionsGauge
ClusterStatus: StateTransitionsCounter

commit 93eea714d2a9607c1808a342957e59698a96f543
Author: Jiajun Wang 
Date:   2018-05-30T00:09:28Z

Fix resource monitor race condition.

The async monitor processing may cause resource MBean deletion to fail. 
This leaves unnecessary MBeans in the MBean server.




> Add new monitor metrics for state transitions.
> --
>
> Key: HELIX-726
> URL: https://issues.apache.org/jira/browse/HELIX-726
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Jiajun Wang
>Assignee: Jiajun Wang
>Priority: Major
>
> ClusterStatus: MissingMinActiveReplicaPartitionGauge
> ClusterStatus: TotalResourceGauge
> ClusterStatus/ResourceStatus: PendingStateTransitionsGauge
> ClusterStatus: StateTransitionsCounter





[jira] [Commented] (HELIX-719) [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537806#comment-16537806
 ] 

ASF GitHub Bot commented on HELIX-719:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/226

[HELIX-719] [HELIX] Verify downward load balance and fix 
TestPartitionMovementThrottle

TestPartitionMovementThrottle was failing after the improvement was made in 
IntermediateCalcStage so that downward load balance will take place while 
recovery balance is happening. In the process of fixing the test,
1. It was verified by hand that downward load balance is being correctly 
throttled as defined by the user in StateTransitionThrottleConfig.
2. An appropriate parameter adjustment was made to account for both 
recovery and load balance happening in the same pipeline iteration.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix d

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/226.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #226


commit 13c39dac7aca7920f75d62a6c19f03c319d8e6cf
Author: Hunter Lee 
Date:   2018-07-09T23:56:08Z

[HELIX-719] [HELIX] Verify downward load balance and fix 
TestPartitionMovementThrottle

TestPartitionMovementThrottle was failing after the improvement was made in 
IntermediateCalcStage so that downward load balance will take place while 
recovery balance is happening. In the process of fixing the test,
1. It was verified by hand that downward load balance is being correctly 
throttled as defined by the user in StateTransitionThrottleConfig.
2. An appropriate parameter adjustment was made to account for both 
recovery and load balance happening in the same pipeline iteration.




> [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle
> --
>
> Key: HELIX-719
> URL: https://issues.apache.org/jira/browse/HELIX-719
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> TestPartitionMovementThrottle was failing after the improvement was made in 
> IntermediateCalcStage so that downward load balance will take place while 
> recovery balance is happening. In the process of fixing the test,
> 1. It was verified by hand that downward load balance is being correctly 
> throttled as defined by the user in StateTransitionThrottleConfig.
> 2. An appropriate parameter adjustment was made to account for both 
> recovery and load balance happening in the same pipeline iteration.





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537787#comment-16537787
 ] 

ASF GitHub Bot commented on HELIX-718:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/224


> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537784#comment-16537784
 ] 

ASF GitHub Bot commented on HELIX-718:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/223


> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537761#comment-16537761
 ] 

ASF GitHub Bot commented on HELIX-718:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/224

[HELIX-718] implement ThreadCountBasedTaskAssigner

In this RB, I implemented a thread count based task assigner that is 
optimized for short-term use cases. It assumes:
- All tasks to assign have the same quota type
- All tasks to assign require only 1 thread


The algorithm makes a best effort to spread out tasks with the same type / 
same job, e.g.:
- if there are 3 nodes, each has 10 threads for each quota type A, B, and C
- node1 is empty, node2 and node3 each have 5 typeB tasks and 5 typeC tasks 
running
=> when 3 typeA tasks are to be assigned, it will assign 1 typeA task to 
each node rather than squeeze all 3 typeA tasks onto node1.


Added tests for the assigner. Below are the profiling results; each result 
is the average of 100 trials:


Assign 50K tasks onto 1K nodes:

testing batch size: 1
Average time: 118ms
testing batch size: 5000
Average time: 114ms
testing batch size: 2000
Average time: 117ms
testing batch size: 1000
Average time: 119ms
testing batch size: 500
Average time: 123ms
testing batch size: 100
Average time: 182ms



Assign 10K tasks onto 1K nodes:

testing batch size: 1
Average time: 25ms
testing batch size: 5000
Average time: 21ms
testing batch size: 2000
Average time: 22ms
testing batch size: 1000
Average time: 25ms
testing batch size: 500
Average time: 22ms
testing batch size: 100
Average time: 34ms

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/simple-assigner

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #224


commit 6cb574d5aea6ca9cb9e6b5184bc80cb5e05d53b8
Author: Harry Zhang 
Date:   2018-07-09T23:04:19Z

[HELIX-718] implement ThreadCountBasedTaskAssigner




> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537755#comment-16537755
 ] 

ASF GitHub Bot commented on HELIX-718:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/223

[HELIX-718] provide a method in AssignableInstance to set current assignment

This is required because an assignable instance needs to recover its 
current states when it is initialized

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/assignable-instance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/223.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #223


commit e44b29e03ef4c807e940cde717ed2f6fff58a273
Author: Harry Zhang 
Date:   2018-07-09T22:59:27Z

[HELIX-718] provide a method in AssignableInstance to set current 
assignments




> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537747#comment-16537747
 ] 

ASF GitHub Bot commented on HELIX-718:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/222


> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537739#comment-16537739
 ] 

ASF GitHub Bot commented on HELIX-718:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/222

[HELIX-718] implement AssignableInstance

Implement AssignableInstance and related tests as part of the task assigner

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/assignable-instance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/222.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #222


commit 2049f93abe8e56a754e4880a9157959ef24cd89e
Author: Harry Zhang 
Date:   2018-07-09T22:49:33Z

[HELIX-718] implement AssignableInstance




> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537706#comment-16537706
 ] 

ASF GitHub Bot commented on HELIX-718:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/220


> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-712) Backward compatibility of the rebalance algorithm

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537697#comment-16537697
 ] 

ASF GitHub Bot commented on HELIX-712:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/212


> Backward compatibility of the rebalance algorithm
> -
>
> Key: HELIX-712
> URL: https://issues.apache.org/jira/browse/HELIX-712
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Jiajun Wang
>Assignee: Jiajun Wang
>Priority: Major
>
> To keep CRUSHed stable, we need to split out the logic changes made for the 
> constraint-based rebalance strategy. Otherwise, some improvements will change 
> the original assignment.





[jira] [Commented] (HELIX-717) Add api for get / set quota type, ratio and participant capacity

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537690#comment-16537690
 ] 

ASF GitHub Bot commented on HELIX-717:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/219


> Add api for get / set quota type, ratio and participant capacity
> 
>
> Key: HELIX-717
> URL: https://issues.apache.org/jira/browse/HELIX-717
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> This is needed for supporting quota based task assignment





[jira] [Commented] (HELIX-718) Implement TaskAssignment logics

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537622#comment-16537622
 ] 

ASF GitHub Bot commented on HELIX-718:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/220

[HELIX-718] implement TaskAssignResult

Implement TaskAssignResult as part of the task assigner

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-assign-result

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #220


commit 701947d5a033792f21dd2796a29577702782fd26
Author: Harry Zhang 
Date:   2018-07-09T21:22:20Z

[HELIX-718] implement TaskAssignResult




> Implement TaskAssignment logics
> ---
>
> Key: HELIX-718
> URL: https://issues.apache.org/jira/browse/HELIX-718
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Implement assignment logics:
> TaskAssigner, TaskAssignResult, AssignableInstance





[jira] [Commented] (HELIX-717) Add api for get / set quota type, ratio and participant capacity

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537601#comment-16537601
 ] 

ASF GitHub Bot commented on HELIX-717:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/219

[HELIX-717] Add api for get / set quota type, ratio and participant capacity

Add api for get / set quota type, ratio and participant capacity

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-quota

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/219.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #219


commit 9ff603e9c39b53d5035cccb31fcf6edf82d97f18
Author: Harry Zhang 
Date:   2018-07-09T21:07:56Z

[HELIX-717] Add api for get / set quota type, ratio and participant capacity




> Add api for get / set quota type, ratio and participant capacity
> 
>
> Key: HELIX-717
> URL: https://issues.apache.org/jira/browse/HELIX-717
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> This is needed for supporting quota based task assignment





[jira] [Commented] (HELIX-713) Remove unused imports in TaskAssignmentCalculator

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537558#comment-16537558
 ] 

ASF GitHub Bot commented on HELIX-713:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/213


> Remove unused imports in TaskAssignmentCalculator
> -
>
> Key: HELIX-713
> URL: https://issues.apache.org/jira/browse/HELIX-713
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> Remove unused imports in TaskAssignmentCalculator





[jira] [Commented] (HELIX-709) Prepare controller stages for async execution

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537546#comment-16537546
 ] 

ASF GitHub Bot commented on HELIX-709:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/214


> Prepare controller stages for async execution
> -
>
> Key: HELIX-709
> URL: https://issues.apache.org/jira/browse/HELIX-709
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> There are a couple of stages in helix controller that can be executed 
> asynchronously, but each execution should be done in order. Currently for 
> helix controller, we have a thread pool for un-ordered execution, but we also 
> need one for ordered execution.
> This ticket should do the following:
> 1. Create a pool of configurable workers using DedupEventProcessor
> 2. Create AbstractAsyncBaseStage for those stages that can be executed 
> asynchronously to share common code
> AC:
> Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, 
> pass all tests.





[jira] [Commented] (HELIX-709) Prepare controller stages for async execution

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537428#comment-16537428
 ] 

ASF GitHub Bot commented on HELIX-709:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/214

[HELIX-709] Move external view calculation to async stage and re-organize 
pipeline

- Separated the controller pipeline to execute external view computation 
asynchronously and as early as possible
- renamed AbstractAsyncBaseStage
- fixed NPE in callback handler
- all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/async-ev

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/214.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #214


commit 542fbc840a167986a40bd57f3c5660d294acb63c
Author: Harry Zhang 
Date:   2018-07-09T19:16:56Z

[HELIX-709] Move external view calculation to async stage and re-organize 
pipeline




> Prepare controller stages for async execution
> -
>
> Key: HELIX-709
> URL: https://issues.apache.org/jira/browse/HELIX-709
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> There are a couple of stages in helix controller that can be executed 
> asynchronously, but each execution should be done in order. Currently for 
> helix controller, we have a thread pool for un-ordered execution, but we also 
> need one for ordered execution.
> This ticket should do the following:
> 1. Create a pool of configurable workers using DedupEventProcessor
> 2. Create AbstractAsyncBaseStage for those stages that can be executed 
> asynchronously to share common code
> AC:
> Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, 
> pass all tests.





[jira] [Commented] (HELIX-713) Remove unused imports in TaskAssignmentCalculator

2018-07-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537425#comment-16537425
 ] 

ASF GitHub Bot commented on HELIX-713:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/213

 [HELIX-713] Remove unused imports in TaskAssignmentCalculator


[HELIX-713] Remove unused imports in TaskAssignmentCalculator

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix 1301279

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #213


commit 7ab70a3d57454a8836cd13d8bb172e2b460474f6
Author: Hunter Lee 
Date:   2018-07-09T19:09:30Z

[HELIX-713] Remove unused imports in TaskAssignmentCalculator




> Remove unused imports in TaskAssignmentCalculator
> -
>
> Key: HELIX-713
> URL: https://issues.apache.org/jira/browse/HELIX-713
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> Remove unused imports in TaskAssignmentCalculator





[jira] [Commented] (HELIX-709) Prepare controller stages for async execution

2018-06-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526969#comment-16526969
 ] 

ASF GitHub Bot commented on HELIX-709:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/208


> Prepare controller stages for async execution
> -
>
> Key: HELIX-709
> URL: https://issues.apache.org/jira/browse/HELIX-709
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> There are a couple of stages in helix controller that can be executed 
> asynchronously, but each execution should be done in order. Currently for 
> helix controller, we have a thread pool for un-ordered execution, but we also 
> need one for ordered execution.
> This ticket should do the following:
> 1. Create a pool of configurable workers using DedupEventProcessor
> 2. Create AbstractAsyncBaseStage for those stages that can be executed 
> asynchronously to share common code
> AC:
> Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, 
> pass all tests.





[jira] [Commented] (HELIX-710) Create abstract state model for distributed leader standby helix service

2018-06-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526856#comment-16526856
 ] 

ASF GitHub Bot commented on HELIX-710:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/209


> Create abstract state model for distributed leader standby helix service
> 
>
> Key: HELIX-710
> URL: https://issues.apache.org/jira/browse/HELIX-710
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> In order to implement state model def for other helix services, I'd prefer to 
> abstract an interface that helix service would use, to avoid duplicated code.
> AC:
>  - implement AbstractHelixLeaderStandbyStateModel and implement cluster 
> controller state model with it. The abstract model can also be used by other 
> helix services





[jira] [Commented] (HELIX-710) Create abstract state model for distributed leader standby helix service

2018-06-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526817#comment-16526817
 ] 

ASF GitHub Bot commented on HELIX-710:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/209

[HELIX-710] Create abstract state model for distributed leader standby 
helix service

This RB abstracts a leader standby state model that helix services, such as 
the controller, would commonly use. This reduces duplicated code 
and simplifies state model implementation.
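
As a rough illustration of the idea (shared transition handling in an 
abstract base, service-specific behaviour in subclasses), here is a small 
Python sketch. It is not the AbstractHelixLeaderStandbyStateModel from this 
change: the class names, the hook names `on_become_leader` / 
`on_lose_leadership`, and the transition table are all hypothetical.

```python
from abc import ABC, abstractmethod


class AbstractLeaderStandbyModel(ABC):
    """Hypothetical base class holding the shared transition bookkeeping."""

    # Legal (from, to) transitions assumed for this sketch.
    TRANSITIONS = {
        ("OFFLINE", "STANDBY"),
        ("STANDBY", "LEADER"),
        ("LEADER", "STANDBY"),
        ("STANDBY", "OFFLINE"),
    }

    def __init__(self):
        self.state = "OFFLINE"

    def transition(self, to_state):
        """Validate the transition, call the matching hook, update the state."""
        if (self.state, to_state) not in self.TRANSITIONS:
            raise ValueError(f"illegal transition {self.state} -> {to_state}")
        if to_state == "LEADER":
            self.on_become_leader()
        elif self.state == "LEADER":
            self.on_lose_leadership()
        self.state = to_state

    @abstractmethod
    def on_become_leader(self):
        """Service-specific work when leadership is acquired."""

    @abstractmethod
    def on_lose_leadership(self):
        """Service-specific work when leadership is given up."""


class ControllerModel(AbstractLeaderStandbyModel):
    """Example subclass: only the service-specific hooks are written here."""

    def __init__(self):
        super().__init__()
        self.events = []

    def on_become_leader(self):
        self.events.append("start controller")

    def on_lose_leadership(self):
        self.events.append("stop controller")
```

Another service would subclass the same base and supply only its own two 
hooks, which is where the reduction in duplicated code comes from.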

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/abstract-ls-state-model

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #209


commit 4a99bc43c6f22e478a49fb7f2bbac42d608f17b5
Author: Harry Zhang 
Date:   2018-06-28T21:32:51Z

[HELIX-710] Create abstract state model for distributed leader standby 
helix service




> Create abstract state model for distributed leader standby helix service
> 
>
> Key: HELIX-710
> URL: https://issues.apache.org/jira/browse/HELIX-710
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> In order to implement state model def for other helix services, I'd prefer to 
> abstract an interface that helix service would use, to avoid duplicated code.
> AC:
>  - implement AbstractHelixLeaderStandbyStateModel and implement cluster 
> controller state model with it. The abstract model can also be used by other 
> helix services





[jira] [Commented] (HELIX-709) Prepare controller stages for async execution

2018-06-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526801#comment-16526801
 ] 

ASF GitHub Bot commented on HELIX-709:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/208

[HELIX-709] Prepare controller stages for async execution

- Implemented AbstractAsyncBaseStage
- Refactored TEVCalcState and PersistAssignmentStage to use 
AbstractAsyncBaseStage
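
For readers unfamiliar with the "asynchronous but ordered" requirement in 
the ticket, the following Python sketch shows one way such a worker could 
behave. It does not reproduce AbstractAsyncBaseStage or 
DedupEventProcessor: the class name `DedupFifoWorker`, its methods, and 
the rule that a newer job replaces a still-pending job with the same key 
are assumptions made for illustration.

```python
import threading
from collections import OrderedDict


class DedupFifoWorker:
    """Single worker thread: runs jobs in FIFO order, one pending job per key."""

    def __init__(self):
        self._pending = OrderedDict()  # key -> job, in arrival order
        self._cond = threading.Condition()
        self._closed = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, key, job):
        """Queue a job; a still-pending job with the same key is replaced."""
        with self._cond:
            # Updating an existing key keeps its original queue position.
            self._pending[key] = job
            self._cond.notify()

    def close(self):
        """Let the worker drain the remaining jobs, then stop it."""
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._pending and not self._closed:
                    self._cond.wait()
                if not self._pending:
                    return  # closed and fully drained
                _, job = self._pending.popitem(last=False)
            job()  # run outside the lock so submit() never waits on a job
```

Because there is exactly one worker thread, jobs never overlap and run in 
submission order, while the caller (the pipeline, in the ticket's terms) 
returns from `submit` immediately.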

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/aabs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #208


commit 9080c64429d724aa959207411ca06d690f5ee840
Author: Harry Zhang 
Date:   2018-06-28T21:25:21Z

[HELIX-709] Prepare controller stages for async execution




> Prepare controller stages for async execution
> -
>
> Key: HELIX-709
> URL: https://issues.apache.org/jira/browse/HELIX-709
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> There are a couple of stages in helix controller that can be executed 
> asynchronously, but each execution should be done in order. Currently for 
> helix controller, we have a thread pool for un-ordered execution, but we also 
> need one for ordered execution.
> This ticket should do the following:
> 1. Create a pool of configurable workers using DedupEventProcessor
> 2. Create AbstractAsyncBaseStage for those stages that can be executed 
> asynchronously to share common code
> AC:
> Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, 
> pass all tests.





[jira] [Commented] (HELIX-706) ExternalViewGeneration should be executed asynchronously in Helix controller

2018-06-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526787#comment-16526787
 ] 

ASF GitHub Bot commented on HELIX-706:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/206


> ExternalViewGeneration should be executed asynchronously in Helix controller
> 
>
> Key: HELIX-706
> URL: https://issues.apache.org/jira/browse/HELIX-706
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> EV generation should not block helix resource rebalance. According to our 
> profiling results, external view generation takes ~ 1/5 of the pipeline 
> latency.
> The goal is to generate external view asynchronously, and hopefully we can 
> have 20% improvement in rebalance pipeline



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HELIX-707) Fix topstate handoff metrics.

2018-06-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524390#comment-16524390
 ] 

ASF GitHub Bot commented on HELIX-707:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/207


> Fix topstate handoff metrics.
> -
>
> Key: HELIX-707
> URL: https://issues.apache.org/jira/browse/HELIX-707
> Project: Apache Helix
>  Issue Type: Bug
>Affects Versions: 0.8.x
>Reporter: Jiajun Wang
>Assignee: Jiajun Wang
>Priority: Major
>
> We've confirmed a bug in the logic that calculates topstate handoff duration.
> With this issue, if the previous master instance is offline, an older handoff 
> start time could be used to calculate the duration.
> This results in huge handoff duration in the Helix metrics.





[jira] [Commented] (HELIX-707) Fix topstate handoff metrics.

2018-06-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524340#comment-16524340
 ] 

ASF GitHub Bot commented on HELIX-707:
--

GitHub user jiajunwang opened a pull request:

https://github.com/apache/helix/pull/207

[HELIX-707] Fix topstate handoff metrics.

We've confirmed a bug in the logic that calculates topstate handoff 
duration.
With this issue, if the previous master instance is offline, an older 
handoff start time could be used to calculate the duration.
This results in a huge handoff duration in the Helix metrics.
This change will fix this bug. If the previous node that holds the topstate 
replica goes offline, the offline time will be used as the start time.
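
The rule described above can be written down as a tiny Python sketch. The 
function and parameter names are hypothetical and the timestamps are plain 
millisecond integers; the actual change lives in the Helix controller's 
metrics code and is not reproduced here.

```python
def handoff_duration_ms(handoff_end_ms, recorded_start_ms, prev_owner_offline_ms=None):
    """Return the top-state handoff duration in milliseconds.

    If the previous top-state holder went offline, measure from the moment it
    went offline rather than from an older recorded handoff start time.
    """
    start_ms = recorded_start_ms
    if prev_owner_offline_ms is not None:
        start_ms = prev_owner_offline_ms
    return handoff_end_ms - start_ms


# Previous owner went offline at t=10000; a stale start time of t=1000 is ignored.
with_offline = handoff_duration_ms(10_500, 1_000, prev_owner_offline_ms=10_000)  # 500
# Previous owner stayed online: the recorded start time is used as-is.
without_offline = handoff_duration_ms(10_500, 10_200)  # 300
```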

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jiajunwang/helix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #207


commit 7753b602cee6d08a8326a68f899cb089378aae9f
Author: Jiajun Wang 
Date:   2018-04-27T17:56:43Z

Fix topstate handoff metrics.

We've confirmed a bug in the logic that calculates topstate handoff 
duration.
With this issue, if the previous master instance is offline, an older 
handoff start time could be used to calculate the duration.
This results in a huge handoff duration in the Helix metrics.
This change will fix this bug. If the previous node that holds the topstate 
replica goes offline, the offline time will be used as the start time.

RB=1295351
G=helix-reviewers
A=lxia,hrzhang




> Fix topstate handoff metrics.
> -
>
> Key: HELIX-707
> URL: https://issues.apache.org/jira/browse/HELIX-707
> Project: Apache Helix
>  Issue Type: Bug
>Affects Versions: 0.8.x
>Reporter: Jiajun Wang
>Assignee: Jiajun Wang
>Priority: Major
>
> We've confirmed a bug in the logic that calculates topstate handoff duration.
> With this issue, if the previous master instance is offline, an older handoff 
> start time could be used to calculate the duration.
> This results in huge handoff duration in the Helix metrics.





[jira] [Commented] (HELIX-706) ExternalViewGeneration should be executed asynchronously in Helix controller

2018-06-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524313#comment-16524313
 ] 

ASF GitHub Bot commented on HELIX-706:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/206

[HELIX-706] process tev and persist assignment asynchronously

Added an async worker in the generic helix controller to process the persist 
assignment stage and the tev generation stage asynchronously

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/async-ev

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #206


commit bb4ffd7e5663377427a5ad5988948659dd0db378
Author: Harry Zhang 
Date:   2018-06-26T23:05:50Z

[HELIX-706] process tev and persist assignment asynchronously




> ExternalViewGeneration should be executed asynchronously in Helix controller
> 
>
> Key: HELIX-706
> URL: https://issues.apache.org/jira/browse/HELIX-706
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> EV generation should not block helix resource rebalance. According to our 
> profiling results, external view generation takes ~ 1/5 of the pipeline 
> latency.
> The goal is to generate external view asynchronously, and hopefully we can 
> have 20% improvement in rebalance pipeline





[jira] [Commented] (HELIX-705) Participant duplicated state transition handling rework

2018-06-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524288#comment-16524288
 ] 

ASF GitHub Bot commented on HELIX-705:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/204


> Participant duplicated state transition handling rework
> ---
>
> Key: HELIX-705
> URL: https://issues.apache.org/jira/browse/HELIX-705
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Helix should rework message handling on the participant side:
>  - Duplicated message in same batch: discard the later one
>  - Duplicated message in different batches: discard the later one 
> if the first one is in progress
>  - During state transition, we should not rely on current state delta to get 
> partition's current state, but should lock on state model def (thread safety)
>  - Duplicated state transition (toState == currentState) should not result in 
> an error, which is confusing, but should report success





[jira] [Commented] (HELIX-703) Change print statement to log statement

2018-06-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524264#comment-16524264
 ] 

ASF GitHub Bot commented on HELIX-703:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/203


> Change print statement to log statement
> ---
>
> Key: HELIX-703
> URL: https://issues.apache.org/jira/browse/HELIX-703
> Project: Apache Helix
>  Issue Type: Improvement
>  Components: helix-core
>Reporter: Hunter L
>Priority: Major
>
> Change print statement to log statement





[jira] [Commented] (HELIX-705) Participant duplicated state transition handling rework

2018-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522917#comment-16522917
 ] 

ASF GitHub Bot commented on HELIX-705:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/204

[HELIX-705]: Participant duplicated state transition handling rework

Re-implemented helix task executor state transition message dedup logic, 
and added tests for verifying it:

- Duplicated message in same batch: discard the later one
- Duplicated message in different batches: discard the later one if the 
first one is in progress
- During state transition, we should not rely on current state delta to get 
partition's current state, but should lock on state model def (thread safety)
- Duplicated state transition (toState == currentState) should not result 
in an error, which is confusing, but should report success
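
The three dedup rules in the list above (same batch, different batches, 
and toState == currentState) can be sketched in Python as below. This is a 
simplified model rather than the task executor code in the pull request: 
the function name `filter_batch`, the representation of a message as a 
`(partition, to_state)` tuple, and treating two messages as duplicates 
when that tuple matches are assumptions. The locking rule is not modelled.

```python
def filter_batch(batch, in_progress, current_state):
    """Classify the state transition messages of one batch.

    batch: list of (partition, to_state) tuples in arrival order.
    in_progress: set of (partition, to_state) transitions already running.
    current_state: dict partition -> current state.
    Returns (to_execute, discarded, already_done).
    """
    to_execute, discarded, already_done = [], [], []
    seen = set()
    for msg in batch:
        partition, to_state = msg
        if msg in seen or msg in in_progress:
            # Duplicate within this batch, or of a transition still running.
            discarded.append(msg)
        elif current_state.get(partition) == to_state:
            # No-op transition: report success instead of an error.
            already_done.append(msg)
        else:
            to_execute.append(msg)
        seen.add(msg)
    return to_execute, discarded, already_done


batch = [("p0", "MASTER"), ("p0", "MASTER"), ("p1", "SLAVE"), ("p2", "MASTER")]
running = {("p1", "SLAVE")}
states = {"p0": "SLAVE", "p1": "OFFLINE", "p2": "MASTER"}
to_execute, discarded, already_done = filter_batch(batch, running, states)
```

In this run only the first `("p0", "MASTER")` message is executed; the 
second copy and the already-running `p1` transition are discarded, and the 
`p2` message is reported as already done.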

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/participant-st-dedup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #204


commit 04f1ba9701ccfb4c55d44ab4bc159577c3afd68b
Author: Harry Zhang 
Date:   2018-06-25T22:55:14Z

[HELIX-705]: Participant duplicated state transition handling rework




> Participant duplicated state transition handling rework
> ---
>
> Key: HELIX-705
> URL: https://issues.apache.org/jira/browse/HELIX-705
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Helix should rework participant-side message handling:
>  - Duplicated messages in the same batch: discard the later one
>  - Duplicated messages in different batches: discard the later one if the 
> first one is still in progress
>  - During a state transition, do not rely on the current state delta to get 
> the partition's current state; lock on the state model def instead (thread safety)
>  - A duplicated state transition (toState == currentState) should not result 
> in an error, which is confusing, but should report success





[jira] [Commented] (HELIX-703) Change print statement to log statement

2018-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HELIX-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522846#comment-16522846
 ] 

ASF GitHub Bot commented on HELIX-703:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/203

[HELIX-703] Change print statement to log statement



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/203.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #203


commit 0d77cbafc6534dd7b0e9867b1dbf8a2266fd2281
Author: Hunter Lee 
Date:   2018-06-25T21:31:00Z

[HELIX-703] Change print statement to log statement




> Change print statement to log statement
> ---
>
> Key: HELIX-703
> URL: https://issues.apache.org/jira/browse/HELIX-703
> Project: Apache Helix
>  Issue Type: Improvement
>  Components: helix-core
>Reporter: Hunter L
>Priority: Major
>
> Change print statement to log statement





[jira] [Commented] (HELIX-701) Potential ugly NPE

2018-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471422#comment-16471422
 ] 

ASF GitHub Bot commented on HELIX-701:
--

Github user brettKK commented on the issue:

https://github.com/apache/helix/pull/200
  
@lujiefsi, https://issues.apache.org/jira/browse/HELIX-701 has been 
automatically associated with this PR.


> Potential ugly NPE
> --
>
> Key: HELIX-701
> URL: https://issues.apache.org/jira/browse/HELIX-701
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: brettkk
>Priority: Major
>
> We have developed a static analysis tool, 
> [NPEDetector|https://github.com/lujiefsi/NPEDetector], to find potential 
> NPEs. Our analysis shows that some callees may return null in corner cases 
> (e.g. node crash, IOException); some of their callers have a _!=null_ check 
> but others do not. In this issue we post a patch that adds !=null checks 
> modeled on the existing !=null checks. For example, 
> ZkGrep#parseZkSnapshot:
> {code:java}
>   return retFiles;
> } catch (Exception e) {
>   LOG.error("fail to parse zkSnapshot: " + lastZkSnapshot, e);
> }
> return null;{code}
> So parseZkSnapshot returns null when an IOException occurs, but its caller, 
> ZkGrep#processCommandLineArgs, has no null check:
> {code:java}
> File[] lastZkSnapshot = parseZkSnapshot(zkDataDirs[1], byTime);
> // lastZkSnapshot[1] is the parsed last snapshot by byTime
> grepZkSnapshot(lastZkSnapshot[1], patterns);
> {code}
> We should terminate the process when lastZkSnapshot == null.
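
A minimal sketch of the kind of guard the issue asks for. SnapshotGuard and pickSnapshot are hypothetical names for illustration and are not part of ZkGrep. Because the caller reads element [1], the sketch also checks the array length, which is an assumption beyond the plain null check described in the issue.

```java
import java.io.File;

// Hypothetical helper; illustrates guarding the result of a parse step that may return null.
final class SnapshotGuard {
  // Returns the snapshot file at index 1, or null when the parse result is unusable.
  static File pickSnapshot(File[] parsed) {
    if (parsed == null || parsed.length < 2 || parsed[1] == null) {
      return null;
    }
    return parsed[1];
  }
}
```

A caller would stop processing (for example, log an error and exit) when pickSnapshot returns null instead of dereferencing the array.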





[jira] [Commented] (HELIX-701) Potential ugly NPE

2018-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470359#comment-16470359
 ] 

ASF GitHub Bot commented on HELIX-701:
--

Github user lujiefsi commented on the issue:

https://github.com/apache/helix/pull/200
  
We should combine https://issues.apache.org/jira/browse/HELIX-701 with this 
pull request.


> Potential ugly NPE
> --
>
> Key: HELIX-701
> URL: https://issues.apache.org/jira/browse/HELIX-701
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: brettkk
>Priority: Major
>
> We have developed a static analysis tool, 
> [NPEDetector|https://github.com/lujiefsi/NPEDetector], to find potential 
> NPEs. Our analysis shows that some callees may return null in corner cases 
> (e.g. node crash, IOException); some of their callers have a _!=null_ check 
> but others do not. In this issue we post a patch that adds !=null checks 
> modeled on the existing !=null checks. For example, 
> ZkGrep#parseZkSnapshot:
> {code:java}
>   return retFiles;
> } catch (Exception e) {
>   LOG.error("fail to parse zkSnapshot: " + lastZkSnapshot, e);
> }
> return null;{code}
> So parseZkSnapshot returns null when an IOException occurs, but its caller, 
> ZkGrep#processCommandLineArgs, has no null check:
> {code:java}
> File[] lastZkSnapshot = parseZkSnapshot(zkDataDirs[1], byTime);
> // lastZkSnapshot[1] is the parsed last snapshot by byTime
> grepZkSnapshot(lastZkSnapshot[1], patterns);
> {code}
> We should terminate the process when lastZkSnapshot == null.





[jira] [Commented] (HELIX-681) Participant should not fail state transition on fail to delete / relay message

2018-04-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451476#comment-16451476
 ] 

ASF GitHub Bot commented on HELIX-681:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/197


> Participant should not fail state transition on fail to delete / relay message
> --
>
> Key: HELIX-681
> URL: https://issues.apache.org/jira/browse/HELIX-681
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently we have a general try-catch block in HelixTask and 
> HelixTaskExecutor which fails the state transition upon any exception thrown 
> from the state transition routine. However, there are at least the following 
> cases in which the state transition should be considered successful:
>  * When we fail to delete the message after successfully handling it and 
> updating the current state: the state transition has already completed and 
> the current state is consistent between the participant and ZK.
>  * When we fail to send out a relay message: relay messages are delivered 
> only on a best-effort basis and have nothing to do with the state 
> transition's result. If relaying fails, the controller will resend the 
> message, which ensures correctness.
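
The separation described above can be sketched as follows. This is a simplified illustration, not the HelixTask code: TransitionRunner and Step are hypothetical names, and the only point is that failures in the two follow-up steps are logged but do not change the reported result.

```java
// Hypothetical sketch: only the transition itself decides success or failure.
final class TransitionRunner {
  interface Step {
    void run() throws Exception;
  }

  // Returns true when the transition succeeded, even if message deletion or relay fails.
  static boolean execute(Step transition, Step deleteMessage, Step sendRelay) {
    try {
      transition.run();
    } catch (Exception e) {
      return false; // a failure in the transition routine still fails the transition
    }
    bestEffort(deleteMessage, "delete message");
    bestEffort(sendRelay, "send relay message");
    return true;
  }

  // Runs a follow-up step and only logs its failure.
  private static void bestEffort(Step step, String what) {
    try {
      step.run();
    } catch (Exception e) {
      System.err.println("Failed to " + what + ": " + e);
    }
  }
}
```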





[jira] [Commented] (HELIX-681) Participant should not fail state transition on fail to delete / relay message

2018-04-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451423#comment-16451423
 ] 

ASF GitHub Bot commented on HELIX-681:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/197

[HELIX-681] change controller msg purge timeout to larger number

Changed message purge delay to 1min, updated tests accordingly.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/ctl-msg-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/197.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #197


commit 4e02cbb9945279b7085e5c725b9d966b90086cc7
Author: Harry Zhang 
Date:   2018-04-24T23:46:14Z

[HELIX-681] change controller msg purge timeout to larger number




> Participant should not fail state transition on fail to delete / relay message
> --
>
> Key: HELIX-681
> URL: https://issues.apache.org/jira/browse/HELIX-681
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently we have a general try-catch block in HelixTask and 
> HelixTaskExecutor which fails the state transition upon any exception thrown 
> from the state transition routine. However, there are at least the following 
> cases in which the state transition should be considered successful:
>  * When we fail to delete the message after successfully handling it and 
> updating the current state: the state transition has already completed and 
> the current state is consistent between the participant and ZK.
>  * When we fail to send out a relay message: relay messages are delivered 
> only on a best-effort basis and have nothing to do with the state 
> transition's result. If relaying fails, the controller will resend the 
> message, which ensures correctness.





[jira] [Commented] (HELIX-682) Stale message should not prevent controller from rebalancing resource

2018-04-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451357#comment-16451357
 ] 

ASF GitHub Bot commented on HELIX-682:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/195


> Stale message should not prevent controller from rebalancing resource
> -
>
> Key: HELIX-682
> URL: https://issues.apache.org/jira/browse/HELIX-682
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently, during MessageGenerationPhase, we skip rebalancing when there is 
> a pending message. Although we assume that a participant deletes its messages 
> when it finishes the task, there are cases where ZK is not stable and the 
> participant fails to do so, which leaves the message undeleted and thus 
> blocks rebalancing.
> Ideally, the controller should try to delete messages as well: if the 
> partition's current state is the same as the message's toState, or a totally 
> invalid message remains, the controller should try to delete the message to 
> unblock rebalancing.





[jira] [Commented] (HELIX-682) Stale message should not prevent controller from rebalancing resource

2018-04-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451348#comment-16451348
 ] 

ASF GitHub Bot commented on HELIX-682:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/195

[HELIX-682] delete duplicated message and log error in HelixTaskExecutor on 
participant

This PR is the second part of message dedup on participant side

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/participant-msg-dedup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #195


commit 8aba9bea0734da11722fbc8cceb74f34dd6a37c6
Author: Harry Zhang 
Date:   2018-04-24T22:34:08Z

[HELIX-682] delete duplicated message and log error in HelixTaskExecutor on 
participant




> Stale message should not prevent controller from rebalancing resource
> -
>
> Key: HELIX-682
> URL: https://issues.apache.org/jira/browse/HELIX-682
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently, during MessageGenerationPhase, we skip rebalancing when there is 
> a pending message. Although we assume that a participant deletes its messages 
> when it finishes the task, there are cases where ZK is not stable and the 
> participant fails to do so, which leaves the message undeleted and thus 
> blocks rebalancing.
> Ideally, the controller should try to delete messages as well: if the 
> partition's current state is the same as the message's toState, or a totally 
> invalid message remains, the controller should try to delete the message to 
> unblock rebalancing.





[jira] [Commented] (HELIX-674) Constraint Based Resource Rebalancer

2018-04-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449076#comment-16449076
 ] 

ASF GitHub Bot commented on HELIX-674:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/145


> Constraint Based Resource Rebalancer
> 
>
> Key: HELIX-674
> URL: https://issues.apache.org/jira/browse/HELIX-674
> Project: Apache Helix
>  Issue Type: New Feature
>Reporter: Jiajun Wang
>Assignee: Jiajun Wang
>Priority: Major
> Fix For: 0.8.x
>
> Attachments: Constraint-BasedResourceRebalancing-080318-2226-240.pdf
>
>
> The Helix rebalancer assigns resources according to different strategies. 
> Recently, we optimized the strategy for evenness and minimal movement. 
> However, evenness here applies only to partition counts. Moreover, we have 
> received more requests from our users for a customizable rebalancer.
> Take partition weight as an example:
> In reality, partition replicas have different sizes. We use "partition weight" 
> as an abstraction of partition size. It can be network traffic usage, 
> disk usage, or any other combination of factors.
> Given that each partition may have a different weight, Helix should be able 
> to assign partitions accordingly, so that the distribution is even with 
> respect to weight.
> In this project, we are planning a new rebalancer mechanism that generates 
> resource partition assignments according to a list of "constraints". The 
> current rebalance strategy can be regarded as one kind of constraint. Moving 
> forward, Helix users would be able to extend the constraint interface with 
> their own logic.
> Some initial discussions are in progress and we will post a proposal here 
> soon.
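
To make the partition weight idea concrete, here is a small greedy sketch that balances total weight per instance instead of partition count. It is not the proposed rebalancer or its constraint interface: WeightedAssign is a hypothetical name, and "heaviest partition first, least-loaded instance next" is a swapped-in textbook heuristic chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: greedy weight-aware assignment of partitions to instances.
final class WeightedAssign {
  static Map<String, List<String>> assign(Map<String, Integer> weights, List<String> instances) {
    if (instances.isEmpty()) {
      throw new IllegalArgumentException("at least one instance is required");
    }
    Map<String, List<String>> result = new LinkedHashMap<>();
    Map<String, Integer> load = new HashMap<>();
    for (String instance : instances) {
      result.put(instance, new ArrayList<String>());
      load.put(instance, 0);
    }
    // Heaviest partitions first; ties broken by name so the result is deterministic.
    List<Map.Entry<String, Integer>> parts = new ArrayList<>(weights.entrySet());
    parts.sort((a, b) -> {
      int byWeight = Integer.compare(b.getValue(), a.getValue());
      return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
    });
    for (Map.Entry<String, Integer> part : parts) {
      // Pick the instance with the smallest accumulated weight so far.
      String target = instances.get(0);
      for (String instance : instances) {
        if (load.get(instance) < load.get(target)) {
          target = instance;
        }
      }
      result.get(target).add(part.getKey());
      load.put(target, load.get(target) + part.getValue());
    }
    return result;
  }
}
```

With weights 5, 3, and 2 on two instances, a count-based split would place two partitions on one instance regardless of size, whereas this sketch ends with both instances carrying a total weight of 5.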





[jira] [Commented] (HELIX-674) Constraint Based Resource Rebalancer

2018-04-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448868#comment-16448868
 ] 

ASF GitHub Bot commented on HELIX-674:
--

Github user lei-xia commented on the issue:

https://github.com/apache/helix/pull/145
  
Can you rebase to HEAD?


> Constraint Based Resource Rebalancer
> 
>
> Key: HELIX-674
> URL: https://issues.apache.org/jira/browse/HELIX-674
> Project: Apache Helix
>  Issue Type: New Feature
>Reporter: Jiajun Wang
>Assignee: Jiajun Wang
>Priority: Major
> Fix For: 0.8.x
>
> Attachments: Constraint-BasedResourceRebalancing-080318-2226-240.pdf
>
>
> The Helix rebalancer assigns resources according to different strategies. 
> Recently, we optimized the strategy for evenness and minimal movement. 
> However, evenness here applies only to partition counts. Moreover, we have 
> received more requests from our users for a customizable rebalancer.
> Take partition weight as an example:
> In reality, partition replicas have different sizes. We use "partition weight" 
> as an abstraction of partition size. It can be network traffic usage, 
> disk usage, or any other combination of factors.
> Given that each partition may have a different weight, Helix should be able 
> to assign partitions accordingly, so that the distribution is even with 
> respect to weight.
> In this project, we are planning a new rebalancer mechanism that generates 
> resource partition assignments according to a list of "constraints". The 
> current rebalance strategy can be regarded as one kind of constraint. Moving 
> forward, Helix users would be able to extend the constraint interface with 
> their own logic.
> Some initial discussions are in progress and we will post a proposal here 
> soon.





[jira] [Commented] (HELIX-696) Workflow state messed up after timeout, and is not cleaned

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446024#comment-16446024
 ] 

ASF GitHub Bot commented on HELIX-696:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/191


> Workflow state messed up after timeout, and is not cleaned
> --
>
> Key: HELIX-696
> URL: https://issues.apache.org/jira/browse/HELIX-696
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> There are a few problems with the current workflow finish handling logic:
>  # After timeout, no timer is scheduled to clean the workflow up when it expires
>  # After timeout, the state handling logic is messy: the state of a previously 
> stopped workflow flip-flops between TIMED_OUT and STOPPED
>  # The MBean is not updated correctly because we update latency before setting 
> the finish time
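
The third item is an ordering problem, which the following sketch illustrates. WorkflowTimes is a hypothetical class, not the Helix workflow context or MBean; it only shows that latency has to be derived after the finish time is recorded.

```java
// Hypothetical sketch: latency is only meaningful once the finish time has been set.
final class WorkflowTimes {
  static final long UNSET = -1L;

  private final long startTimeMs;
  private long finishTimeMs = UNSET;

  WorkflowTimes(long startTimeMs) {
    this.startTimeMs = startTimeMs;
  }

  // Records the finish time first, then returns the latency to report.
  long markFinished(long nowMs) {
    finishTimeMs = nowMs;
    return latencyMs();
  }

  // Refuses to compute a latency from an unset finish time.
  long latencyMs() {
    if (finishTimeMs == UNSET) {
      throw new IllegalStateException("finish time not set yet");
    }
    return finishTimeMs - startTimeMs;
  }
}
```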





[jira] [Commented] (HELIX-696) Workflow state messed up after timeout, and is not cleaned

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444970#comment-16444970
 ] 

ASF GitHub Bot commented on HELIX-696:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/191

[HELIX-696] fix workflow state flip-flop issue

Fixed issues:
* After timeout, no timer is scheduled to clean the workflow up when it expires
* After timeout, the state handling logic is messy: the state of a previously 
stopped workflow flip-flops between TIMED_OUT and STOPPED
* The MBean is not updated correctly because we update latency before setting 
the finish time

Added tests to verify that the changes work.

Note that the current task framework logic is messy, and this PR focuses on 
fixing these issues rather than on a major refactor, which is sufficient given 
that we are working on task framework 2.0.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/workflow-state-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/191.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #191


commit ecacb49b33327ec2dd3d50471f88c99d138ba24c
Author: Harry Zhang 
Date:   2018-04-19T23:13:13Z

[HELIX-696] fix workflow state flip-flop issue




> Workflow state messed up after timeout, and is not cleaned
> --
>
> Key: HELIX-696
> URL: https://issues.apache.org/jira/browse/HELIX-696
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> There are a few problems with the current workflow finish handling logic:
>  # After timeout, no timer is scheduled to clean the workflow up when it expires
>  # After timeout, the state handling logic is messy: the state of a previously 
> stopped workflow flip-flops between TIMED_OUT and STOPPED
>  # The MBean is not updated correctly because we update latency before setting 
> the finish time





[jira] [Commented] (HELIX-699) Compare InstanceConfigs using their IDs in RoutingTable

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444811#comment-16444811
 ] 

ASF GitHub Bot commented on HELIX-699:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/188


> Compare InstanceConfigs using their IDs in RoutingTable
> ---
>
> Key: HELIX-699
> URL: https://issues.apache.org/jira/browse/HELIX-699
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> A possible race condition was causing an NPE on InstanceConfig.getHostName(). 
> Instead of comparing hostnames and ports, we now compare IDs, which are 
> supposed to be a concatenation of instance name, hostname, and port anyway 
> and should always be set.





[jira] [Commented] (HELIX-698) Add periodic refresh to RoutingTableProvider

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444800#comment-16444800
 ] 

ASF GitHub Bot commented on HELIX-698:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/187


> Add periodic refresh to RoutingTableProvider 
> -
>
> Key: HELIX-698
> URL: https://issues.apache.org/jira/browse/HELIX-698
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> There have been incidents where RoutingTableProvider was not refreshed 
> properly, potentially due to lag in the ZKClient CallbackHandler or 
> connectivity issues. Adding a periodic refresh keeps RoutingTableProvider 
> from being severely delayed in such cases.





[jira] [Commented] (HELIX-697) Add cluster level metrics in ClusterStatusMonitor

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444788#comment-16444788
 ] 

ASF GitHub Bot commented on HELIX-697:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/186


> Add cluster level metrics in ClusterStatusMonitor
> -
>
> Key: HELIX-697
> URL: https://issues.apache.org/jira/browse/HELIX-697
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> Add cluster level metrics in ClusterStatusMonitor





[jira] [Commented] (HELIX-699) Compare InstanceConfigs using their IDs in RoutingTable

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444766#comment-16444766
 ] 

ASF GitHub Bot commented on HELIX-699:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/188

[HELIX-699] Compare InstanceConfigs using their IDs in RoutingTable

A possible race condition was causing an NPE on 
InstanceConfig.getHostName(). Instead of comparing hostnames and ports, we 
now compare IDs, which are supposed to be a concatenation of instance name, 
hostname, and port anyway and should always be set.
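
A minimal sketch of why comparing by ID avoids the NPE. InstanceInfo and InstanceOrdering are hypothetical stand-ins, not InstanceConfig or the RoutingTable comparator; the sketch assumes, as the description states, that the ID is always set while the hostname may be null.

```java
import java.util.Comparator;

// Hypothetical stand-in for an instance config whose hostName may be null during a race.
final class InstanceInfo {
  final String id;       // assumed to be always set
  final String hostName; // may be null
  final String port;     // may be null

  InstanceInfo(String id, String hostName, String port) {
    this.id = id;
    this.hostName = hostName;
    this.port = port;
  }
}

final class InstanceOrdering {
  // Orders instances by ID only, so null hostnames or ports are never dereferenced.
  static final Comparator<InstanceInfo> BY_ID = new Comparator<InstanceInfo>() {
    @Override
    public int compare(InstanceInfo a, InstanceInfo b) {
      return a.id.compareTo(b.id);
    }
  };
}
```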

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix instConfigNullCheck

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #188


commit 2b076c1f97dca95ef4ad817fd45d47c1ec4ff337
Author: Hunter Lee 
Date:   2018-04-19T20:47:28Z

[HELIX-699] Compare InstanceConfigs using their IDs in RoutingTable

A possible race condition was causing an NPE on 
InstanceConfig.getHostName(). Instead of comparing hostnames and ports, we 
now compare IDs, which are supposed to be a concatenation of instance name, 
hostname, and port anyway and should always be set.




> Compare InstanceConfigs using their IDs in RoutingTable
> ---
>
> Key: HELIX-699
> URL: https://issues.apache.org/jira/browse/HELIX-699
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> A possible race condition was causing an NPE on InstanceConfig.getHostName(). 
> Instead of comparing hostnames and ports, we now compare IDs, which are 
> supposed to be a concatenation of instance name, hostname, and port anyway 
> and should always be set.





[jira] [Commented] (HELIX-698) Add periodic refresh to RoutingTableProvider

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444758#comment-16444758
 ] 

ASF GitHub Bot commented on HELIX-698:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/187

[HELIX-698] Add periodic refresh to RoutingTableProvider

There have been incidents where RoutingTableProvider was not refreshed 
properly, potentially due to lag in the ZKClient CallbackHandler or 
connectivity issues. Adding a periodic refresh keeps RoutingTableProvider 
from being severely delayed in such cases.
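
A generic sketch of a periodic refresh using the JDK's ScheduledExecutorService. PeriodicRefresher is a hypothetical class and the refresh callback is a placeholder; this is not how RoutingTableProvider is implemented, only an illustration of a timer-driven refresh that complements callback-driven updates.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: runs a refresh callback at a fixed period until closed.
final class PeriodicRefresher implements AutoCloseable {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  PeriodicRefresher(final Runnable refresh, long periodMs) {
    scheduler.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        try {
          refresh.run();
        } catch (RuntimeException e) {
          // An uncaught exception would cancel all future runs, so log and continue.
          System.err.println("Periodic refresh failed: " + e);
        }
      }
    }, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
```

Catching the exception inside the task matters because scheduleAtFixedRate suppresses subsequent executions once a run throws.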

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix periodicRefresh

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/187.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #187


commit efc1e81c24c23c4dfc61c85c14708533d30b032c
Author: Hunter Lee 
Date:   2018-04-19T20:42:37Z

[HELIX-698] Add periodic refresh to RoutingTableProvider

There have been incidents where RoutingTableProvider was not refreshed 
properly, potentially due to lag in the ZKClient CallbackHandler or 
connectivity issues. Adding a periodic refresh keeps RoutingTableProvider 
from being severely delayed in such cases.




> Add periodic refresh to RoutingTableProvider 
> -
>
> Key: HELIX-698
> URL: https://issues.apache.org/jira/browse/HELIX-698
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> There have been incidents where RoutingTableProvider was not refreshed 
> properly, potentially due to lag in the ZKClient CallbackHandler or 
> connectivity issues. Adding a periodic refresh keeps RoutingTableProvider 
> from being severely delayed in such cases.





[jira] [Commented] (HELIX-697) Add cluster level metrics in ClusterStatusMonitor

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444752#comment-16444752
 ] 

ASF GitHub Bot commented on HELIX-697:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/186

[HELIX-697] Add cluster level metrics in ClusterStatusMonitor



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix clusterLevelMetrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/186.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #186


commit e1faf2404c3bb74aab7c402d76246b41af74fd16
Author: Hunter Lee 
Date:   2018-04-19T20:33:54Z

[HELIX-697] Add cluster level metrics in ClusterStatusMonitor




> Add cluster level metrics in ClusterStatusMonitor
> -
>
> Key: HELIX-697
> URL: https://issues.apache.org/jira/browse/HELIX-697
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> Add cluster level metrics in ClusterStatusMonitor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HELIX-690) Batch message should not share same NotificationContext object to update CurrentState

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444513#comment-16444513
 ] 

ASF GitHub Bot commented on HELIX-690:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/181


> Batch message should not share same NotificationContext object to update 
> CurrentState
> -
>
> Key: HELIX-690
> URL: https://issues.apache.org/jira/browse/HELIX-690
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Batch messages currently have the following bugs:
>  1. Batch messages trigger many duplicated state transition messages from 
> the controller, resulting in "state does not match" errors on the participant 
> side. This in turn creates many ERROR znodes in ZK, which adds to both the 
> read and write workload on the participant and the controller.
>  2. We also see many concurrent update exceptions:
> {noformat}
> [2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> java.util.ConcurrentModificationException
>  at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
>  at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
>  at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
>  at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
>  at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
>  at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
>  at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
>  at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
>  at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> Both errors result from the fact that, in HelixTaskExecutor, all HelixTask 
> objects created from the same batch of messages share the same changeContext 
> object. For a batch message, HelixTask creates a current state update map to 
> record current state updates, which leads to a race condition in current 
> state recording. Because of this bug it is quite common that a resource's 
> current state changes on the participant side but is not updated in ZK; 
> after the message is removed, the controller still thinks that the state 
> transition is not finished and sends a duplicated state transition message.
>  
> The error is triggered only when the load is high, so it is not covered by 
> our unit / e2e tests.
> To fix the issue, we should create a deep copy of the NotificationContext 
> object for each HelixTask in HelixTaskExecutor. I tried this fix with large 
> data sets, and it worked.
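
The proposed fix can be illustrated with a small sketch. TaskContext is a hypothetical stand-in for NotificationContext and models only a map of state updates; the point is that each task receives its own copy, so one task writing to its map cannot interfere with another task iterating over its own.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a per-task context holding current state updates.
final class TaskContext {
  final Map<String, String> stateUpdates = new TreeMap<>();

  TaskContext() {}

  // Copy constructor: String keys and values are immutable, so copying the
  // entries into a new map is effectively a deep copy for this sketch.
  TaskContext(TaskContext other) {
    stateUpdates.putAll(other.stateUpdates);
  }
}
```

In this sketch, an executor would call new TaskContext(shared) once per task instead of handing the same shared instance to every task in the batch.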





[jira] [Commented] (HELIX-695) Add Helix Manager listener for new connection notification

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1692#comment-1692
 ] 

ASF GitHub Bot commented on HELIX-695:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/182


> Add Helix Manager listener for new connection notification
> --
>
> Key: HELIX-695
> URL: https://issues.apache.org/jira/browse/HELIX-695
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Currently HelixManager does not notify the state listener when a connection 
> is established. Adding this notification is useful because HelixManager 
> exposes a method to get its ZkClient: when the connection is re-established, 
> a new ZkClient is created, so users who obtained the client through that 
> method should be notified and refresh their reference.





[jira] [Commented] (HELIX-695) Add Helix Manager listener for new connection notification

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439730#comment-16439730
 ] 

ASF GitHub Bot commented on HELIX-695:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/182

[HELIX-695] add helix manager listener for new connection notification

In this PR I added invocation and related tests of 
`stateListener.onConnected()` method in ZkHelixManager when it is connected.
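
The notification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Helix implementation: `StateListenerSketch`, `ManagerSketch`, and `CachedClientHolder` are hypothetical stand-ins, and a plain string stands in for the ZkClient handle.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a manager state listener; not the Helix interface.
interface StateListenerSketch {
  void onConnected(ManagerSketch manager);
}

// Hypothetical stand-in for the manager. A String plays the role of the client.
final class ManagerSketch {
  private final StateListenerSketch listener;
  private int generation = 0;
  private String client;

  ManagerSketch(StateListenerSketch listener) {
    this.listener = listener;
  }

  String getClient() {
    return client;
  }

  // Every (re)connect creates a new client and then notifies the listener.
  void connect() {
    generation++;
    client = "client-" + generation;
    if (listener != null) {
      listener.onConnected(this);
    }
  }
}

// A user that caches the client and refreshes it on every notification.
final class CachedClientHolder implements StateListenerSketch {
  private final AtomicReference<String> cached = new AtomicReference<>();

  @Override
  public void onConnected(ManagerSketch manager) {
    cached.set(manager.getClient());
  }

  String current() {
    return cached.get();
  }
}
```

Without the callback, a holder that fetched the client once would keep pointing at the client from the previous connection.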

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-manager-onconnected

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/182.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #182


commit 65e84713503437c542e545abd521c2ba6d26
Author: Harry Zhang 
Date:   2018-04-16T17:05:30Z

[HELIX-695] add helix manager listener for new connection notification




> Add Helix Manager listener for new connection notification
> --
>
> Key: HELIX-695
> URL: https://issues.apache.org/jira/browse/HELIX-695
> Project: Apache Helix
>  Issue Type: Task
>Reporter: Hao Zhang
>Priority: Major
>
> Currently HelixManager does not notify the state listener when a connection 
> is established. Adding this notification is useful because HelixManager 
> exposes a method to get its ZkClient: when the connection is re-established, 
> a new ZkClient is created, so users who obtained the client through that 
> method should be notified and refresh their reference.





[jira] [Commented] (HELIX-690) Batch message should not share same NotificationContext object to update CurrentState

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439716#comment-16439716
 ] 

ASF GitHub Bot commented on HELIX-690:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/181

[HELIX-690] batch message execution should not share same context

In this PR, I added deep copy methods to NotificationContext so that, when 
processing messages in batch, different threads do not share the same 
notification context.

This ensures that, when processing BatchMessages, each thread has its own 
current state delta to work on, so current states won't be messed up.

I also modified some logs to make them more useful when debugging.
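
The idea of giving each task its own copy of the context can be sketched as follows. This is a simplified illustration, not the code in this PR: `ContextSketch` is a hypothetical stand-in for NotificationContext that holds only a current state update map, and `BatchSketch.contextsFor` is a made-up helper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NotificationContext; it carries only the
// per-task current state update map discussed in the issue.
final class ContextSketch {
  final Map<String, String> currentStateUpdates = new HashMap<>();

  // Copies the map into a new instance. The values are immutable Strings,
  // so copying the map is enough to make the copy independent.
  ContextSketch deepCopy() {
    ContextSketch copy = new ContextSketch();
    copy.currentStateUpdates.putAll(currentStateUpdates);
    return copy;
  }
}

final class BatchSketch {
  // Gives every task in a batch its own context, so that no two tasks
  // ever write to the same update map.
  static List<ContextSketch> contextsFor(ContextSketch shared, int taskCount) {
    List<ContextSketch> contexts = new ArrayList<>();
    for (int i = 0; i < taskCount; i++) {
      contexts.add(shared.deepCopy());
    }
    return contexts;
  }
}
```

With a single shared instance, concurrent writers would mutate one map while another thread iterates over it, which matches the ConcurrentModificationException quoted in the issue.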

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/batch-msg-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #181


commit bb7751b0f52aadcf04b7813fa3e99c8e266a3d0b
Author: Harry Zhang 
Date:   2018-04-16T16:55:43Z

[HELIX-690] batch message execution should not share same context




> Batch message should not share same NotificationContext object to update 
> CurrentState
> -
>
> Key: HELIX-690
> URL: https://issues.apache.org/jira/browse/HELIX-690
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently batch message has two bugs:
>  1. Batch message triggers a lot of duplicated state transition messages 
> sent from the controller, resulting in "state does not match" errors on the 
> participant side. This in turn creates a lot of ERROR znodes in ZK, which 
> adds to both the read and write workload on participant and controller.
>  2. We see a lot of concurrent update exceptions as well:
> {noformat}
> 9909348:[2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] 
> [org.apache.helix.messaging.handling.HelixTask:113] - Exception while 
> executing a message. java.util.ConcurrentModificationException msgId: 
> fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> 9909349-java.util.ConcurrentModificationException
> 9909350- at 
> java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
> 9909351- at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
> 9909352- at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
> 9909353- at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
> 9909354- at 
> org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
> 9909355- at 
> org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
> 9909356- at 
> org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
> 9909357- at 
> org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
> 9909358- at 
> org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> Both errors arise because, in HelixTaskExecutor, all HelixTask objects 
> created from the same batch of messages share the same changeContext object. 
> For a batch message, HelixTask creates a current state update map to record 
> current state updates, so sharing the context causes a race condition in 
> current state recording. As a result, it is common for a resource's current 
> state to change on the participant side without being updated in ZK; after 
> the message is removed, the controller still thinks the state transition is 
> not finished and sends a duplicate state transition message.
>  
> The error is triggered only under high load, so it is not covered by our 
> unit / e2e tests.
> To fix the issue, we should create a deep copy of the NotificationContext 
> object for each HelixTask in HelixTaskExecutor. I tried this fix using large 
> data sets, and it worked.
>  
>  





[jira] [Commented] (HELIX-689) Controller message cleanup is spitting too many logs

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431109#comment-16431109
 ] 

ASF GitHub Bot commented on HELIX-689:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/173


> Controller message cleanup is spitting too many logs
> 
>
> Key: HELIX-689
> URL: https://issues.apache.org/jira/browse/HELIX-689
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently we print an error log when we fail to remove a message. However, 
> due to a ZK client limitation, we print these logs even when the message has 
> already been deleted, which should not be regarded as a failure.
> We need to clean up the logging and only print a log when there is a real 
> error.





[jira] [Commented] (HELIX-692) Use map instead of set in controller's message cleanup logic

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431108#comment-16431108
 ] 

ASF GitHub Bot commented on HELIX-692:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/175


> Use map instead of set in controller's message cleanup logic
> 
>
> Key: HELIX-692
> URL: https://issues.apache.org/jira/browse/HELIX-692
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> This is to avoid duplicated cleanups of the same message: under batch message 
> mode, we store the same message under all resources, which causes extra 
> deletion API calls for the same message.





[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431104#comment-16431104
 ] 

ASF GitHub Bot commented on HELIX-691:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/176


> Allow users to update InstanceConfig
> 
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> In helix-rest, we provide in InstanceAccessor a method, updateInstanceConfig, 
> that updates the instance's config through a POST call.





[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431069#comment-16431069
 ] 

ASF GitHub Bot commented on HELIX-691:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/176

[HELIX-691] Allow users to update InstanceConfig

In helix-rest, we provide a method in InstanceAccessor, 
updateInstanceConfig, that updates the instance's config through a POST call.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix instConfig2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #176


commit 72d52484716bf2f90323f141a478893fdd0843f1
Author: narendly 
Date:   2018-04-09T19:04:26Z

[HELIX-691] Allow users to update InstanceConfig

In helix-rest, we provide a method in InstanceAccessor, 
updateInstanceConfig, that updates the instance's config through a POST call.




> Allow users to update InstanceConfig
> 
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> In helix-rest, we provide in InstanceAccessor a method, updateInstanceConfig, 
> that updates the instance's config through a POST call.





[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431061#comment-16431061
 ] 

ASF GitHub Bot commented on HELIX-691:
--

Github user narendly closed the pull request at:

https://github.com/apache/helix/pull/174


> Allow users to update InstanceConfig
> 
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> In helix-rest, we provide in InstanceAccessor a method, updateInstanceConfig, 
> that updates the instance's config through a POST call.





[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430971#comment-16430971
 ] 

ASF GitHub Bot commented on HELIX-691:
--

Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/174#discussion_r180175528
  
--- Diff: 
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/InstanceAccessor.java
 ---
@@ -223,60 +224,60 @@ public Response 
updateInstance(@PathParam("clusterId") String clusterId,
   }
 
   switch (cmd) {
-  case enable:
-admin.enableInstance(clusterId, instanceName, true);
-break;
-  case disable:
-admin.enableInstance(clusterId, instanceName, false);
-break;
-  case reset:
-if (!validInstance(node, instanceName)) {
-  return badRequest("Instance names are not match!");
-}
-admin.resetPartition(clusterId, instanceName,
-node.get(InstanceProperties.resource.name()).toString(), 
(List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.partitions.name()).toString(),
-OBJECT_MAPPER.getTypeFactory()
-.constructCollectionType(List.class, 
String.class)));
-break;
-  case addInstanceTag:
-if (!validInstance(node, instanceName)) {
-  return badRequest("Instance names are not match!");
-}
-for (String tag : (List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.instanceTags.name()).toString(),
-
OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, 
String.class))) {
-  admin.addInstanceTag(clusterId, instanceName, tag);
-}
-break;
-  case removeInstanceTag:
-if (!validInstance(node, instanceName)) {
-  return badRequest("Instance names are not match!");
-}
-for (String tag : (List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.instanceTags.name()).toString(),
-
OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, 
String.class))) {
-  admin.removeInstanceTag(clusterId, instanceName, tag);
-}
-break;
-  case enablePartitions:
-admin.enablePartition(true, clusterId, instanceName,
-node.get(InstanceProperties.resource.name()).getTextValue(),
-(List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.partitions.name()).toString(),
-OBJECT_MAPPER.getTypeFactory()
-.constructCollectionType(List.class, 
String.class)));
-break;
-  case disablePartitions:
-admin.enablePartition(false, clusterId, instanceName,
-node.get(InstanceProperties.resource.name()).getTextValue(),
-(List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.partitions.name()).toString(),
-
OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, 
String.class)));
-break;
-  default:
-_logger.error("Unsupported command :" + command);
-return badRequest("Unsupported command :" + command);
+case enable:
--- End diff --

Helix's formatter does not indent case labels; could you please revert this? 
The same applies to other places.


> Allow users to update InstanceConfig
> 
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> In helix-rest, we provide in InstanceAccessor a method, updateInstanceConfig, 
> that updates the instance's config through a POST call.





[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430970#comment-16430970
 ] 

ASF GitHub Bot commented on HELIX-691:
--

Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/174#discussion_r180178181
  
--- Diff: 
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/InstanceAccessor.java
 ---
@@ -315,26 +316,27 @@ public Response 
getInstanceConfig(@PathParam("clusterId") String clusterId,
 return notFound();
   }
 
-  @PUT
+  @POST
--- End diff --

PUT is for "set" and POST is for "patch", I'd suggest we keep both. 
@dasahcc thoughts?
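
The distinction drawn in this comment, PUT as "set" and POST as "patch", can be illustrated on a plain key/value map. This is a minimal sketch that assumes the config behaves like a flat map; `ConfigUpdateSketch`, `set`, and `patch` are hypothetical names and not part of helix-rest.

```java
import java.util.HashMap;
import java.util.Map;

final class ConfigUpdateSketch {
  // "set" semantics: the incoming config replaces the current one entirely.
  static Map<String, String> set(Map<String, String> current, Map<String, String> incoming) {
    return new HashMap<>(incoming);
  }

  // "patch" semantics: incoming keys overwrite or extend the current config,
  // and keys that are absent from the request are left untouched.
  static Map<String, String> patch(Map<String, String> current, Map<String, String> incoming) {
    Map<String, String> merged = new HashMap<>(current);
    merged.putAll(incoming);
    return merged;
  }
}
```

Keeping both endpoints lets a caller choose between replacing the whole config and changing a few fields without resending the rest.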


> Allow users to update InstanceConfig
> 
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> In helix-rest, we provide in InstanceAccessor a method, updateInstanceConfig, 
> that updates the instance's config through a POST call.





[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig

2018-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430969#comment-16430969
 ] 

ASF GitHub Bot commented on HELIX-691:
--

Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/174#discussion_r180174778
  
--- Diff: 
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/ResourceAccessor.java
 ---
@@ -84,10 +96,66 @@ public Response getResources(@PathParam("clusterId") 
String clusterId) {
 return JSONRepresentation(root);
   }
 
+  /**
--- End diff --

Partition health related changes are not part of this PR (allow users to 
change instance config); can we file a separate issue for them?


> Allow users to update InstanceConfig
> 
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> In helix-rest, we provide in InstanceAccessor a method, updateInstanceConfig, 
> that updates the instance's config through a POST call.





[jira] [Commented] (HELIX-692) Use map instead of set in controller's message cleanup logic

2018-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429130#comment-16429130
 ] 

ASF GitHub Bot commented on HELIX-692:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/175

[HELIX-692] use map instead of list to avoid deleting redundant message 
during cleanup

Currently in MessageGenerationPhase, we use a list to store messages to 
GC. However, pending messages are stored per resource/partition/instance, and 
under batch message mode the same message is stored once for each partition in 
the batch, so we end up cleaning up the same message many times.

This RB changes the list to a map to avoid redundant cleanup.
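
The deduplication can be sketched as follows. This is an illustration under assumptions, not the code in MessageGenerationPhase: `CleanupSketch` and `messagesToCleanUp` are hypothetical names, the input is simplified to a partition-to-message-id map, and the path layout in the values is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CleanupSketch {
  // Builds the set of messages to delete, keyed by message id. Under batch
  // mode the same id shows up once per partition, and keying by id collapses
  // those repeats into a single entry.
  static Map<String, String> messagesToCleanUp(
      Map<String, String> pendingMessageIdByPartition, String instance) {
    Map<String, String> pathByMessageId = new LinkedHashMap<>();
    for (String messageId : pendingMessageIdByPartition.values()) {
      // Hypothetical path layout, used only to give the map a value.
      pathByMessageId.putIfAbsent(messageId, instance + "/MESSAGES/" + messageId);
    }
    return pathByMessageId;
  }
}
```

With a list, a batch message covering three partitions would be queued for deletion three times; with the map it is queued once.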

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/HELIX-692

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/175.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #175


commit ce7d2e9d275e1403375edd94d63afa94ea1a2234
Author: Harry Zhang 
Date:   2018-04-06T23:27:02Z

[HELIX-692] use map instead of list to avoid deleting redundant message 
during cleanup




> Use map instead of set in controller's message cleanup logic
> 
>
> Key: HELIX-692
> URL: https://issues.apache.org/jira/browse/HELIX-692
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> This is to avoid duplicated cleanups of the same message: under batch message 
> mode, we store the same message under all resources, which causes extra 
> deletion API calls for the same message.





[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig

2018-04-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427722#comment-16427722
 ] 

ASF GitHub Bot commented on HELIX-691:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/174

[HELIX-691] Allow users to update InstanceConfig

In helix-rest, we provide a method in InstanceAccessor, 
updateInstanceConfig, that updates the instance's config through a POST call.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix instConfig

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/174.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #174


commit 1ebdc45194dddecb675362e812aa795c876c9f6a
Author: narendly 
Date:   2018-04-05T23:00:52Z

[HELIX-691] Allow users to update InstanceConfig

In helix-rest, we provide a method in InstanceAccessor, 
updateInstanceConfig, that updates the instance's config through a POST call.




> Allow users to update InstanceConfig
> 
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> In helix-rest, we provide in InstanceAccessor a method, updateInstanceConfig, 
> that updates the instance's config through a POST call.





[jira] [Commented] (HELIX-689) Controller message cleanup is spitting too many logs

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424634#comment-16424634
 ] 

ASF GitHub Bot commented on HELIX-689:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/173

[HELIX-689] remove redundant logs from zkclient

Currently, in controller message cleanup, we print 2 log lines when a message 
does not exist, which is redundant. In this PR, I removed the warning message 
from the controller and added an error message in zkclient that is printed only 
when there is a real error (an exception from below). If we fail to delete a 
ZNode because the znode does not exist, we no longer print a message except in 
debug mode.
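
The logging policy described above can be sketched as follows. This is a minimal illustration, not the zkclient change itself: `StoreSketch` and `QuietDelete` are hypothetical, and `NoSuchElementException` stands in for whatever "node does not exist" error the real client raises.

```java
import java.util.NoSuchElementException;

// Hypothetical store; delete throws NoSuchElementException if the path is missing.
interface StoreSketch {
  void delete(String path);
}

final class QuietDelete {
  enum Result { DELETED, ALREADY_GONE, FAILED }

  static Result delete(StoreSketch store, String path) {
    try {
      store.delete(path);
      return Result.DELETED;
    } catch (NoSuchElementException e) {
      // The node is already gone, which is the desired end state.
      // At most a debug-level log belongs here, so nothing is printed.
      return Result.ALREADY_GONE;
    } catch (RuntimeException e) {
      // Only a real failure produces an error line.
      System.err.println("Failed to delete " + path + ": " + e.getMessage());
      return Result.FAILED;
    }
  }
}
```

Returning a result instead of rethrowing lets the caller decide whether a missing node matters, which for message cleanup it does not.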

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/ctl-msg-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #173


commit d96a40caf19efffed3939b6dd8d9efe40734ec15
Author: Harry Zhang 
Date:   2018-04-03T21:22:53Z

[HELIX-689] remove redundant logs from zkclient




> Controller message cleanup is spitting too many logs
> 
>
> Key: HELIX-689
> URL: https://issues.apache.org/jira/browse/HELIX-689
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently we print an error log when we fail to remove a message. However, 
> due to a ZK client limitation, we print these logs even when the message has 
> already been deleted, which should not be regarded as a failure.
> We need to clean up the logging and only print a log when there is a real 
> error.





[jira] [Commented] (HELIX-688) Add method that returns start time of the most recent task scheduled

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418292#comment-16418292
 ] 

ASF GitHub Bot commented on HELIX-688:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/170


> Add method that returns start time of the most recent task scheduled
> 
>
> Key: HELIX-688
> URL: https://issues.apache.org/jira/browse/HELIX-688
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> getLastScheduledTaskTimestamp returns the timestamp for the start time of the 
> task that was scheduled last. Clients of Task Framework may use this API 
> against their time to completion metric to determine if a given 
> workflow/job/task is stuck.





[jira] [Commented] (HELIX-688) Add method that returns start time of the most recent task scheduled

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416462#comment-16416462
 ] 

ASF GitHub Bot commented on HELIX-688:
--

GitHub user narendly opened a pull request:

https://github.com/apache/helix/pull/170

[HELIX-688] Add method that returns start time of the most recent task 
scheduled

getLastScheduledTaskTimestamp returns the timestamp for the start time of 
the task that was scheduled last. Clients of Task Framework may use this API 
against their time to completion metric to determine if a given 
workflow/job/task is stuck.
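
One way a client might combine this timestamp with its own time-to-completion metric is sketched below. `StuckCheck.looksStuck` is a hypothetical helper, and treating a negative timestamp as "nothing scheduled yet" is an assumption for this sketch, not documented behavior of getLastScheduledTaskTimestamp.

```java
final class StuckCheck {
  // lastScheduledMillis: start time of the most recently scheduled task, as a
  // client would obtain it from an API like getLastScheduledTaskTimestamp.
  // expectedCompletionMillis: the client's own time-to-completion metric.
  static boolean looksStuck(
      long lastScheduledMillis, long nowMillis, long expectedCompletionMillis) {
    if (lastScheduledMillis < 0) {
      // Assumption: a negative value means no task has been scheduled yet.
      return false;
    }
    return nowMillis - lastScheduledMillis > expectedCompletionMillis;
  }
}
```

A workflow whose last task started long before the expected completion window, with nothing scheduled since, is a candidate for closer inspection.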

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/narendly/helix lasttask

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #170


commit 13b131f7c2ec4c40e92beecb2552969157bd4882
Author: narendly 
Date:   2018-03-27T23:25:22Z

[HELIX-688] Add method that returns start time of the most recent task 
scheduled

getLastScheduledTaskTimestamp returns the timestamp for the start time of 
the task that was scheduled last. Clients of Task Framework may use this API 
against their time to completion metric to determine if a given 
workflow/job/task is stuck.




> Add method that returns start time of the most recent task scheduled
> 
>
> Key: HELIX-688
> URL: https://issues.apache.org/jira/browse/HELIX-688
> Project: Apache Helix
>  Issue Type: Improvement
>Reporter: Hunter L
>Priority: Major
>
> getLastScheduledTaskTimestamp returns the timestamp for the start time of the 
> task that was scheduled last. Clients of Task Framework may use this API 
> against their time to completion metric to determine if a given 
> workflow/job/task is stuck.





[jira] [Commented] (HELIX-683) Clean monitoring cache upon helix controller enable monitoring

2018-03-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414712#comment-16414712
 ] 

ASF GitHub Bot commented on HELIX-683:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/162


> Clean monitoring cache upon helix controller enable monitoring
> --
>
> Key: HELIX-683
> URL: https://issues.apache.org/jira/browse/HELIX-683
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> We found a bug in cluster status reporting of the partition masterless 
> duration.
> The root cause is that the duration is calculated from the controller cache, 
> and currently this cache is not cleaned when leadership changes. As a result, 
> if controller A starts a mastership handoff but is interrupted, the start 
> time is kept in the cache until the next mastership handoff on the same 
> partition happens. The later handoff duration is then calculated from the 
> stale start time, which can make it very large.
> To fix it, we might consider cleaning the cache when leadership changes.





[jira] [Commented] (HELIX-683) Clean monitoring cache upon helix controller enable monitoring

2018-03-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414392#comment-16414392
 ] 

ASF GitHub Bot commented on HELIX-683:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/162

[HELIX-683] clean monitoring cache upon helix controller enable monitoring

In this PR I added methods to clear monitoring records in the cache when we 
enable cluster status monitoring. I also added tests that reproduce the 
following situation, which causes a metrics reporting problem: a resource 
loses its top state, the controller loses leadership, the resource regains 
its top state, and the controller regains leadership.
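
The stale start time problem and the effect of clearing the cache can be sketched as follows. `HandoffMonitorSketch` is a hypothetical stand-in for the controller-side cache and is not the class changed in this PR.

```java
import java.util.HashMap;
import java.util.Map;

final class HandoffMonitorSketch {
  // Partition name -> time at which the partition lost its top state.
  private final Map<String, Long> missingTopStateStart = new HashMap<>();

  void onTopStateMissing(String partition, long nowMillis) {
    missingTopStateStart.putIfAbsent(partition, nowMillis);
  }

  // Returns the handoff duration, or -1 if no start time was recorded.
  long onTopStateRecovered(String partition, long nowMillis) {
    Long start = missingTopStateStart.remove(partition);
    return start == null ? -1L : nowMillis - start;
  }

  // Called when monitoring is (re)enabled, for example after regaining
  // leadership, so that start times from an earlier term are discarded.
  void reset() {
    missingTopStateStart.clear();
  }
}
```

Without the reset, a start time recorded before leadership was lost would be paired with a much later recovery and reported as one very long handoff.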

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/controller-monitor-cache-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #162


commit 373da77547fa1ea4a39c760e80da75e9d453d4f5
Author: Harry Zhang 
Date:   2018-03-26T19:14:07Z

[HELIX-683] clean monitoring cache upon helix controller enable monitoring




> Clean monitoring cache upon helix controller enable monitoring
> --
>
> Key: HELIX-683
> URL: https://issues.apache.org/jira/browse/HELIX-683
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> We found a bug in cluster status reporting of the partition masterless 
> duration.
> The root cause is that the duration is calculated from the controller cache, 
> and currently this cache is not cleaned when leadership changes. As a result, 
> if controller A starts a mastership handoff but is interrupted, the start 
> time is kept in the cache until the next mastership handoff on the same 
> partition happens. The later handoff duration is then calculated from the 
> stale start time, which can make it very large.
> To fix it, we might consider cleaning the cache when leadership changes.





[jira] [Commented] (HELIX-681) Participant should not fail state transition on fail to delete / relay message

2018-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412891#comment-16412891
 ] 

ASF GitHub Bot commented on HELIX-681:
--

Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/152


> Participant should not fail state transition on fail to delete / relay message
> --
>
> Key: HELIX-681
> URL: https://issues.apache.org/jira/browse/HELIX-681
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Hao Zhang
>Priority: Major
>
> Currently we have a general try-catch block in HelixTask and 
> HelixTaskExecutor, which fails the state transition upon any exception thrown 
> from the state transition routine. However, there are at least the following 
> cases in which the state transition should be considered successful:
>  * When we fail to delete the message after successfully handling it and 
> updating current state -> this is because we have already completed the state 
> transition, and current state is consistent between participant and ZK.
>  * When we fail to send out a relay message -> relay messages provide only 
> best-effort delivery, which has nothing to do with the state transition's 
> result. If relaying fails, the controller will resend the message, which 
> ensures correctness.
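
The two cases above can be sketched as follows: the transition itself decides success, while message deletion and relaying are treated as best-effort follow-ups. `TransitionSketch`, `Step`, and `execute` are hypothetical names, and this is not how HelixTask is actually structured.

```java
final class TransitionSketch {
  // A unit of work that may fail with any exception.
  interface Step {
    void run() throws Exception;
  }

  // Returns true when the state transition itself succeeded, regardless of
  // whether the follow-up steps failed.
  static boolean execute(Step transition, Step deleteMessage, Step sendRelay) {
    try {
      transition.run();
    } catch (Exception e) {
      return false; // only a failure here fails the state transition
    }
    bestEffort("delete message", deleteMessage);
    bestEffort("send relay message", sendRelay);
    return true;
  }

  private static void bestEffort(String what, Step step) {
    try {
      step.run();
    } catch (Exception e) {
      // Logged and ignored: the transition has already completed.
      System.err.println("Ignoring failure to " + what + ": " + e.getMessage());
    }
  }
}
```

Scoping the try-catch to the transition step, rather than wrapping everything in one block, is what keeps a cleanup failure from being reported as a failed transition.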




