[jira] [Commented] (HELIX-785) Report helix latency instead of user latency during top state handoff
[ https://issues.apache.org/jira/browse/HELIX-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673659#comment-16673659 ] ASF GitHub Bot commented on HELIX-785: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/292 > Report helix latency instead of user latency during top state handoff > - > > Key: HELIX-785 > URL: https://issues.apache.org/jira/browse/HELIX-785 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Currently we are reporting top state handoff user latency, but we should > report Helix latency instead. Users should have their own way of monitoring > their state transitions. > AC: > 1. Implement reporting Helix latency for top state handoff and test it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-785) Report helix latency instead of user latency during top state handoff
[ https://issues.apache.org/jira/browse/HELIX-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673635#comment-16673635 ] ASF GitHub Bot commented on HELIX-785: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/292 [HELIX-785] Record helix latency instead of user latency in top state handoff metrics - top state handoff reports helix latency instead of user latency - modified test cases You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/top-state-handoff-metrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/292.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #292 commit 37a58cfff91fb5f6608a4a06d1922bb5a5eb9ca1 Author: Harry Zhang Date: 2018-11-02T18:30:15Z [HELIX-785] Record helix latency instead of user latency in top state handoff metrics > Report helix latency instead of user latency during top state handoff > - > > Key: HELIX-785 > URL: https://issues.apache.org/jira/browse/HELIX-785 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Currently we are reporting top state handoff user latency, but we should > report Helix latency instead. Users should have their own way of monitoring > their state transitions. > AC: > 1. Implement reporting Helix latency for top state handoff and test it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content
[ https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672463#comment-16672463 ] ASF GitHub Bot commented on HELIX-780: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/289 > Support get/add rest api for workflow/job/task user content > --- > > Key: HELIX-780 > URL: https://issues.apache.org/jira/browse/HELIX-780 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Need to support get/add rest api for workflow/job/task user content > AC: > * finish implementation > * test code -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content
[ https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672412#comment-16672412 ] ASF GitHub Bot commented on HELIX-780: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/289 [HELIX-780] add task user content related api and added more tests - added get/add task user content rest api - consolidated rest api behavior: when getting/adding user content, if job/workflow does not exist, throw 404 - added more test cases You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/tf-rest-api Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #289 commit 18aa67b6d5c703e5b938b2f915f52a6ca856e889 Author: Harry Zhang Date: 2018-10-09T21:31:00Z [HELIX-780] add task user content related api and added more tests > Support get/add rest api for workflow/job/task user content > --- > > Key: HELIX-780 > URL: https://issues.apache.org/jira/browse/HELIX-780 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Need to support get/add rest api for workflow/job/task user content > AC: > * finish implementation > * test code -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content
[ https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672227#comment-16672227 ] ASF GitHub Bot commented on HELIX-780: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/287 > Support get/add rest api for workflow/job/task user content > --- > > Key: HELIX-780 > URL: https://issues.apache.org/jira/browse/HELIX-780 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Need to support get/add rest api for workflow/job/task user content > AC: > * finish implementation > * test code -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content
[ https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672053#comment-16672053 ] ASF GitHub Bot commented on HELIX-780: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/287 [HELIX-780] add get/add job user content rest api added apis and tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/tf-rest-api Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/287.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #287 commit a09a18ac55464c3e399800b4474ccb6e64d168ec Author: Harry Zhang Date: 2018-10-08T22:36:53Z [HELIX-780] add get/add job user content rest api > Support get/add rest api for workflow/job/task user content > --- > > Key: HELIX-780 > URL: https://issues.apache.org/jira/browse/HELIX-780 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Need to support get/add rest api for workflow/job/task user content > AC: > * finish implementation > * test code -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-780) Support get/add rest api for workflow/job/task user content
[ https://issues.apache.org/jira/browse/HELIX-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672032#comment-16672032 ] ASF GitHub Bot commented on HELIX-780: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/286 > Support get/add rest api for workflow/job/task user content > --- > > Key: HELIX-780 > URL: https://issues.apache.org/jira/browse/HELIX-780 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Need to support get/add rest api for workflow/job/task user content > AC: > * finish implementation > * test code -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-779) Maintenance rebalancer should not clear preference list in ideal state
[ https://issues.apache.org/jira/browse/HELIX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671989#comment-16671989 ] ASF GitHub Bot commented on HELIX-779: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/285 > Maintenance rebalancer should not clear preference list in ideal state > -- > > Key: HELIX-779 > URL: https://issues.apache.org/jira/browse/HELIX-779 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Setting list fields to an empty map will prevent resources that were newly added and initially > rebalanced during maintenance mode from getting rebalanced after the > cluster exits maintenance mode. > The right thing to do is to clear every preference list. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-779) Maintenance rebalancer should not clear preference list in ideal state
[ https://issues.apache.org/jira/browse/HELIX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671943#comment-16671943 ] ASF GitHub Bot commented on HELIX-779: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/285 [HELIX-779] do not clean list field in maintenance rebalancer for new resources Setting list fields to an empty map will prevent resources that were newly added and initially rebalanced during maintenance mode from getting rebalanced after the cluster exits maintenance mode. The right thing to do is to clear every preference list. Also added a test case to verify. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/maintenance-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/285.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #285 commit bfaa8399529b6e63b307c1fbe60903c3ca08fbb1 Author: Harry Zhang Date: 2018-10-04T22:50:16Z [HELIX-779] do not clean list field in maintenance rebalancer for new resources > Maintenance rebalancer should not clear preference list in ideal state > -- > > Key: HELIX-779 > URL: https://issues.apache.org/jira/browse/HELIX-779 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Setting list fields to an empty map will prevent resources that were newly added and initially > rebalanced during maintenance mode from getting rebalanced after the > cluster exits maintenance mode. > The right thing to do is to clear every preference list. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content
[ https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670734#comment-16670734 ] ASF GitHub Bot commented on HELIX-775: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/283 > Task driver should support add/get task framework user content > -- > > Key: HELIX-775 > URL: https://issues.apache.org/jira/browse/HELIX-775 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Task driver should support add/get task framework user content at > workflow/job/task levels > > AC: > * finish implementation > * add tests -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content
[ https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670730#comment-16670730 ] ASF GitHub Bot commented on HELIX-775: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/283 [HELIX-775] consolidate user content related apis for task driver HELIX-1315: consolidate user content related apis for task driver To consolidate the task driver's user content related APIs and the corresponding REST APIs, I'm deprecating the general getUserContent() API; instead, we now have the following APIs to get / add / update user content.

```java
public void addOrUpdateWorkflowUserContentMap(String workflowName, final Map contentToAddOrUpdate);
public void addOrUpdateJobUserContentMap(String workflowName, String jobName, final Map contentToAddOrUpdate);
public void addOrUpdateTaskUserContentMap(String workflowName, String jobName, String taskPartitionId, final Map contentToAddOrUpdate);
public Map getWorkflowUserContentMap(String workflowName);
public Map getJobUserContentMap(String workflowName, String jobName);
public Map getTaskUserContentMap(String workflowName, String jobName, String taskPartitionId);
```

The API for deleting user content is TBD but can use the same convention. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/task-user-content Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/283.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #283 commit b235c4ee5a82c5970d29e839317ea242813a58bc Author: Harry Zhang Date: 2018-10-04T18:25:08Z [HELIX-775] consolidate user content related apis for task driver > Task driver should support add/get task framework user content > -- > > Key: HELIX-775 > URL: https://issues.apache.org/jira/browse/HELIX-775 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Task driver should support add/get task framework user content at > workflow/job/task levels > > AC: > * finish implementation > * add tests -- This message was sent by Atlassian JIRA (v7.6.3#76005)
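As a rough illustration of the calling convention listed in the pull request above, here is a minimal in-memory sketch. The class `InMemoryUserContentStore`, the `"/"`-joined scope keys, the `Map<String, String>` type parameters, and the empty-map result for an unknown scope are all assumptions made for this example; it only mirrors the method names from the message and is not Helix's TaskDriver implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in used only to illustrate the calling convention;
// this is not Helix's TaskDriver implementation.
class InMemoryUserContentStore {
  // One content map per scope key: "workflow", "workflow/job" or "workflow/job/task".
  private final Map<String, Map<String, String>> contentByScope = new HashMap<>();

  public void addOrUpdateWorkflowUserContentMap(String workflowName,
      final Map<String, String> contentToAddOrUpdate) {
    merge(workflowName, contentToAddOrUpdate);
  }

  public void addOrUpdateJobUserContentMap(String workflowName, String jobName,
      final Map<String, String> contentToAddOrUpdate) {
    merge(workflowName + "/" + jobName, contentToAddOrUpdate);
  }

  public void addOrUpdateTaskUserContentMap(String workflowName, String jobName,
      String taskPartitionId, final Map<String, String> contentToAddOrUpdate) {
    merge(workflowName + "/" + jobName + "/" + taskPartitionId, contentToAddOrUpdate);
  }

  public Map<String, String> getWorkflowUserContentMap(String workflowName) {
    return read(workflowName);
  }

  public Map<String, String> getJobUserContentMap(String workflowName, String jobName) {
    return read(workflowName + "/" + jobName);
  }

  public Map<String, String> getTaskUserContentMap(String workflowName, String jobName,
      String taskPartitionId) {
    return read(workflowName + "/" + jobName + "/" + taskPartitionId);
  }

  // Adds new keys and overwrites existing ones, leaving other keys untouched.
  private void merge(String scope, Map<String, String> content) {
    contentByScope.computeIfAbsent(scope, k -> new HashMap<>()).putAll(content);
  }

  // Returns a read-only copy; an unknown scope yields an empty map in this sketch.
  private Map<String, String> read(String scope) {
    Map<String, String> content = contentByScope.get(scope);
    if (content == null) {
      return Collections.emptyMap();
    }
    return Collections.unmodifiableMap(new HashMap<>(content));
  }
}
```

Keeping one add-or-update method and one getter per level (workflow, job, task) means a delete API could follow the same three-level naming, as the message suggests.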
[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content
[ https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670714#comment-16670714 ] ASF GitHub Bot commented on HELIX-775: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/282 > Task driver should support add/get task framework user content > -- > > Key: HELIX-775 > URL: https://issues.apache.org/jira/browse/HELIX-775 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Task driver should support add/get task framework user content at > workflow/job/task levels > > AC: > * finish implementation > * add tests -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-775) Task driver should support add/get task framework user content
[ https://issues.apache.org/jira/browse/HELIX-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670711#comment-16670711 ] ASF GitHub Bot commented on HELIX-775: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/282 [HELIX-775] add task driver support for helix rest to add/get task framework user content consolidate user content related apis for task driver To consolidate the task driver's user content related APIs and the corresponding REST APIs, I'm deprecating the general getUserContent() API; instead, we now have the following APIs to get / add / update user content.

```java
public void addOrUpdateWorkflowUserContentMap(String workflowName, final Map contentToAddOrUpdate);
public void addOrUpdateJobUserContentMap(String workflowName, String jobName, final Map contentToAddOrUpdate);
public void addOrUpdateTaskUserContentMap(String workflowName, String jobName, String taskPartitionId, final Map contentToAddOrUpdate);
public Map getWorkflowUserContentMap(String workflowName);
public Map getJobUserContentMap(String workflowName, String jobName);
public Map getTaskUserContentMap(String workflowName, String jobName, String taskPartitionId);
```

The API for deleting user content is TBD but can use the same convention. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/task-user-content Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/282.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #282 commit 7ec5313bccb679014d6a0605ee5d7184063e555e Author: Harry Zhang Date: 2018-10-31T20:55:44Z [HELIX-775] add task driver support for helix rest to add/get task framework user content > Task driver should support add/get task framework user content > -- > > Key: HELIX-775 > URL: https://issues.apache.org/jira/browse/HELIX-775 > Project: Apache Helix > Issue Type: Task >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Task driver should support add/get task framework user content at > workflow/job/task levels > > AC: > * finish implementation > * add tests -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-773) Support getLastScheduledTaskTimestamp information in workflow rest api
[ https://issues.apache.org/jira/browse/HELIX-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670702#comment-16670702 ] ASF GitHub Bot commented on HELIX-773: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/281 > Support getLastScheduledTaskTimestamp information in workflow rest api > -- > > Key: HELIX-773 > URL: https://issues.apache.org/jira/browse/HELIX-773 > Project: Apache Helix > Issue Type: Bug >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Support getLastScheduledTaskTimestamp information in workflow rest api -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-772) Support TaskDriver.addUserContent() api
[ https://issues.apache.org/jira/browse/HELIX-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670698#comment-16670698 ] ASF GitHub Bot commented on HELIX-772: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/280 > Support TaskDriver.addUserContent() api > --- > > Key: HELIX-772 > URL: https://issues.apache.org/jira/browse/HELIX-772 > Project: Apache Helix > Issue Type: Bug >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Need to support add user content in task driver > > AC: > * implement API > * add test > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-773) Support getLastScheduledTaskTimestamp information in workflow rest api
[ https://issues.apache.org/jira/browse/HELIX-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669438#comment-16669438 ] ASF GitHub Bot commented on HELIX-773: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/281 [HELIX-773] add getLastScheduledTaskTimestamp information in workflow rest API - Added TaskExecutionInfo object to wrap task execution information - added TaskExecutionInfo to last scheduled task in workflow property in workflow rest API - Modified related tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/workflow-rest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/281.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #281 commit 917f6b7ee1b2b44b10eea7e5de7f07aa7f184618 Author: Harry Zhang Date: 2018-10-30T23:43:25Z [HELIX-773] add getLastScheduledTaskTimestamp information in workflow rest api > Support getLastScheduledTaskTimestamp information in workflow rest api > -- > > Key: HELIX-773 > URL: https://issues.apache.org/jira/browse/HELIX-773 > Project: Apache Helix > Issue Type: Bug >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Support getLastScheduledTaskTimestamp information in workflow rest api -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-772) Support TaskDriver.addUserContent() api
[ https://issues.apache.org/jira/browse/HELIX-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669436#comment-16669436 ] ASF GitHub Bot commented on HELIX-772: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/280 [HELIX-772] add TaskDriver.addUserContent() api and related tests Implemented TaskDriver.addUserContent() Added test (TestGetSetUserContentStore) for testing all getter/setter for user content Modified unstable TestIndependentTaskRebalancer You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/add-user-content Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/280.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #280 commit df24f5975bd517626490f14e6e038f8370ddd815 Author: Harry Zhang Date: 2018-10-30T23:25:12Z [HELIX-772] add TaskDriver.addUserContent() api and related tests > Support TaskDriver.addUserContent() api > --- > > Key: HELIX-772 > URL: https://issues.apache.org/jira/browse/HELIX-772 > Project: Apache Helix > Issue Type: Bug >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Need to support add user content in task driver > > AC: > * implement API > * add test > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-771) More detailed top state handoff metrics
[ https://issues.apache.org/jira/browse/HELIX-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669418#comment-16669418 ] ASF GitHub Bot commented on HELIX-771: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/278 > More detailed top state handoff metrics > --- > > Key: HELIX-771 > URL: https://issues.apache.org/jira/browse/HELIX-771 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > To define top state handoff SLA, we need some more detailed data: > * graceful top state handoff (i.e. disable instance / resource / etc, both > Helix and e2e latency) > * abrupt top state handoff (i.e. node crash) > AC: > - prepare metrics, test, code complete -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-771) More detailed top state handoff metrics
[ https://issues.apache.org/jira/browse/HELIX-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669413#comment-16669413 ] ASF GitHub Bot commented on HELIX-771: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/278 [HELIX-771] More detailed top state handoff metrics Added more details about top state handoff to distinguish Helix latency from user latency. We define 2 types of handoff: - Graceful handoff (controlled top state handoff, e.g. disable instance, load balance, etc.) - Non-graceful handoff (uncontrolled top state handoff, e.g. node crash, etc.) For graceful handoff, we record total handoff latency and user latency. For non-graceful handoff, we record total handoff latency only. Moved top state handoff metrics to an independent stage to make the logic cleaner. Refactored TestTopStateHandoffmetrics to make it cleaner and use JSON more natively. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/topstate-metrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/278.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #278 commit 7e49f995e29ea200fcc42ce6af148ed521979f5c Author: Harry Zhang Date: 2018-10-30T22:55:20Z [HELIX-771] More detailed top state handoff metrics > More detailed top state handoff metrics > --- > > Key: HELIX-771 > URL: https://issues.apache.org/jira/browse/HELIX-771 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > To define top state handoff SLA, we need some more detailed data: > * graceful top state handoff (i.e. disable instance / resource / etc, both > Helix and e2e latency) > * abrupt top state handoff (i.e. node crash) > AC: > - prepare metrics, test, code complete -- This message was sent by Atlassian JIRA (v7.6.3#76005)
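The recording rule described in the pull request above (graceful handoffs record total and user latency, non-graceful handoffs record total latency only) can be sketched roughly as follows. The class `TopStateHandoffRecorder`, its enum, and its fields are hypothetical names chosen for this example and do not come from the Helix code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recorder; names and storage are assumptions for illustration only.
class TopStateHandoffRecorder {
  enum HandoffType { GRACEFUL, NON_GRACEFUL }

  final List<Long> gracefulTotalLatenciesMs = new ArrayList<>();
  final List<Long> gracefulUserLatenciesMs = new ArrayList<>();
  final List<Long> nonGracefulTotalLatenciesMs = new ArrayList<>();

  // userLatencyMs is ignored for non-graceful handoffs, matching the rule in the message.
  void record(HandoffType type, long totalLatencyMs, long userLatencyMs) {
    if (type == HandoffType.GRACEFUL) {
      gracefulTotalLatenciesMs.add(totalLatencyMs);
      gracefulUserLatenciesMs.add(userLatencyMs);
    } else {
      nonGracefulTotalLatenciesMs.add(totalLatencyMs);
    }
  }
}
```

Keeping the two handoff types in separate series lets an SLA be defined per type instead of over a mixed distribution.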
[jira] [Commented] (HELIX-770) HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage
[ https://issues.apache.org/jira/browse/HELIX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667948#comment-16667948 ] ASF GitHub Bot commented on HELIX-770: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/277 > HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage > -- > > Key: HELIX-770 > URL: https://issues.apache.org/jira/browse/HELIX-770 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Assignee: Hunter L >Priority: Major > > In isLoadBalanceDownwardForAllReplicas() in IntermediateStateCalcStage, > statePriorityMap was throwing a NPE because the partition contained a replica > in ERROR state, and the map did not have an entry for it. To amend the issue, > Venice added the ERROR state in the state model with a priority, and Helix > added checks to prevent NPEs. Changelist: 1. Add containsKey checks in > isLoadBalanceDownwardForAllReplicas() 2. Make the Controller correctly log > all partitions with ERROR state replicas 3. Add HelixDefinedStates in > statePriorityList if not already added -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-770) HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage
[ https://issues.apache.org/jira/browse/HELIX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667902#comment-16667902 ] ASF GitHub Bot commented on HELIX-770: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/277 [HELIX-770] HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage In isLoadBalanceDownwardForAllReplicas() in IntermediateStateCalcStage, statePriorityMap was throwing a NPE because the partition contained a replica in ERROR state, and the map did not have an entry for it. To amend the issue, Venice added the ERROR state in the state model with a priority, and Helix added checks to prevent NPEs. Changelist: 1. Add containsKey checks in isLoadBalanceDownwardForAllReplicas() 2. Make the Controller correctly log all partitions with ERROR state replicas 3. Add HelixDefinedStates in statePriorityList if not already added You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/277.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #277 commit 7bc70e24abd89611580098670ed02b2736ccfac0 Author: Hunter Lee Date: 2018-10-29T23:50:41Z [HELIX-770] HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage In isLoadBalanceDownwardForAllReplicas() in IntermediateStateCalcStage, statePriorityMap was throwing a NPE because the partition contained a replica in ERROR state, and the map did not have an entry for it. To amend the issue, Venice added the ERROR state in the state model with a priority, and Helix added checks to prevent NPEs. Changelist: 1. Add containsKey checks in isLoadBalanceDownwardForAllReplicas() 2. Make the Controller correctly log all partitions with ERROR state replicas 3. Add HelixDefinedStates in statePriorityList if not already added > HELIX: Fix a possible NPE in loadBalance in IntermediateStateCalcStage > -- > > Key: HELIX-770 > URL: https://issues.apache.org/jira/browse/HELIX-770 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Assignee: Hunter L >Priority: Major > > In isLoadBalanceDownwardForAllReplicas() in IntermediateStateCalcStage, > statePriorityMap was throwing a NPE because the partition contained a replica > in ERROR state, and the map did not have an entry for it. To amend the issue, > Venice added the ERROR state in the state model with a priority, and Helix > added checks to prevent NPEs. Changelist: 1. Add containsKey checks in > isLoadBalanceDownwardForAllReplicas() 2. Make the Controller correctly log > all partitions with ERROR state replicas 3. Add HelixDefinedStates in > statePriorityList if not already added -- This message was sent by Atlassian JIRA (v7.6.3#76005)
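The containsKey guard from changelist item 1 above can be sketched as below. This is a simplified stand-in, not the body of the Helix method: the class name, the parameter shapes, the choice to skip replicas whose state has no priority entry, and the convention that a smaller number means a higher priority are all assumptions for the example.

```java
import java.util.Map;

// Simplified, hypothetical version of a "downward for all replicas" check.
class LoadBalanceDirectionCheck {
  static boolean isDownwardForAllReplicas(Map<String, String> currentStates,
      Map<String, String> targetStates, Map<String, Integer> statePriorityMap) {
    for (Map.Entry<String, String> entry : targetStates.entrySet()) {
      String current = currentStates.get(entry.getKey());
      String target = entry.getValue();
      // Guard: without it, unboxing statePriorityMap.get("ERROR") (null) would throw an NPE.
      if (current == null || target == null
          || !statePriorityMap.containsKey(current)
          || !statePriorityMap.containsKey(target)) {
        continue; // assumption: replicas in states without a priority are skipped
      }
      // Assumption: a smaller number is a higher priority, so a smaller target is an upward move.
      if (statePriorityMap.get(target) < statePriorityMap.get(current)) {
        return false;
      }
    }
    return true;
  }
}
```

With a priority map that lacks an ERROR entry, a replica currently in ERROR is skipped instead of triggering the exception described in the issue.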
[jira] [Commented] (HELIX-756) TASK: Change LOG mode from info to debug
[ https://issues.apache.org/jira/browse/HELIX-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665820#comment-16665820 ] ASF GitHub Bot commented on HELIX-756: -- Github user narendly closed the pull request at: https://github.com/apache/helix/pull/271 > TASK: Change LOG mode from info to debug > > > Key: HELIX-756 > URL: https://issues.apache.org/jira/browse/HELIX-756 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Assignee: Hunter L >Priority: Major > > In production, it was observed that some users were running thousands of > tasks, and since AssignableInstance leaves a line of log for each task > assigned or released, the amount of log that was being generated was too > much, and it was too verbose. > Changelist: > 1. Change the logging mode from info to debug in AssignableInstance and > AssignableInstanceManager -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-753) Record top state handoff finished in single cluster data cache refresh
[ https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664398#comment-16664398 ] ASF GitHub Bot commented on HELIX-753: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/270 > Record top state handoff finished in single cluster data cache refresh > -- > > Key: HELIX-753 > URL: https://issues.apache.org/jira/browse/HELIX-753 > Project: Apache Helix > Issue Type: Bug >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Currently we are calculating top state handoff duration by doing the > following: > - record missing top state when we see a top state missing > - record the top state coming back when we see it come back > - report top state handoff duration > This is perfectly fine for non-P2P state transitions, as the entire top state > handoff process will always take >= 2 pipeline runs. However, for P2P > enabled clusters, top state handoffs are quick, and if a handoff is quicker than > the cluster data refresh stage latency, we will lose a lot of short top state > handoffs, which makes the numbers look miserable on ingraph. > We need to revise the top state handoff metrics implementation so we don't lose > data points statistically (i.e. we are losing all short handoffs now). > AC: > - revise impl so we catch those short top state hand-offs > - write new tests to catch the fix if needed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
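The three-step calculation described in the issue above (record missing, record comeback, report duration) can be sketched as follows. `TopStateHandoffTracker` and its methods are hypothetical names for this example; it shows only the existing approach the issue describes, not the revised implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker illustrating the "missing -> comeback -> duration" approach.
class TopStateHandoffTracker {
  private final Map<String, Long> missingSinceMs = new HashMap<>();

  // Step 1: remember when the top state was first observed missing.
  void onTopStateMissing(String partition, long nowMs) {
    missingSinceMs.putIfAbsent(partition, nowMs);
  }

  // Steps 2 and 3: on comeback, return the handoff duration,
  // or -1 if the missing state was never observed for this partition.
  long onTopStateRecovered(String partition, long nowMs) {
    Long start = missingSinceMs.remove(partition);
    return start == null ? -1L : nowMs - start;
  }
}
```

The `-1` branch corresponds to the gap the issue points out: a handoff that completes between two observations never produces a "missing" record, so no duration can be reported for it.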
[jira] [Commented] (HELIX-756) TASK: Change LOG mode from info to debug
[ https://issues.apache.org/jira/browse/HELIX-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627683#comment-16627683 ] ASF GitHub Bot commented on HELIX-756: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/271 [HELIX-756] TASK: Change LOG mode from info to debug You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/271.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #271 commit 5140db0c50439c115d0c7d2637f7ad723f6f147a Author: Hunter Lee Date: 2018-09-25T17:19:31Z [HELIX-754] TASK: Fix LiveInstanceCurrentState change flag Previously, existsLiveInstanceOrCurrentStateChange was getting reset in ClusterDataCache when its getter was called. This was problematic because if there were multiple jobs or multiple workflows, whoever called the getter first would get the correct flag value, and subsequent callers would get false because the flag would already have been reset. This RB fixes that bug by resetting the flag right at the beginning of the refresh() call in ClusterDataCache, so that all callers during that pipeline run get the same, correct value. Changelist: 1. Change the getter so that it does not reset the flag; instead, reset the flag at the beginning of refresh() commit e9f6c98dc58cb7bb842ff1f41063174003277823 Author: Hunter Lee Date: 2018-09-25T17:22:36Z [HELIX-755] TASK: Build quota profile from scratch every rebalance It has been reported that instances have a full quota despite no tasks existing in their CURRENTSTATES. The cause of this is not clear, so making ClusterDataCache trigger a refresh of all AssignableInstances will ensure that there aren't situations where it looks like there has been a thread leak. 
Optimizations will be implemented if necessary. Changelist: 1. Make AssignableInstanceManager build all AssignableInstances from scratch every rebalance commit 4aaa00727fe34b9cdfde2978eeb8a892dcf29add Author: Hunter Lee Date: 2018-09-25T17:25:39Z [HELIX-756] TASK: Change LOG mode from info to debug In production, it was observed that some users were running thousands of tasks, and since AssignableInstance leaves a line of log for each task assigned or released, the amount of log that was being generated was too much, and it was too verbose. Changelist: 1. Change the logging mode from info to debug in AssignableInstance and AssignableInstanceManager > TASK: Change LOG mode from info to debug > > > Key: HELIX-756 > URL: https://issues.apache.org/jira/browse/HELIX-756 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Assignee: Hunter L >Priority: Major > > In production, it was observed that some users were running thousands of > tasks, and since AssignableInstance leaves a line of log for each task > assigned or released, the amount of log that was being generated was too > much, and it was too verbose. > Changelist: > 1. Change the logging mode from info to debug in AssignableInstance and > AssignableInstanceManager -- This message was sent by Atlassian JIRA (v7.6.3#76005)
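For readers following the HELIX-754 commit message above, here is a minimal sketch of the "reset at the beginning of refresh(), side-effect-free getter" pattern it describes. This is not the actual ClusterDataCache code: the class name `SketchDataCache` and the `changeDetected` parameter are assumptions made purely for illustration.

```java
// Hypothetical stand-in for the cache discussed in HELIX-754; NOT the real ClusterDataCache.
class SketchDataCache {
  private boolean existsLiveInstanceOrCurrentStateChange = false;

  // changeDetected is an assumed input standing in for whatever the real refresh observes.
  synchronized void refresh(boolean changeDetected) {
    // Reset once, at the very beginning of the refresh, instead of in the getter.
    existsLiveInstanceOrCurrentStateChange = false;
    if (changeDetected) {
      existsLiveInstanceOrCurrentStateChange = true;
    }
  }

  // The getter no longer mutates the flag, so every caller in the same
  // pipeline run observes the same value.
  synchronized boolean getExistsLiveInstanceOrCurrentStateChange() {
    return existsLiveInstanceOrCurrentStateChange;
  }
}
```

With this shape, two jobs reading the flag during the same pipeline run both see `true` after a refresh that detected a change, and both see `false` after the next refresh that detected none.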
[jira] [Commented] (HELIX-753) Record top state handoff finished in single cluster data cache refresh
[ https://issues.apache.org/jira/browse/HELIX-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624211#comment-16624211 ] ASF GitHub Bot commented on HELIX-753: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/270 [HELIX-753] Record top state handoff finished in single cluster data cache refresh This PR adds top state handoff reporting when a single pipeline refresh catches the entire handoff process, which we missed before. Here is the rough procedure: - retrieve cached last top state instance for a partition - retrieve current top state instance for a partition - if there is no missing top state record of that partition, and top state instance changed, we record the number The end time of the current top state is easy to find from the current state in the cluster data cache. For the handoff start time, if we cannot find it, we use the last pipeline run's end time as a best guess. The detailed reasoning is explained in a code comment. Added a test case to verify such top state handoffs, and consolidated the common parts in TestTopStateHandoffMetrics to avoid code duplication You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/topstate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/270.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #270 commit d501e8fa30596d9cd98078f0d1ce7c1ecf20c595 Author: Harry Zhang Date: 2018-09-21T21:32:15Z [HELIX-753] Record top state handoff finished in single cluster data cache refresh > Record top state handoff finished in single cluster data cache refresh > -- > > Key: HELIX-753 > URL: https://issues.apache.org/jira/browse/HELIX-753 > Project: Apache Helix > Issue Type: Bug >Reporter: Harry Zhang >Assignee: Harry Zhang >Priority: Major > > Currently we are calculating top state handoff 
duration by doing the > following: > - record missing top state when we see a top state missing > - record top state come back when we see it come back > - report top state handoff duration > This is perfectly fine for non-P2P state transitions, as the entire top state > handoff process always spans >= 2 pipeline runs. However, for P2P > enabled clusters, top state handoffs are quick, and if one is quicker than the > cluster data refresh stage latency, we will lose a lot of short top state > handoffs, which makes the numbers look miserable on ingraph. > We need to revise the top state handoff metrics implementation so we don't lose > data points statistically (i.e. we are losing all short handoffs now). > AC: > - revise impl so we catch those short top state hand-offs > - write new tests to cover the fix if needed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
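The rough procedure in the pull request above (compare the cached last top state instance with the current one, and fall back to the last pipeline run's end time when the handoff start time is unknown) can be sketched roughly as follows. This is only an illustration under stated assumptions: `HandoffSketch`, its method names, the `NO_HANDOFF` sentinel, and the convention that a non-positive start time means "unknown" are all hypothetical and are not taken from the Helix code base.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of detecting a handoff that completed within one cache refresh.
class HandoffSketch {
  static final long NO_HANDOFF = -1L; // assumed sentinel: nothing to report

  private final Map<String, String> lastTopStateInstance = new HashMap<>();
  private final Map<String, Long> missingTopStateSince = new HashMap<>();

  // Models the existing path: a top state was observed missing at the given time.
  void recordMissingTopState(String partition, long time) {
    missingTopStateSince.put(partition, time);
  }

  // handoffStartTime <= 0 is an assumed convention for "start time unknown".
  long observe(String partition, String currentTopInstance, long topStateEndTime,
      long handoffStartTime, long lastPipelineEndTime) {
    if (currentTopInstance == null) {
      return NO_HANDOFF; // no top state right now; the missing-top-state path handles this
    }
    String previous = lastTopStateInstance.put(partition, currentTopInstance);
    boolean hasMissingRecord = missingTopStateSince.containsKey(partition);
    if (previous == null || hasMissingRecord || previous.equals(currentTopInstance)) {
      return NO_HANDOFF;
    }
    // Top state moved between two refreshes without ever being seen as missing.
    long start = handoffStartTime > 0 ? handoffStartTime : lastPipelineEndTime;
    return topStateEndTime - start;
  }
}
```

Using the previous pipeline's end time as the fallback start is a best guess, as the pull request itself notes, so durations computed on that path are estimates rather than exact measurements.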
[jira] [Commented] (HELIX-751) TASK: Fix AssignableInstanceComparator so that it sorts unsupported quota types
[ https://issues.apache.org/jira/browse/HELIX-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624157#comment-16624157 ] ASF GitHub Bot commented on HELIX-751: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/268 > TASK: Fix AssignableInstanceComparator so that it sorts unsupported quota > types > --- > > Key: HELIX-751 > URL: https://issues.apache.org/jira/browse/HELIX-751 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Assignee: Hunter L >Priority: Major > > Currently, if the quota type does not exist, the comparator does not sort > AssignableInstances based on availability. This does not cause immediate > problems, but it would be nice to have them sorted because we now allow > unsupported quota types to run as the DEFAULT type. > Changelist: > 1. Comparator sorts AssignableInstances in a PriorityQueue by DEFAULT type's > availability when the quota type given is unsupported -- This message was sent by Atlassian JIRA (v7.6.3#76005)
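As an illustration of the changelist item above, a comparator that falls back to the DEFAULT quota type might look like the sketch below. The classes `SketchInstance` and `SketchComparator`, the per-type availability map, and the "most available first" ordering are all assumptions for this example; the real AssignableInstanceComparator may differ.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an assignable instance: quota type -> free thread count.
class SketchInstance {
  final String name;
  final Map<String, Integer> availableThreads = new HashMap<>();

  SketchInstance(String name) {
    this.name = name;
  }
}

// Orders instances by availability for the requested quota type, falling back
// to the DEFAULT type when the requested type is not supported by an instance.
class SketchComparator implements Comparator<SketchInstance> {
  static final String DEFAULT_TYPE = "DEFAULT"; // assumed name of the default type

  private final String quotaType;

  SketchComparator(String quotaType) {
    this.quotaType = quotaType;
  }

  private int availability(SketchInstance instance) {
    Integer free = instance.availableThreads.get(quotaType);
    if (free == null) {
      free = instance.availableThreads.get(DEFAULT_TYPE); // unsupported type: use DEFAULT
    }
    return free == null ? 0 : free;
  }

  @Override
  public int compare(SketchInstance a, SketchInstance b) {
    // Assumption: the instance with the most free threads should come out first.
    return Integer.compare(availability(b), availability(a));
  }
}
```

Passing `new SketchComparator("SOME_UNKNOWN_TYPE")` to a `java.util.PriorityQueue` would then poll instances in order of their DEFAULT availability instead of leaving them unsorted.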
[jira] [Commented] (HELIX-752) Add missing shutdown for RoutingTableProvider
[ https://issues.apache.org/jira/browse/HELIX-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620953#comment-16620953 ] ASF GitHub Bot commented on HELIX-752: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/269 [HELIX-752] Add missing shutdown for RoutingTableProvider Changelist: 1. Add a missing shutdown() call to avoid having a background thread keep printing out error messages You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix e8e5770 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/269.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #269 commit 355e4d4e1e678d13793f39d044a51e7be76abed8 Author: Hunter Lee Date: 2018-09-19T17:50:48Z [HELIX-752] Add missing shutdown for RoutingTableProvider Changelist: 1. Add a missing shutdown() call to avoid having a background thread keep printing out error messages > Add missing shutdown for RoutingTableProvider > - > > Key: HELIX-752 > URL: https://issues.apache.org/jira/browse/HELIX-752 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Assignee: Hunter L >Priority: Major > > Changelist: > 1. Add a missing shutdown() call to avoid having a background thread keep > printing out error messages -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-741) Revise unreliable behavior in swapInstance
[ https://issues.apache.org/jira/browse/HELIX-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553439#comment-16553439 ] ASF GitHub Bot commented on HELIX-741: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/258 > Revise unreliable behavior in swapInstance > -- > > Key: HELIX-741 > URL: https://issues.apache.org/jira/browse/HELIX-741 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > The swapInstance call did not work properly when we were trying to fix a > production issue. > > The API was old and not actively maintained. It used a deprecated underlying > data accessor API and hit a problem of partial ZK reads. Thus our CLI was > unable to update all IdealStates as expected. > We have seen this problem before, especially when the cluster is big and > there is a lot of data to read back. > This ticket is created to refactor the implementation of swapInstance() to > make it more robust, and a separate ticket will be created to revise those old > API calls that are not frequently used or not actively maintained. > – > AC: > - make this api call reliable and idempotent -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-741) Revise unreliable behavior in swapInstance
[ https://issues.apache.org/jira/browse/HELIX-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547270#comment-16547270 ] ASF GitHub Bot commented on HELIX-741: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/258 [HELIX-741] make swap instance more robust and idempotent Made swap instance more robust: 1. List ideal state names and read each ideal state individually to avoid partial reads 2. remove redundant logic that tests the old instance's status 3. make it idempotent 4. added test cases You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/helix-admin Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/258.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #258 commit 24c52394dfff91c045367260c969f76560ebeb62 Author: Harry Zhang Date: 2018-07-18T01:21:48Z [HELIX-741] make swap instance more robust and idempotent > Revise unreliable behavior in swapInstance > -- > > Key: HELIX-741 > URL: https://issues.apache.org/jira/browse/HELIX-741 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > The swapInstance call did not work properly when we were trying to fix a > production issue. > > The API was old and not actively maintained. It used a deprecated underlying > data accessor API and hit a problem of partial ZK reads. Thus our CLI was > unable to update all IdealStates as expected. > We have seen this problem before, especially when the cluster is big and > there is a lot of data to read back. > This ticket is created to refactor the implementation of swapInstance() to > make it more robust, and a separate ticket will be created to revise those old > API calls that are not frequently used or not actively maintained. 
> – > AC: > - make this api call reliable and idempotent -- This message was sent by Atlassian JIRA (v7.6.3#76005)
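To make the "list names first, read each ideal state individually, and be idempotent" idea from the HELIX-741 pull request concrete, here is a small in-memory sketch. It deliberately does not use any Helix or ZooKeeper API: `SwapSketch`, the map of resource name to instance list, and the returned update count are hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model: resource name -> list of instance names in its ideal state.
class SwapSketch {
  // Returns how many resources were rewritten; a second identical call returns 0.
  static int swapInstance(Map<String, List<String>> idealStates, String oldInstance,
      String newInstance) {
    int updated = 0;
    // Step 1: list the resource names up front.
    List<String> resourceNames = new ArrayList<>(idealStates.keySet());
    for (String resource : resourceNames) {
      // Step 2: read each ideal state individually rather than in one bulk read.
      List<String> current = idealStates.get(resource);
      if (current == null || !current.contains(oldInstance)) {
        continue; // nothing to do: already swapped, or the resource never used oldInstance
      }
      List<String> replaced = new ArrayList<>(current);
      replaced.replaceAll(name -> name.equals(oldInstance) ? newInstance : name);
      idealStates.put(resource, replaced);
      updated++;
    }
    return updated;
  }
}
```

Because resources that no longer reference the old instance are skipped, re-running the swap after a partial failure only touches what is still outstanding, which is the sense of "idempotent" assumed here.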
[jira] [Commented] (HELIX-740) ZkHelixAdmin:NPE
[ https://issues.apache.org/jira/browse/HELIX-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547268#comment-16547268 ] ASF GitHub Bot commented on HELIX-740: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/257 > ZkHelixAdmin:NPE > > > Key: HELIX-740 > URL: https://issues.apache.org/jira/browse/HELIX-740 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > The NPE occurs in this line: > [https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java#L669] > Basically, we ended up in a situation where we had an instance whose config > was deleted. The line above should handle this more gracefully; we need more > meaningful error information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-740) ZkHelixAdmin:NPE
[ https://issues.apache.org/jira/browse/HELIX-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547184#comment-16547184 ] ASF GitHub Bot commented on HELIX-740: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/257 [HELIX-740] check NPE in getInstancesInClusterWithTag and throw more meaningful exception Added cluster config check in `getInstancesInClusterWithTag()`, which now throws IllegalStateException when instance config is missing You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/helix-admin Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #257 commit f4bb7d60782150c7d713c907211cc9d41f002c48 Author: Harry Zhang Date: 2018-07-17T22:50:02Z [HELIX-740] check NPE in getInstancesInClusterWithTag and throw more meaningful exception > ZkHelixAdmin:NPE > > > Key: HELIX-740 > URL: https://issues.apache.org/jira/browse/HELIX-740 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > The NPE occurs in this line: > [https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java#L669] > Basically, we ended up in a situation where we had an instance whose config > was deleted. The line above should handle this more gracefully; we need more > meaningful error information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
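The fix described above, replacing an incidental NullPointerException with an explicit IllegalStateException when an instance's config is missing, follows a common guard pattern. The sketch below shows that pattern on plain collections; `TagLookupSketch`, its parameters, and the error message wording are assumptions and do not reproduce the actual `getInstancesInClusterWithTag()` implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: instance name -> set of tags from its config (absent if config was deleted).
class TagLookupSketch {
  static List<String> instancesWithTag(List<String> instanceNames,
      Map<String, Set<String>> tagsByInstance, String tag) {
    List<String> result = new ArrayList<>();
    for (String name : instanceNames) {
      Set<String> tags = tagsByInstance.get(name);
      if (tags == null) {
        // Fail with a message that names the broken instance instead of a bare NPE.
        throw new IllegalStateException(
            String.format("Instance %s is listed but has no config", name));
      }
      if (tags.contains(tag)) {
        result.add(name);
      }
    }
    return result;
  }
}
```

The caller still gets an exception, but the message identifies which instance is in a bad state, which is the "more meaningful error information" the ticket asks for.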
[jira] [Commented] (HELIX-732) [TASK] Expose UserContentStore in TaskDriver
[ https://issues.apache.org/jira/browse/HELIX-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545781#comment-16545781 ] ASF GitHub Bot commented on HELIX-732: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/246 > [TASK] Expose UserContentStore in TaskDriver > > > Key: HELIX-732 > URL: https://issues.apache.org/jira/browse/HELIX-732 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > There was a user request for this feature. The intended use is to allow for > aggregation work reading from temporary data written by tasks, by allowing a > get() of UserContentStore at the TaskDriver level. UserContentStore is a > potentially useful feature that is currently under-utilized - this will > enable Gobblin and other users of Task Framework to better utilize > UserContentStore. > Changelist: > 1. Add getUserContentStore() in TaskDriver > 2. Add TestUserContentStore, an integration test for this feature > 3. Add descriptive JavaDoc warning the user that get() and put() methods for > UserContentStore are not thread-safe -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-732) [TASK] Expose UserContentStore in TaskDriver
[ https://issues.apache.org/jira/browse/HELIX-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545641#comment-16545641 ] ASF GitHub Bot commented on HELIX-732: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/246 [HELIX-732] Expose UserContentStore in TaskDriver There was a user request for this feature. The intended use is to allow for aggregation work reading from temporary data written by tasks, by allowing a get() of UserContentStore at the TaskDriver level. UserContentStore is a potentially useful feature that is currently under-utilized - this will enable Gobblin and other users of Task Framework to better utilize UserContentStore. Changelist: 1. Add getUserContentStore() in TaskDriver 2. Add TestUserContentStore, an integration test for this feature 3. Add descriptive JavaDoc warning the user that get() and put() methods for UserContentStore are not thread-safe You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/246.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #246 commit cdf91ecdc63b80254b6775f7f12f98440081473a Author: Hunter Lee Date: 2018-07-16T19:18:03Z [HELIX-732] Expose UserContentStore in TaskDriver There was a user request for this feature. The intended use is to allow for aggregation work reading from temporary data written by tasks, by allowing a get() of UserContentStore at the TaskDriver level. UserContentStore is a potentially useful feature that is currently under-utilized - this will enable Gobblin and other users of Task Framework to better utilize UserContentStore. Changelist: 1. Add getUserContentStore() in TaskDriver 2. Add TestUserContentStore, an integration test for this feature 3. 
Add descriptive JavaDoc warning the user that get() and put() methods for UserContentStore are not thread-safe > [TASK] Expose UserContentStore in TaskDriver > > > Key: HELIX-732 > URL: https://issues.apache.org/jira/browse/HELIX-732 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > There was a user request for this feature. The intended use is to allow for > aggregation work reading from temporary data written by tasks, by allowing a > get() of UserContentStore at the TaskDriver level. UserContentStore is a > potentially useful feature that is currently under-utilized - this will > enable Gobblin and other users of Task Framework to better utilize > UserContentStore. > Changelist: > 1. Add getUserContentStore() in TaskDriver > 2. Add TestUserContentStore, an integration test for this feature > 3. Add descriptive JavaDoc warning the user that get() and put() methods for > UserContentStore are not thread-safe -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-719) [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle
[ https://issues.apache.org/jira/browse/HELIX-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539236#comment-16539236 ] ASF GitHub Bot commented on HELIX-719: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/226 > [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle > -- > > Key: HELIX-719 > URL: https://issues.apache.org/jira/browse/HELIX-719 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > TestPartitionMovementThrottle was failing after the improvement was made in > IntermediateCalcStage so that downward load balance will take place while > recovery balance is happening. In the process of fixing the test, 1. It was > verified by hand that downward load balance is being correctly throttled as > defined by the user in StateTransitionThrottleConfig. 2. An appropriate > parameter adjustment was made to account for both recovery and load balance > happening in the same pipeline iteration. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-726) Add new monitor metrics for state transitions.
[ https://issues.apache.org/jira/browse/HELIX-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539070#comment-16539070 ] ASF GitHub Bot commented on HELIX-726: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/233 > Add new monitor metrics for state transitions. > -- > > Key: HELIX-726 > URL: https://issues.apache.org/jira/browse/HELIX-726 > Project: Apache Helix > Issue Type: Bug >Reporter: Jiajun Wang >Assignee: Jiajun Wang >Priority: Major > > ClusterStatus: MissingMinActiveReplicaPartitionGauge > ClusterStatus: TotalResourceGauge > ClusterStatus/ResourceStatus: PendingStateTransitionsGauge > ClusterStatus: StateTransitionsCounter -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-726) Add new monitor metrics for state transitions.
[ https://issues.apache.org/jira/browse/HELIX-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539064#comment-16539064 ] ASF GitHub Bot commented on HELIX-726: -- GitHub user jiajunwang opened a pull request: https://github.com/apache/helix/pull/233 [HELIX-726][HELIX-727] Helix monitor metrics improvement Contains 2 changes that improve Helix monitoring. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jiajunwang/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/233.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #233 commit 505524cf0484dfe83433862e8f59345899d2 Author: Jiajun Wang Date: 2018-05-26T06:30:11Z Add new monitor metrics for state transitions. ClusterStatus: MissingMinActiveReplicaPartitionGauge ClusterStatus: TotalResourceGauge ClusterStatus/ResourceStatus: PendingStateTransitionsGauge ClusterStatus: StateTransitionsCounter commit 93eea714d2a9607c1808a342957e59698a96f543 Author: Jiajun Wang Date: 2018-05-30T00:09:28Z Fix resource monitor race condition. The async monitor processing may cause resource mbean deleting failure. This will leave unnecessary mbean in the mbean server. > Add new monitor metrics for state transitions. > -- > > Key: HELIX-726 > URL: https://issues.apache.org/jira/browse/HELIX-726 > Project: Apache Helix > Issue Type: Bug >Reporter: Jiajun Wang >Assignee: Jiajun Wang >Priority: Major > > ClusterStatus: MissingMinActiveReplicaPartitionGauge > ClusterStatus: TotalResourceGauge > ClusterStatus/ResourceStatus: PendingStateTransitionsGauge > ClusterStatus: StateTransitionsCounter -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-719) [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle
[ https://issues.apache.org/jira/browse/HELIX-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537806#comment-16537806 ] ASF GitHub Bot commented on HELIX-719: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/226 [HELIX-719] [HELIX] Verify downward load balance and fix TestPartitio… …nMovementThrottle TestPartitionMovementThrottle was failing after the improvement was made in IntermediateCalcStage so that downward load balance will take place while recovery balance is happening. In the process of fixing the test, 1. It was verified by hand that downward load balance is being correctly throttled as defined by the user in StateTransitionThrottleConfig. 2. An appropriate parameter adjustment was made to account for both recovery and load balance happening in the same pipeline iteration. You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix d Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/226.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #226 commit 13c39dac7aca7920f75d62a6c19f03c319d8e6cf Author: Hunter Lee Date: 2018-07-09T23:56:08Z [HELIX-719] [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle TestPartitionMovementThrottle was failing after the improvement was made in IntermediateCalcStage so that downward load balance will take place while recovery balance is happening. In the process of fixing the test, 1. It was verified by hand that downward load balance is being correctly throttled as defined by the user in StateTransitionThrottleConfig. 2. An appropriate parameter adjustment was made to account for both recovery and load balance happening in the same pipeline iteration. 
> [HELIX] Verify downward load balance and fix TestPartitionMovementThrottle > -- > > Key: HELIX-719 > URL: https://issues.apache.org/jira/browse/HELIX-719 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > TestPartitionMovementThrottle was failing after the improvement was made in > IntermediateCalcStage so that downward load balance will take place while > recovery balance is happening. In the process of fixing the test, 1. It was > verified by hand that downward load balance is being correctly throttled as > defined by the user in StateTransitionThrottleConfig. 2. An appropriate > parameter adjustment was made to account for both recovery and load balance > happening in the same pipeline iteration. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537787#comment-16537787 ] ASF GitHub Bot commented on HELIX-718: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/224 > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537784#comment-16537784 ] ASF GitHub Bot commented on HELIX-718: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/223 > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537761#comment-16537761 ] ASF GitHub Bot commented on HELIX-718: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/224 [HELIX-718] implement ThreadCountBasedTaskAssigner In this RB, I implemented a thread count based task assigner that is optimized for short-term use cases. It assumes: - All tasks to assign have the same quota type - All tasks to assign require only 1 thread The algorithm makes a best effort to spread out tasks with the same type / same job, e.g.: - if there are 3 nodes, each has 10 threads for each quota type A, B, and C - node1 is empty, node2 and node3 each has 5 typeB tasks and 5 typeC tasks running => when 3 typeA tasks are to be assigned, it will assign 1 typeA task to each node rather than squeeze all 3 typeA tasks onto node1. Added tests for the assigner. Below are the profiling results; each result is the average of 100 trials: Assign 50K tasks onto 1K nodes: testing batch size: 1 Average time: 118ms testing batch size: 5000 Average time: 114ms testing batch size: 2000 Average time: 117ms testing batch size: 1000 Average time: 119ms testing batch size: 500 Average time: 123ms testing batch size: 100 Average time: 182ms Assign 10K tasks onto 1K nodes: testing batch size: 1 Average time: 25ms testing batch size: 5000 Average time: 21ms testing batch size: 2000 Average time: 22ms testing batch size: 1000 Average time: 25ms testing batch size: 500 Average time: 22ms testing batch size: 100 Average time: 34ms You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/simple-assigner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/224.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This 
closes #224 commit 6cb574d5aea6ca9cb9e6b5184bc80cb5e05d53b8 Author: Harry Zhang Date: 2018-07-09T23:04:19Z [HELIX-718] implement ThreadCountBasedTaskAssigner > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
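The spreading behaviour described in the HELIX-718 pull request (three typeA tasks land one per node instead of all on the empty node) can be reproduced with a simple greedy loop over a max-heap keyed on free threads for the quota type being assigned. The sketch below is one possible way to get that outcome and is not claimed to be how ThreadCountBasedTaskAssigner is implemented; `SpreadAssignerSketch`, its `Node` holder, and the name-based tie-break are assumptions.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical greedy assigner for single-thread tasks of one quota type.
class SpreadAssignerSketch {
  private static final class Node {
    final String name;
    int free; // free threads for the quota type being assigned

    Node(String name, int free) {
      this.name = name;
      this.free = free;
    }
  }

  // Returns node name -> number of tasks assigned to it.
  static Map<String, Integer> assign(Map<String, Integer> freeThreadsByNode, int taskCount) {
    // Most free threads first; ties broken by name so the result is deterministic.
    PriorityQueue<Node> heap = new PriorityQueue<>(
        Comparator.comparingInt((Node n) -> -n.free).thenComparing(n -> n.name));
    for (Map.Entry<String, Integer> entry : freeThreadsByNode.entrySet()) {
      heap.add(new Node(entry.getKey(), entry.getValue()));
    }
    Map<String, Integer> assigned = new LinkedHashMap<>();
    for (int i = 0; i < taskCount; i++) {
      Node best = heap.poll();
      if (best == null || best.free <= 0) {
        break; // no capacity left anywhere for this quota type
      }
      best.free--; // each task is assumed to need exactly one thread
      assigned.merge(best.name, 1, Integer::sum);
      heap.add(best); // re-insert with its reduced capacity
    }
    return assigned;
  }
}
```

In the pull request's example every node still has 10 free typeA threads, so after the first task goes to one node its capacity drops to 9 and the next task goes to a different node, giving one task per node.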
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537755#comment-16537755 ] ASF GitHub Bot commented on HELIX-718: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/223 [HELIX-718] provide a method in AssignableInstance to set current assignment This is required when an assignable instance is initialized, it needs to recover its current states You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/assignable-instance Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/223.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #223 commit e44b29e03ef4c807e940cde717ed2f6fff58a273 Author: Harry Zhang Date: 2018-07-09T22:59:27Z [HELIX-718] provide a method in AssignableInstance to set current assignments > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537747#comment-16537747 ] ASF GitHub Bot commented on HELIX-718: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/222 > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537739#comment-16537739 ] ASF GitHub Bot commented on HELIX-718: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/222 [HELIX-718] implement AssignableInstance Implement AssignableInstance and related tests as a part of task assigner You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/assignable-instance Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/222.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #222 commit 2049f93abe8e56a754e4880a9157959ef24cd89e Author: Harry Zhang Date: 2018-07-09T22:49:33Z [HELIX-718] implement AssignableInstance > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537706#comment-16537706 ] ASF GitHub Bot commented on HELIX-718: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/220 > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-712) Backward compatibility of the rebalance algorithm
[ https://issues.apache.org/jira/browse/HELIX-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537697#comment-16537697 ] ASF GitHub Bot commented on HELIX-712: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/212 > Backward compatibility of the rebalance algorithm > - > > Key: HELIX-712 > URL: https://issues.apache.org/jira/browse/HELIX-712 > Project: Apache Helix > Issue Type: Bug >Reporter: Jiajun Wang >Assignee: Jiajun Wang >Priority: Major > > For keeping CRUSHed stable, we need to split the logic changes made for > constraint based rebalance strategy. Otherwise, some improvement will change > the original assignment. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-717) Add api for get / set quota type, ratio and participant capacity
[ https://issues.apache.org/jira/browse/HELIX-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537690#comment-16537690 ] ASF GitHub Bot commented on HELIX-717: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/219 > Add api for get / set quota type, ratio and participant capacity > > > Key: HELIX-717 > URL: https://issues.apache.org/jira/browse/HELIX-717 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > This is needed for supporting quota based task assignment -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-718) Implement TaskAssignment logics
[ https://issues.apache.org/jira/browse/HELIX-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537622#comment-16537622 ] ASF GitHub Bot commented on HELIX-718: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/220 [HELIX-718] implement TaskAssignResult Implement TaskAssignResult as a part of task assigner You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/task-assign-result Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/220.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #220 commit 701947d5a033792f21dd2796a29577702782fd26 Author: Harry Zhang Date: 2018-07-09T21:22:20Z [HELIX-718] implement TaskAssignResult > Implement TaskAssignment logics > --- > > Key: HELIX-718 > URL: https://issues.apache.org/jira/browse/HELIX-718 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Implement assignment logics: > TaskAssigner, TaskAssignResult, AssignableInstance -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-717) Add api for get / set quota type, ratio and participant capacity
[ https://issues.apache.org/jira/browse/HELIX-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537601#comment-16537601 ] ASF GitHub Bot commented on HELIX-717: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/219 [HELIX-717] Add api for get / set quota type, ratio and participant capacity Add api for get / set quota type, ratio and participant capacity You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/task-quota Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/219.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #219 commit 9ff603e9c39b53d5035cccb31fcf6edf82d97f18 Author: Harry Zhang Date: 2018-07-09T21:07:56Z [HELIX-717] Add api for get / set quota type, ratio and participant capacity > Add api for get / set quota type, ratio and participant capacity > > > Key: HELIX-717 > URL: https://issues.apache.org/jira/browse/HELIX-717 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > This is needed for supporting quota based task assignment -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-713) Remove unused imports in TaskAssignmentCalculator
[ https://issues.apache.org/jira/browse/HELIX-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537558#comment-16537558 ] ASF GitHub Bot commented on HELIX-713: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/213 > Remove unused imports in TaskAssignmentCalculator > - > > Key: HELIX-713 > URL: https://issues.apache.org/jira/browse/HELIX-713 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > Remove unused imports in TaskAssignmentCalculator -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-709) Prepare controller stages for async execution
[ https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537546#comment-16537546 ] ASF GitHub Bot commented on HELIX-709: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/214 > Prepare controller stages for async execution > - > > Key: HELIX-709 > URL: https://issues.apache.org/jira/browse/HELIX-709 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > There are a couple of stages in the Helix controller that can be executed > asynchronously, but each execution should be done in order. Currently for > the Helix controller, we have a thread pool for un-ordered execution, but we also > need one for ordered execution. > This ticket should do the following: > 1. Create a pool of configurable workers using DedupEventProcessor > 2. Create AbstractAsyncBaseStage for those stages that can be executed > asynchronously to share common code > AC: > Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, > pass all tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
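The ordered, de-duplicated execution this ticket asks for can be illustrated with a small queue. This is only a sketch under stated assumptions, not the Helix DedupEventProcessor or DedupFIFOWorkerPool: the class `DedupFifoQueueSketch` and its methods are hypothetical. It assumes that at most one event per key needs to stay pending, that a newer event for the same key replaces the pending one, and that keys are served in the order they first arrived.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the Helix DedupEventProcessor implementation.
final class DedupFifoQueueSketch<K, E> {
  // LinkedHashMap keeps insertion order; re-putting an existing key
  // replaces the value but leaves the key's position unchanged.
  private final Map<K, E> pending = new LinkedHashMap<>();

  // Queue an event, replacing any pending event with the same key.
  synchronized void put(K key, E event) {
    pending.put(key, event);
  }

  // Remove and return the oldest pending event, or null if none is pending.
  synchronized E poll() {
    Iterator<Map.Entry<K, E>> it = pending.entrySet().iterator();
    if (!it.hasNext()) {
      return null;
    }
    E event = it.next().getValue();
    it.remove();
    return event;
  }

  // Number of distinct keys that currently have a pending event.
  synchronized int size() {
    return pending.size();
  }
}
```

A single worker thread that drains such a queue with `poll()` would process events in order; several queues with one worker each would give a pool of ordered workers.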
[jira] [Commented] (HELIX-709) Prepare controller stages for async execution
[ https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537428#comment-16537428 ] ASF GitHub Bot commented on HELIX-709: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/214 [HELIX-709] Move external view calculation to async stage and re-organize pipeline - Separated controller pipeline to execute external view compute async and as early as possible - renamed AbstractAsyncBaseStage - fixed NPE in callback handler - all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/async-ev Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/214.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #214 commit 542fbc840a167986a40bd57f3c5660d294acb63c Author: Harry Zhang Date: 2018-07-09T19:16:56Z [HELIX-709] Move external view calculation to async stage and re-organize pipeline > Prepare controller stages for async execution > - > > Key: HELIX-709 > URL: https://issues.apache.org/jira/browse/HELIX-709 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > There are a couple of stages in the Helix controller that can be executed > asynchronously, but each execution should be done in order. Currently for > the Helix controller, we have a thread pool for un-ordered execution, but we also > need one for ordered execution. > This ticket should do the following: > 1. Create a pool of configurable workers using DedupEventProcessor > 2. Create AbstractAsyncBaseStage for those stages that can be executed > asynchronously to share common code > AC: > Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, > pass all tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-713) Remove unused imports in TaskAssignmentCalculator
[ https://issues.apache.org/jira/browse/HELIX-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537425#comment-16537425 ] ASF GitHub Bot commented on HELIX-713: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/213 [HELIX-713] Remove unused imports in TaskAssignmentCalculator [HELIX-713] Remove unused imports in TaskAssignmentCalculator You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix 1301279 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #213 commit 7ab70a3d57454a8836cd13d8bb172e2b460474f6 Author: Hunter Lee Date: 2018-07-09T19:09:30Z [HELIX-713] Remove unused imports in TaskAssignmentCalculator > Remove unused imports in TaskAssignmentCalculator > - > > Key: HELIX-713 > URL: https://issues.apache.org/jira/browse/HELIX-713 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > Remove unused imports in TaskAssignmentCalculator -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-709) Prepare controller stages for async execution
[ https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526969#comment-16526969 ] ASF GitHub Bot commented on HELIX-709: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/208 > Prepare controller stages for async execution > - > > Key: HELIX-709 > URL: https://issues.apache.org/jira/browse/HELIX-709 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > There are a couple of stages in the Helix controller that can be executed > asynchronously, but each execution should be done in order. Currently for > the Helix controller, we have a thread pool for un-ordered execution, but we also > need one for ordered execution. > This ticket should do the following: > 1. Create a pool of configurable workers using DedupEventProcessor > 2. Create AbstractAsyncBaseStage for those stages that can be executed > asynchronously to share common code > AC: > Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, > pass all tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-710) Create abstract state model for distributed leader standby helix service
[ https://issues.apache.org/jira/browse/HELIX-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526856#comment-16526856 ] ASF GitHub Bot commented on HELIX-710: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/209 > Create abstract state model for distributed leader standby helix service > > > Key: HELIX-710 > URL: https://issues.apache.org/jira/browse/HELIX-710 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > In order to implement state model def for other helix services, I'd prefer to > abstract an interface that helix service would use, to avoid duplicated code. > AC: > - implement AbstractHelixLeaderStandbyStateModel and implement cluster > controller state model with it. The abstract model can also be used by other > helix services -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-710) Create abstract state model for distributed leader standby helix service
[ https://issues.apache.org/jira/browse/HELIX-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526817#comment-16526817 ] ASF GitHub Bot commented on HELIX-710: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/209 [HELIX-710] Create abstract state model for distributed leader standby helix service This RB abstracts a leader standby state model that helix services such as controller or other services would commonly use. This reduces duplicated code and simplifies state model implementation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/abstract-ls-state-model Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/209.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #209 commit 4a99bc43c6f22e478a49fb7f2bbac42d608f17b5 Author: Harry Zhang Date: 2018-06-28T21:32:51Z [HELIX-710] Create abstract state model for distributed leader standby helix service > Create abstract state model for distributed leader standby helix service > > > Key: HELIX-710 > URL: https://issues.apache.org/jira/browse/HELIX-710 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > In order to implement state model def for other helix services, I'd prefer to > abstract an interface that helix service would use, to avoid duplicated code. > AC: > - implement AbstractHelixLeaderStandbyStateModel and implement cluster > controller state model with it. The abstract model can also be used by other > helix services -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-709) Prepare controller stages for async execution
[ https://issues.apache.org/jira/browse/HELIX-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526801#comment-16526801 ] ASF GitHub Bot commented on HELIX-709: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/208 [HELIX-709] Prepare controller stages for async execution - Implemented AbstractAsyncBaseStage - Refactored TEVCalcState and PersistAssignmentStage to use AbstractAsyncBaseStage You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/aabs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/208.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #208 commit 9080c64429d724aa959207411ca06d690f5ee840 Author: Harry Zhang Date: 2018-06-28T21:25:21Z [HELIX-709] Prepare controller stages for async execution > Prepare controller stages for async execution > - > > Key: HELIX-709 > URL: https://issues.apache.org/jira/browse/HELIX-709 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > There are a couple of stages in the Helix controller that can be executed > asynchronously, but each execution should be done in order. Currently for > the Helix controller, we have a thread pool for un-ordered execution, but we also > need one for ordered execution. > This ticket should do the following: > 1. Create a pool of configurable workers using DedupEventProcessor > 2. Create AbstractAsyncBaseStage for those stages that can be executed > asynchronously to share common code > AC: > Create AbstractAsyncBaseStage and DedupFIFOWorkerPool for async execution, > pass all tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-706) ExternalViewGeneration should be executed asynchronously in Helix controller
[ https://issues.apache.org/jira/browse/HELIX-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526787#comment-16526787 ] ASF GitHub Bot commented on HELIX-706: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/206 > ExternalViewGeneration should be executed asynchronously in Helix controller > > > Key: HELIX-706 > URL: https://issues.apache.org/jira/browse/HELIX-706 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > EV generation should not block helix resource rebalance. According to our > profiling results, external view generation takes ~ 1/5 of the pipeline > latency. > The goal is to generate external view asynchronously, and hopefully we can > have 20% improvement in rebalance pipeline -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-707) Fix topstate handoff metrics.
[ https://issues.apache.org/jira/browse/HELIX-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524390#comment-16524390 ] ASF GitHub Bot commented on HELIX-707: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/207 > Fix topstate handoff metrics. > - > > Key: HELIX-707 > URL: https://issues.apache.org/jira/browse/HELIX-707 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.8.x >Reporter: Jiajun Wang >Assignee: Jiajun Wang >Priority: Major > > We've confirmed a bug in the logic that calculates topstate handoff duration. > With this issue, if the previous master instance is offline, an older handoff > start time could be used to calculate the duration. > This results in huge handoff duration in the Helix metrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-707) Fix topstate handoff metrics.
[ https://issues.apache.org/jira/browse/HELIX-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524340#comment-16524340 ] ASF GitHub Bot commented on HELIX-707: -- GitHub user jiajunwang opened a pull request: https://github.com/apache/helix/pull/207 [HELIX-707] Fix topstate handoff metrics. We've confirmed a bug in the logic that calculates topstate handoff duration. With this issue, if the previous master instance is offline, an older handoff start time could be used to calculate the duration. This results in huge handoff duration in the Helix metrics. This change will fix this bug. If the previous node that holds topstate replica goes to offline, the offline time will be used as the start time. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jiajunwang/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/207.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #207 commit 7753b602cee6d08a8326a68f899cb089378aae9f Author: Jiajun Wang Date: 2018-04-27T17:56:43Z Fix topstate handoff metrics. We've confirmed a bug in the logic that calculates topstate handoff duration. With this issue, if the previous master instance is offline, an older handoff start time could be used to calculate the duration. This results in huge handoff duration in the Helix metrics. This change will fix this bug. If the previous node that holds topstate replica goes to offline, the offline time will be used as the start time. RB=1295351 G=helix-reviewers A=lxia,hrzhang > Fix topstate handoff metrics. 
> - > > Key: HELIX-707 > URL: https://issues.apache.org/jira/browse/HELIX-707 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.8.x >Reporter: Jiajun Wang >Assignee: Jiajun Wang >Priority: Major > > We've confirmed a bug in the logic that calculates topstate handoff duration. > With this issue, if the previous master instance is offline, an older handoff > start time could be used to calculate the duration. > This results in huge handoff duration in the Helix metrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
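The start-time rule described in this fix can be sketched as below. This is an illustration only, not the Helix monitoring code: the class `HandoffDurationSketch`, the sentinel `NOT_OFFLINE`, and the parameter names are hypothetical, and all timestamps are assumed to be epoch milliseconds.

```java
// Hypothetical sketch of choosing the handoff start time; not Helix code.
final class HandoffDurationSketch {
  // Sentinel meaning "the previous top-state holder never went offline".
  static final long NOT_OFFLINE = -1L;

  /**
   * @param recordedStartMs previously recorded handoff start time (may be stale)
   * @param prevHolderOfflineMs time the previous top-state holder went offline, or NOT_OFFLINE
   * @param endMs time the new top-state replica was observed
   * @return handoff duration in milliseconds
   */
  static long durationMs(long recordedStartMs, long prevHolderOfflineMs, long endMs) {
    // If the previous holder went offline, its offline time is the start time.
    long startMs = prevHolderOfflineMs != NOT_OFFLINE ? prevHolderOfflineMs : recordedStartMs;
    return endMs - startMs;
  }
}
```

With a stale recorded start of 1000 ms and an offline time of 9000 ms, a handoff that ends at 9500 ms is reported as 500 ms rather than 8500 ms.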
[jira] [Commented] (HELIX-706) ExternalViewGeneration should be executed asynchronously in Helix controller
[ https://issues.apache.org/jira/browse/HELIX-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524313#comment-16524313 ] ASF GitHub Bot commented on HELIX-706: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/206 [HELIX-706] process tev and persist assignment asynchronously Added async worker in generic helix controller to process persist assignment stage and tev generation state asynchronously You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/async-ev Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/206.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #206 commit bb4ffd7e5663377427a5ad5988948659dd0db378 Author: Harry Zhang Date: 2018-06-26T23:05:50Z [HELIX-706] process tev and persist assignment asynchronously > ExternalViewGeneration should be executed asynchronously in Helix controller > > > Key: HELIX-706 > URL: https://issues.apache.org/jira/browse/HELIX-706 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > EV generation should not block helix resource rebalance. According to our > profiling results, external view generation takes ~ 1/5 of the pipeline > latency. > The goal is to generate external view asynchronously, and hopefully we can > have 20% improvement in rebalance pipeline -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-705) Participant duplicated state transition handling rework
[ https://issues.apache.org/jira/browse/HELIX-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524288#comment-16524288 ] ASF GitHub Bot commented on HELIX-705: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/204 > Participant duplicated state transition handling rework > --- > > Key: HELIX-705 > URL: https://issues.apache.org/jira/browse/HELIX-705 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Helix should have some re-work on participant side message handling: > - Duplicated message in same batch: discard the later one > - Duplicated message in different batches: the later one should be discarded > if the first one is in progress > - During state transition, we should not rely on current state delta to get > partition's current state, but should lock on state model def (thread safety) > - Duplicated state transition (toState == currentState) should not result in > error, which is confusing, but should report success -- This message was sent by Atlassian JIRA (v7.6.3#76005)
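The dedup rules listed in this ticket can be sketched as follows. This is a minimal illustration, not the HelixTaskExecutor code: the classes `TransitionDedupSketch` and `Msg` are hypothetical, and the sketch assumes two messages count as duplicates when they target the same partition and the same state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the dedup rules; not the HelixTaskExecutor code.
final class TransitionDedupSketch {
  static final class Msg {
    final String partition;
    final String toState;

    Msg(String partition, String toState) {
      this.partition = partition;
      this.toState = toState;
    }

    // Assumption: messages are duplicates when partition and target state match.
    String key() {
      return partition + "->" + toState;
    }
  }

  // Keys of transitions that were accepted and have not finished yet.
  private final Set<String> inProgress = new HashSet<>();

  // Returns the messages of one batch that should be executed.
  synchronized List<Msg> accept(List<Msg> batch) {
    List<Msg> accepted = new ArrayList<>();
    for (Msg m : batch) {
      // Set.add returns false for a key that is already present, which covers
      // both a duplicate earlier in this batch and one still in progress.
      if (inProgress.add(m.key())) {
        accepted.add(m);
      }
    }
    return accepted;
  }

  // Called when a transition finishes, so a later identical message is allowed again.
  synchronized void finished(Msg m) {
    inProgress.remove(m.key());
  }

  // A transition to the state the partition is already in is treated as a no-op success.
  static boolean isNoOp(String currentState, String toState) {
    return currentState.equals(toState);
  }
}
```

Holding one lock around both the check and the update is what keeps the in-progress view consistent across threads in this sketch.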
[jira] [Commented] (HELIX-703) Change print statement to log statement
[ https://issues.apache.org/jira/browse/HELIX-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524264#comment-16524264 ] ASF GitHub Bot commented on HELIX-703: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/203 > Change print statement to log statement > --- > > Key: HELIX-703 > URL: https://issues.apache.org/jira/browse/HELIX-703 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Reporter: Hunter L >Priority: Major > > Change print statement to log statement -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-705) Participant duplicated state transition handling rework
[ https://issues.apache.org/jira/browse/HELIX-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522917#comment-16522917 ] ASF GitHub Bot commented on HELIX-705: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/204 [HELIX-705]: Participant duplicated state transition handling rework Re-implemented helix task executor state transition message dedup logic, and added tests for verifying it: - Duplicated message in same batch: discard the later one - Duplicated message in different batches: the later one should be discarded if the first one is in progress - During state transition, we should not rely on current state delta to get partition's current state, but should lock on state model def (thread safety) - Duplicated state transition (toState == currentState) should not result in error, which is confusing, but should report success You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/participant-st-dedup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/204.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #204 commit 04f1ba9701ccfb4c55d44ab4bc159577c3afd68b Author: Harry Zhang Date: 2018-06-25T22:55:14Z [HELIX-705]: Participant duplicated state transition handling rework > Participant duplicated state transition handling rework > --- > > Key: HELIX-705 > URL: https://issues.apache.org/jira/browse/HELIX-705 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Helix should have some re-work on participant side message handling: > - Duplicated message in same batch: discard the later one > - Duplicated message in different batches: the later one should be discarded > if the first one is in progress > - During state transition, we should not rely on current state delta to get > partition's current state, but should lock on state model def (thread safety) > - Duplicated state transition (toState == currentState) should not result in > error, which is confusing, but should report success -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-703) Change print statement to log statement
[ https://issues.apache.org/jira/browse/HELIX-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522846#comment-16522846 ] ASF GitHub Bot commented on HELIX-703: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/203 [HELIX-703] Change print statement to log statement You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/203.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #203 commit 0d77cbafc6534dd7b0e9867b1dbf8a2266fd2281 Author: Hunter Lee Date: 2018-06-25T21:31:00Z [HELIX-703] Change print statement to log statement > Change print statement to log statement > --- > > Key: HELIX-703 > URL: https://issues.apache.org/jira/browse/HELIX-703 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Reporter: Hunter L >Priority: Major > > Change print statement to log statement -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-701) Potential ugly NPE
[ https://issues.apache.org/jira/browse/HELIX-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471422#comment-16471422 ]
ASF GitHub Bot commented on HELIX-701:
--
Github user brettKK commented on the issue: https://github.com/apache/helix/pull/200
@lujiefsi, https://issues.apache.org/jira/browse/HELIX-701 has been automatically associated with this PR.
> Potential ugly NPE
> --
>
> Key: HELIX-701
> URL: https://issues.apache.org/jira/browse/HELIX-701
> Project: Apache Helix
> Issue Type: Bug
> Reporter: brettkk
> Priority: Major
>
> We have developed a static analysis tool
> [NPEDetector|https://github.com/lujiefsi/NPEDetector] to find potential
> NPEs. Our analysis shows that some callees may return null in corner cases (e.g.
> node crash, IOException); some of their callers have a _!=null_ check but
> some do not. In this issue we post a patch that adds !=null checks based
> on the existing !=null checks. For example:
> ZkGrep#parseZkSnapshot:
> {code:java}
> return retFiles;
> } catch (Exception e) {
> LOG.error("fail to parse zkSnapshot: " + lastZkSnapshot, e);
> }
> return null;
> {code}
> So parseZkSnapshot will return null when an IOException happens, but its caller
> ZkGrep#processCommandLineArgs has no null check:
> {code:java}
> File[] lastZkSnapshot = parseZkSnapshot(zkDataDirs[1], byTime);
> // lastZkSnapshot[1] is the parsed last snapshot by byTime
> grepZkSnapshot(lastZkSnapshot[1], patterns);
> {code}
> We should terminate the process when lastZkSnapshot == null
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-701) Potential ugly NPE
[ https://issues.apache.org/jira/browse/HELIX-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470359#comment-16470359 ]
ASF GitHub Bot commented on HELIX-701:
--
Github user lujiefsi commented on the issue: https://github.com/apache/helix/pull/200
We should combine https://issues.apache.org/jira/browse/HELIX-701 with this pull request.
> Potential ugly NPE
> --
>
> Key: HELIX-701
> URL: https://issues.apache.org/jira/browse/HELIX-701
> Project: Apache Helix
> Issue Type: Bug
> Reporter: brettkk
> Priority: Major
>
> We have developed a static analysis tool
> [NPEDetector|https://github.com/lujiefsi/NPEDetector] to find potential
> NPEs. Our analysis shows that some callees may return null in corner cases (e.g.
> node crash, IOException); some of their callers have a _!=null_ check but
> some do not. In this issue we post a patch that adds !=null checks based
> on the existing !=null checks. For example:
> ZkGrep#parseZkSnapshot:
> {code:java}
> return retFiles;
> } catch (Exception e) {
> LOG.error("fail to parse zkSnapshot: " + lastZkSnapshot, e);
> }
> return null;
> {code}
> So parseZkSnapshot will return null when an IOException happens, but its caller
> ZkGrep#processCommandLineArgs has no null check:
> {code:java}
> File[] lastZkSnapshot = parseZkSnapshot(zkDataDirs[1], byTime);
> // lastZkSnapshot[1] is the parsed last snapshot by byTime
> grepZkSnapshot(lastZkSnapshot[1], patterns);
> {code}
> We should terminate the process when lastZkSnapshot == null
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
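The kind of guard the issue asks for might look like the sketch below. It is not the actual ZkGrep patch: the class `SnapshotGuardSketch` and both of its methods are hypothetical, `parseOrNull` merely stands in for a parser that returns null on failure, and an exception is thrown in place of terminating the process so that the behaviour can be checked.

```java
import java.io.File;

// Hypothetical sketch of the proposed guard; not the actual ZkGrep patch.
final class SnapshotGuardSketch {
  // Stand-in for a parser that returns null on failure,
  // as parseZkSnapshot is described to do in the issue.
  static File[] parseOrNull(boolean fail) {
    return fail ? null : new File[] {new File("snapshot.0"), new File("snapshot.1")};
  }

  // Returns the snapshot to grep, or throws instead of dereferencing a null array.
  static File lastSnapshot(File[] parsed) {
    if (parsed == null || parsed.length < 2) {
      throw new IllegalStateException("failed to parse zkSnapshot; aborting");
    }
    return parsed[1];
  }
}
```

A command-line tool could catch the exception at the top level, log it, and exit with a non-zero status.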
[jira] [Commented] (HELIX-681) Participant should not fail state transition on fail to delete / relay message
[ https://issues.apache.org/jira/browse/HELIX-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451476#comment-16451476 ] ASF GitHub Bot commented on HELIX-681: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/197 > Participant should not fail state transition on fail to delete / relay message > -- > > Key: HELIX-681 > URL: https://issues.apache.org/jira/browse/HELIX-681 > Project: Apache Helix > Issue Type: Bug >Reporter: Hao Zhang >Priority: Major > > Currently we have a general try-catch block in HelixTask and > HelixTaskExecutor, which, upon any exception thrown from state transition > routine, will fail state transition. However there are at least the following > cases in which state transition should be considered as successful: > * When we fail to delete message after successfully handled message and > updated current state -> this is because we already completed state > transition and current state is consistent between participant and ZK > * When we fail to send out relay message > as relay message provides only > best effort of delivering messages, which has nothing to do with state > transition's results. In case of fail to relay message, controller will > resend message which ensures correctness. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
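The behavior described in HELIX-681 can be sketched as follows: only a failure of the state transition itself fails the task, while message deletion and relay are treated as best effort. Everything here is hypothetical (the Step interface, the method names, the logging to stderr); it is not the HelixTask or HelixTaskExecutor code.

```java
final class TransitionOutcomeSketch {
  /** A step that may fail. Hypothetical; not a Helix type. */
  interface Step {
    void run() throws Exception;
  }

  // Returns true when the state transition should be reported as successful.
  static boolean runTransition(Step transition, Step deleteMessage, Step relayMessage) {
    try {
      transition.run();
    } catch (Exception e) {
      return false; // the transition itself failed
    }
    // The transition is done and current state is already updated, so
    // failures in the follow-up steps are logged but do not fail the task.
    bestEffort("delete message", deleteMessage);
    bestEffort("send relay message", relayMessage);
    return true;
  }

  private static void bestEffort(String what, Step step) {
    try {
      step.run();
    } catch (Exception e) {
      System.err.println("Ignoring failure to " + what + ": " + e);
    }
  }

  public static void main(String[] args) {
    Step ok = () -> { };
    Step failing = () -> { throw new java.io.IOException("ZK unavailable"); };
    System.out.println(runTransition(ok, failing, failing)); // cleanup failures ignored
    System.out.println(runTransition(failing, ok, ok));      // transition failure reported
  }
}
```

Separating the two try blocks is the point of the design: a single catch-all around all three steps cannot tell a failed transition from a failed cleanup.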
[jira] [Commented] (HELIX-681) Participant should not fail state transition on fail to delete / relay message
[ https://issues.apache.org/jira/browse/HELIX-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451423#comment-16451423 ] ASF GitHub Bot commented on HELIX-681: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/197 [HELIX-681] change controller msg purge timeout to larger number Changed message purge delay to 1min, updated tests accordingly. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/ctl-msg-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/197.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #197 commit 4e02cbb9945279b7085e5c725b9d966b90086cc7 Author: Harry Zhang Date: 2018-04-24T23:46:14Z [HELIX-681] change controller msg purge timeout to larger number > Participant should not fail state transition on fail to delete / relay message > -- > > Key: HELIX-681 > URL: https://issues.apache.org/jira/browse/HELIX-681 > Project: Apache Helix > Issue Type: Bug >Reporter: Hao Zhang >Priority: Major > > Currently we have a general try-catch block in HelixTask and > HelixTaskExecutor, which, upon any exception thrown from state transition > routine, will fail state transition. However there are at least the following > cases in which state transition should be considered as successful: > * When we fail to delete message after successfully handled message and > updated current state -> this is because we already completed state > transition and current state is consistent between participant and ZK > * When we fail to send out relay message > as relay message provides only > best effort of delivering messages, which has nothing to do with state > transition's results. In case of fail to relay message, controller will > resend message which ensures correctness. 
[jira] [Commented] (HELIX-682) Stale message should not prevent controller from rebalancing resource
[ https://issues.apache.org/jira/browse/HELIX-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451357#comment-16451357 ] ASF GitHub Bot commented on HELIX-682: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/195 > Stale message should not prevent controller from rebalancing resource > - > > Key: HELIX-682 > URL: https://issues.apache.org/jira/browse/HELIX-682 > Project: Apache Helix > Issue Type: Bug >Reporter: Hao Zhang >Priority: Major > > Currently during MessageGenerationPhase, we skip re-balancing when there is > pending message. Though we assume that participant will delete messages when > they finish the task, there will be cases that when ZK is not stable and > participant fail to do so, which will leave message un-deleted and thus block > rebalance. > Ideally on controller side, we should try to delete message as well: if > partition's current state is same as message's toState, or there is totally > invalid message remaining, controller should try to delete message to unblock > rebalancing -- This message was sent by Atlassian JIRA (v7.6.3#76005)
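The controller-side check proposed in HELIX-682 might be expressed as a small predicate. This is a sketch under assumptions: the issue does not define "totally invalid", so here it is taken to mean that the partition's current state matches neither the message's fromState nor its toState. The class and method names are hypothetical.

```java
final class StaleMessageSketch {
  // Returns true when the controller may delete a pending message to unblock rebalancing.
  static boolean shouldDelete(String currentState, String fromState, String toState) {
    if (currentState == null) {
      return false; // nothing known about the partition yet; leave the message alone
    }
    // Case 1 (from the issue): the transition has already been applied.
    boolean alreadyApplied = currentState.equals(toState);
    // Case 2 (assumed meaning of "totally invalid"): the message can neither
    // be applied from the current state nor has it been applied.
    boolean invalid = !currentState.equals(fromState) && !alreadyApplied;
    return alreadyApplied || invalid;
  }

  public static void main(String[] args) {
    System.out.println(shouldDelete("MASTER", "SLAVE", "MASTER"));
    System.out.println(shouldDelete("SLAVE", "SLAVE", "MASTER"));
  }
}
```

A message whose fromState still matches the current state is left in place, because the participant may simply not have processed it yet.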
[jira] [Commented] (HELIX-682) Stale message should not prevent controller from rebalancing resource
[ https://issues.apache.org/jira/browse/HELIX-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451348#comment-16451348 ] ASF GitHub Bot commented on HELIX-682: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/195 [HELIX-682] delete duplicated message and log error in HelixTaskExecutor on participant This PR is the second part of message dedup on participant side You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/participant-msg-dedup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/195.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #195 commit 8aba9bea0734da11722fbc8cceb74f34dd6a37c6 Author: Harry Zhang Date: 2018-04-24T22:34:08Z [HELIX-682] delete duplicated message and log error in HelixTaskExecutor on participant > Stale message should not prevent controller from rebalancing resource > - > > Key: HELIX-682 > URL: https://issues.apache.org/jira/browse/HELIX-682 > Project: Apache Helix > Issue Type: Bug >Reporter: Hao Zhang >Priority: Major > > Currently during MessageGenerationPhase, we skip re-balancing when there is > pending message. Though we assume that participant will delete messages when > they finish the task, there will be cases that when ZK is not stable and > participant fail to do so, which will leave message un-deleted and thus block > rebalance. > Ideally on controller side, we should try to delete message as well: if > partition's current state is same as message's toState, or there is totally > invalid message remaining, controller should try to delete message to unblock > rebalancing -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-674) Constraint Based Resource Rebalancer
[ https://issues.apache.org/jira/browse/HELIX-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449076#comment-16449076 ] ASF GitHub Bot commented on HELIX-674: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/145 > Constraint Based Resource Rebalancer > > > Key: HELIX-674 > URL: https://issues.apache.org/jira/browse/HELIX-674 > Project: Apache Helix > Issue Type: New Feature >Reporter: Jiajun Wang >Assignee: Jiajun Wang >Priority: Major > Fix For: 0.8.x > > Attachments: Constraint-BasedResourceRebalancing-080318-2226-240.pdf > > > Helix rebalancer assigns resources according to different strategies. > Recently, we optimize the strategy for evenness and minimize movement. > However, the evenness here only applies to partition numbers. Moreover, we've > got more requests for customizable rebalancer from our users. > Take partition weight as an example: > In reality, partition replicas have different size. We use "partition weight" > as an abstraction of the partition size. It can be network traffic usage, > disk usage, or any other combined factors. > Given each partition may have different weights, Helix should be able to > assign partition accordingly. So that the distribution would be even > regarding the weight. > In this project, we are planning new rebalancer mechanism that generates > resource partition assignment according to a list of "constraints". Current > rebalance strategy can be regarded as one kind of constraint. Moving forward, > Helix users would be able to extend the constraint interface using their own > logic. > Some init discussions are in progress and we will have a proposal posted here > soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-674) Constraint Based Resource Rebalancer
[ https://issues.apache.org/jira/browse/HELIX-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448868#comment-16448868 ] ASF GitHub Bot commented on HELIX-674: -- Github user lei-xia commented on the issue: https://github.com/apache/helix/pull/145 Can you rebase to HEAD? > Constraint Based Resource Rebalancer > > > Key: HELIX-674 > URL: https://issues.apache.org/jira/browse/HELIX-674 > Project: Apache Helix > Issue Type: New Feature >Reporter: Jiajun Wang >Assignee: Jiajun Wang >Priority: Major > Fix For: 0.8.x > > Attachments: Constraint-BasedResourceRebalancing-080318-2226-240.pdf > > > Helix rebalancer assigns resources according to different strategies. > Recently, we optimize the strategy for evenness and minimize movement. > However, the evenness here only applies to partition numbers. Moreover, we've > got more requests for customizable rebalancer from our users. > Take partition weight as an example: > In reality, partition replicas have different size. We use "partition weight" > as an abstraction of the partition size. It can be network traffic usage, > disk usage, or any other combined factors. > Given each partition may have different weights, Helix should be able to > assign partition accordingly. So that the distribution would be even > regarding the weight. > In this project, we are planning new rebalancer mechanism that generates > resource partition assignment according to a list of "constraints". Current > rebalance strategy can be regarded as one kind of constraint. Moving forward, > Helix users would be able to extend the constraint interface using their own > logic. > Some init discussions are in progress and we will have a proposal posted here > soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-696) Workflow state messed up after timeout, and is not cleaned
[ https://issues.apache.org/jira/browse/HELIX-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446024#comment-16446024 ] ASF GitHub Bot commented on HELIX-696: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/191 > Workflow state messed up after timeout, and is not cleaned > -- > > Key: HELIX-696 > URL: https://issues.apache.org/jira/browse/HELIX-696 > Project: Apache Helix > Issue Type: Bug >Reporter: Hao Zhang >Priority: Major > > Couple of problems with current workflow finish handling logic: > # After timeout, timer is not scheduled to clean it up when workflow expires > # After timeout, state handling logic is messy that previously stopped > workflow states flip-flop between TIMED_OUT and STOPPED > # MBean is not updated correctly as we update latency before setting finish > time -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-696) Workflow state messed up after timeout, and is not cleaned
[ https://issues.apache.org/jira/browse/HELIX-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444970#comment-16444970 ]

ASF GitHub Bot commented on HELIX-696:
--------------------------------------

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/191

    [HELIX-696] fix workflow state flip-flop issue

    Fixed issues:
    * After timeout, no timer is scheduled to clean up the workflow when it expires
    * After timeout, the state handling logic is messy: previously stopped workflow states flip-flop between TIMED_OUT and STOPPED
    * MBean is not updated correctly because we update latency before setting finish time

    Added tests to verify that the changes work. Note that the current task framework logic is messy, and this PR focuses on fixing these issues rather than on a major refactor, which is enough given that we are working on task framework 2.0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/workflow-state-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/191.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #191

commit ecacb49b33327ec2dd3d50471f88c99d138ba24c
Author: Harry Zhang
Date:   2018-04-19T23:13:13Z

    [HELIX-696] fix workflow state flip-flop issue

> Workflow state messed up after timeout, and is not cleaned
> ----------------------------------------------------------
>
>                 Key: HELIX-696
>                 URL: https://issues.apache.org/jira/browse/HELIX-696
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> Couple of problems with current workflow finish handling logic:
> # After timeout, timer is not scheduled to clean it up when workflow expires
> # After timeout, state handling logic is messy that previously stopped
> workflow states flip-flop between TIMED_OUT and STOPPED
> # MBean is not updated correctly as we update latency before setting finish
> time

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-699) Compare InstanceConfigs using their IDs in RoutingTable
[ https://issues.apache.org/jira/browse/HELIX-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444811#comment-16444811 ] ASF GitHub Bot commented on HELIX-699: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/188 > Compare InstanceConfigs using their IDs in RoutingTable > --- > > Key: HELIX-699 > URL: https://issues.apache.org/jira/browse/HELIX-699 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > A possible race condition was causing a NPE on InstanceConfig.getHostName(). > Instead of comparing hostnames and ports, we compare IDs, which are supposed > to be concatenation of instance name, hostname, and port anyways and should > always be set. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
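The idea in HELIX-699, comparing by an ID that is always set rather than by a hostname that may be null, can be illustrated as below. The Config class is a hypothetical stand-in for InstanceConfig, and the ID format used in main is made up for the example.

```java
import java.util.Comparator;
import java.util.Objects;

final class InstanceIdCompareSketch {
  /** Hypothetical stand-in for InstanceConfig. */
  static final class Config {
    final String id;       // assumed to be always set
    final String hostName; // may be null, e.g. while the config is being populated

    Config(String id, String hostName) {
      this.id = Objects.requireNonNull(id, "id");
      this.hostName = hostName;
    }
  }

  // Comparing on id never touches the nullable hostName, so it cannot hit the NPE.
  static final Comparator<Config> BY_ID = Comparator.comparing((Config c) -> c.id);

  public static void main(String[] args) {
    Config a = new Config("host1_12918", null);
    Config b = new Config("host2_12918", "host2");
    System.out.println(BY_ID.compare(a, b) < 0);
  }
}
```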
[jira] [Commented] (HELIX-698) Add periodic refresh to RoutingTableProvider
[ https://issues.apache.org/jira/browse/HELIX-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444800#comment-16444800 ] ASF GitHub Bot commented on HELIX-698: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/187 > Add periodic refresh to RoutingTableProvider > - > > Key: HELIX-698 > URL: https://issues.apache.org/jira/browse/HELIX-698 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > There have been incidents where RoutingTableProvider was not getting a proper > refresh potentially due to the lag in ZKClient CallbackHandler or > connectivity issues. This addition of periodic refresh avoids cases where > RoutingTableProvider is severely delayed by initiating periodic refreshes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
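A periodic refresh of the kind described in HELIX-698 is commonly built on java.util.concurrent.ScheduledExecutorService. The sketch below shows that general pattern only; the class name, the refresh callback, and the period are assumptions, and it does not reproduce how RoutingTableProvider actually schedules its refresh.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class PeriodicRefreshSketch implements AutoCloseable {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  PeriodicRefreshSketch(Runnable refresh, long periodMs) {
    // scheduleAtFixedRate cancels all later runs if one run throws,
    // so exceptions are caught and logged inside the task.
    Runnable guarded = () -> {
      try {
        refresh.run();
      } catch (RuntimeException e) {
        System.err.println("Periodic refresh failed: " + e);
      }
    };
    scheduler.scheduleAtFixedRate(guarded, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  @Override
  public void close() {
    scheduler.shutdownNow(); // stop refreshing and release the worker thread
  }

  public static void main(String[] args) throws InterruptedException {
    try (PeriodicRefreshSketch refresher =
        new PeriodicRefreshSketch(() -> System.out.println("refresh"), 10)) {
      Thread.sleep(50);
    }
  }
}
```

The periodic task complements the callback-driven refresh rather than replacing it: if a ZooKeeper callback is delayed or lost, the next scheduled run bounds how stale the routing table can get.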
[jira] [Commented] (HELIX-697) Add cluster level metrics in ClusterStatusMonitor
[ https://issues.apache.org/jira/browse/HELIX-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444788#comment-16444788 ] ASF GitHub Bot commented on HELIX-697: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/186 > Add cluster level metrics in ClusterStatusMonitor > - > > Key: HELIX-697 > URL: https://issues.apache.org/jira/browse/HELIX-697 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > Add cluster level metrics in ClusterStatusMonitor -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-699) Compare InstanceConfigs using their IDs in RoutingTable
[ https://issues.apache.org/jira/browse/HELIX-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444766#comment-16444766 ] ASF GitHub Bot commented on HELIX-699: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/188 [HELIX-699] Compare InstanceConfigs using their IDs in RoutingTable A possible race condition was causing a NPE on InstanceConfig.getHostName(). Instead of comparing hostnames and ports, we compare IDs, which are supposed to be concatenation of instance name, hostname, and port anyways and should always be set. You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix instConfigNullCheck Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/188.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #188 commit 2b076c1f97dca95ef4ad817fd45d47c1ec4ff337 Author: Hunter Lee Date: 2018-04-19T20:47:28Z [HELIX-699] Compare InstanceConfigs using their IDs in RoutingTable A possible race condition was causing a NPE on InstanceConfig.getHostName(). Instead of comparing hostnames and ports, we compare IDs, which are supposed to be concatenation of instance name, hostname, and port anyways and should always be set. > Compare InstanceConfigs using their IDs in RoutingTable > --- > > Key: HELIX-699 > URL: https://issues.apache.org/jira/browse/HELIX-699 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > A possible race condition was causing a NPE on InstanceConfig.getHostName(). > Instead of comparing hostnames and ports, we compare IDs, which are supposed > to be concatenation of instance name, hostname, and port anyways and should > always be set. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-698) Add periodic refresh to RoutingTableProvider
[ https://issues.apache.org/jira/browse/HELIX-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444758#comment-16444758 ] ASF GitHub Bot commented on HELIX-698: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/187 [HELIX-698] Add periodic refresh to RoutingTableProvider There have been incidents where RoutingTableProvider was not getting a proper refresh potentially due to the lag in ZKClient CallbackHandler or connectivity issues. This addition of periodic refresh avoids cases where RoutingTableProvider is severely delayed by initiating periodic refreshes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix periodicRefresh Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/187.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #187 commit efc1e81c24c23c4dfc61c85c14708533d30b032c Author: Hunter Lee Date: 2018-04-19T20:42:37Z [HELIX-698] Add periodic refresh to RoutingTableProvider There have been incidents where RoutingTableProvider was not getting a proper refresh potentially due to the lag in ZKClient CallbackHandler or connectivity issues. This addition of periodic refresh avoids cases where RoutingTableProvider is severely delayed by initiating periodic refreshes. > Add periodic refresh to RoutingTableProvider > - > > Key: HELIX-698 > URL: https://issues.apache.org/jira/browse/HELIX-698 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > There have been incidents where RoutingTableProvider was not getting a proper > refresh potentially due to the lag in ZKClient CallbackHandler or > connectivity issues. This addition of periodic refresh avoids cases where > RoutingTableProvider is severely delayed by initiating periodic refreshes. 
[jira] [Commented] (HELIX-697) Add cluster level metrics in ClusterStatusMonitor
[ https://issues.apache.org/jira/browse/HELIX-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444752#comment-16444752 ] ASF GitHub Bot commented on HELIX-697: -- GitHub user narendly opened a pull request: https://github.com/apache/helix/pull/186 [HELIX-697] Add cluster level metrics in ClusterStatusMonitor You can merge this pull request into a Git repository by running: $ git pull https://github.com/narendly/helix clusterLevelMetrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/186.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #186 commit e1faf2404c3bb74aab7c402d76246b41af74fd16 Author: Hunter Lee Date: 2018-04-19T20:33:54Z [HELIX-697] Add cluster level metrics in ClusterStatusMonitor > Add cluster level metrics in ClusterStatusMonitor > - > > Key: HELIX-697 > URL: https://issues.apache.org/jira/browse/HELIX-697 > Project: Apache Helix > Issue Type: Improvement >Reporter: Hunter L >Priority: Major > > Add cluster level metrics in ClusterStatusMonitor -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-690) Batch message should not share same NotificationContext object to update CurrentState
[ https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444513#comment-16444513 ]

ASF GitHub Bot commented on HELIX-690:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/181

> Batch message should not share same NotificationContext object to update CurrentState
> -------------------------------------------------------------------------------------
>
>                 Key: HELIX-690
>                 URL: https://issues.apache.org/jira/browse/HELIX-690
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> Currently batch message has bugs:
> 1. Batch message triggers a lot of duplicated state transition messages
> sent from the controller, resulting in "state does not match" errors on the
> participant side. This further creates a lot of ERROR znodes in ZK, which adds
> to both the read and write workload on participant and controller.
> 2. We see a lot of concurrent update exceptions as well:
> {noformat}
> 9909348:[2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> 9909349-java.util.ConcurrentModificationException
> 9909350- at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
> 9909351- at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
> 9909352- at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
> 9909353- at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
> 9909354- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
> 9909355- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
> 9909356- at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
> 9909357- at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
> 9909358- at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> The above two errors result from the fact that in HelixTaskExecutor, all
> HelixTask objects from the same batch of messages share the same
> changeContext object. For a batch message, HelixTask creates a current state
> update map to record current state updates, which results in a race
> condition in current state recording. Because of this bug, it is common that
> a resource's current state is changed on the participant side but is
> not updated in ZK; after the message is removed, the controller still thinks
> that the state transition is not finished and sends a duplicated state
> transition message.
>
> The error situation is only triggered when the load is high, so it is not
> covered by our unit / e2e tests.
> To fix the issue, we should create deep copies of the NotificationContext object
> for each HelixTask in HelixTaskExecutor. I tried this fix using large data
> sets, and it worked.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-695) Add Helix Manager listener for new connection notification
[ https://issues.apache.org/jira/browse/HELIX-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1692#comment-1692 ] ASF GitHub Bot commented on HELIX-695: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/182 > Add Helix Manager listener for new connection notification > -- > > Key: HELIX-695 > URL: https://issues.apache.org/jira/browse/HELIX-695 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Currently HelixManager is not notifying state listener about connection > establishment. Adding this notification is useful since HelixManager supports > get ZkClient method and when connection is re-established, ZkClient is newly > created and users who used get method to extract client should be notified > and refresh their client. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-695) Add Helix Manager listener for new connection notification
[ https://issues.apache.org/jira/browse/HELIX-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439730#comment-16439730 ] ASF GitHub Bot commented on HELIX-695: -- GitHub user zhan849 opened a pull request: https://github.com/apache/helix/pull/182 [HELIX-695] add helix manager listener for new connection notification In this PR I added invocation and related tests of `stateListener.onConnected()` method in ZkHelixManager when it is connected. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhan849/helix harry/helix-manager-onconnected Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/182.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #182 commit 65e84713503437c542e545abd521c2ba6d26 Author: Harry Zhang Date: 2018-04-16T17:05:30Z [HELIX-695] add helix manager listener for new connection notification > Add Helix Manager listener for new connection notification > -- > > Key: HELIX-695 > URL: https://issues.apache.org/jira/browse/HELIX-695 > Project: Apache Helix > Issue Type: Task >Reporter: Hao Zhang >Priority: Major > > Currently HelixManager is not notifying state listener about connection > establishment. Adding this notification is useful since HelixManager supports > get ZkClient method and when connection is re-established, ZkClient is newly > created and users who used get method to extract client should be notified > and refresh their client. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HELIX-690) Batch message should not share same NotificationContext object to update CurrentState
[ https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439716#comment-16439716 ]

ASF GitHub Bot commented on HELIX-690:
--------------------------------------

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/181

    [HELIX-690] batch message execution should not share same context

    In this PR, I added deep copy methods to NotificationContext so that, when processing messages in batch, different threads do not share the same notification context. With this change, each thread processing BatchMessages has its own current state delta to work on, so current states won't be messed up.

    Also modified some logs to make them more useful when debugging.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/batch-msg-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #181

commit bb7751b0f52aadcf04b7813fa3e99c8e266a3d0b
Author: Harry Zhang
Date:   2018-04-16T16:55:43Z

    [HELIX-690] batch message execution should not share same context

> Batch message should not share same NotificationContext object to update CurrentState
> -------------------------------------------------------------------------------------
>
>                 Key: HELIX-690
>                 URL: https://issues.apache.org/jira/browse/HELIX-690
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> Currently batch message has bugs:
> 1. Batch message triggers a lot of duplicated state transition messages
> sent from the controller, resulting in "state does not match" errors on the
> participant side. This further creates a lot of ERROR znodes in ZK, which adds
> to both the read and write workload on participant and controller.
> 2. We see a lot of concurrent update exceptions as well:
> {noformat}
> 9909348:[2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> 9909349-java.util.ConcurrentModificationException
> 9909350- at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
> 9909351- at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
> 9909352- at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
> 9909353- at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
> 9909354- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
> 9909355- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
> 9909356- at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
> 9909357- at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
> 9909358- at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> The above two errors result from the fact that in HelixTaskExecutor, all
> HelixTask objects from the same batch of messages share the same
> changeContext object. For a batch message, HelixTask creates a current state
> update map to record current state updates, which results in a race
> condition in current state recording. Because of this bug, it is common that
> a resource's current state is changed on the participant side but is
> not updated in ZK; after the message is removed, the controller still thinks
> that the state transition is not finished and sends a duplicated state
> transition message.
>
> The error situation is only triggered when the load is high, so it is not
> covered by our unit / e2e tests.
> To fix the issue, we should create deep copies of the NotificationContext object
> for each HelixTask in HelixTaskExecutor. I tried this fix using large data
> sets, and it worked.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
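The fix described for HELIX-690, giving each task its own copy of the context instead of one shared object, can be sketched as follows. The Context class, its stateDelta map, and the method names are hypothetical simplifications; the real NotificationContext and its deep copy methods are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ContextCopySketch {
  /** Hypothetical stand-in for a notification context carrying a state delta. */
  static final class Context {
    final Map<String, String> stateDelta = new TreeMap<>();

    Context deepCopy() {
      Context copy = new Context();
      // Keys and values are immutable Strings, so copying the entries
      // yields a fully independent map.
      copy.stateDelta.putAll(stateDelta);
      return copy;
    }
  }

  // One private context per task, so no two worker threads iterate or
  // mutate the same TreeMap concurrently.
  static List<Context> contextsForBatch(Context shared, int taskCount) {
    List<Context> perTask = new ArrayList<>();
    for (int i = 0; i < taskCount; i++) {
      perTask.add(shared.deepCopy());
    }
    return perTask;
  }

  public static void main(String[] args) {
    Context shared = new Context();
    shared.stateDelta.put("partition_0", "OFFLINE");
    List<Context> contexts = contextsForBatch(shared, 2);
    contexts.get(0).stateDelta.put("partition_1", "ONLINE");
    System.out.println(contexts.get(1).stateDelta.containsKey("partition_1"));
  }
}
```

Copying trades a little memory per message for the absence of shared mutable state, which avoids the ConcurrentModificationException in the trace above without adding locks to the message handling path.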
[jira] [Commented] (HELIX-689) Controller message cleanup is spitting too many logs
[ https://issues.apache.org/jira/browse/HELIX-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431109#comment-16431109 ]

ASF GitHub Bot commented on HELIX-689:

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/173

> Controller message cleanup is spitting too many logs
>
> Key: HELIX-689
> URL: https://issues.apache.org/jira/browse/HELIX-689
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> Currently we print out an error log when we fail to remove messages. However, due to a ZK client limitation, we print logs even when the message is already deleted, which should not be regarded as a failure.
> We need to clean up the logging and only print a log when there is a real error.
[jira] [Commented] (HELIX-692) Use map instead of set in controller's message cleanup logic
[ https://issues.apache.org/jira/browse/HELIX-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431108#comment-16431108 ]

ASF GitHub Bot commented on HELIX-692:

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/175

> Use map instead of set in controller's message cleanup logic
>
> Key: HELIX-692
> URL: https://issues.apache.org/jira/browse/HELIX-692
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> This is to avoid duplicate cleanups of the same message: under batch message mode, we store the same message under all resources, which causes extra deletion API calls for the same message.
[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig
[ https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431104#comment-16431104 ]

ASF GitHub Bot commented on HELIX-691:

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/176

> Allow users to update InstanceConfig
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.
[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig
[ https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431069#comment-16431069 ]

ASF GitHub Bot commented on HELIX-691:

GitHub user narendly opened a pull request:

    https://github.com/apache/helix/pull/176

    [HELIX-691] Allow users to update InstanceConfig

    In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/narendly/helix instConfig2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/176.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #176

commit 72d52484716bf2f90323f141a478893fdd0843f1
Author: narendly
Date: 2018-04-09T19:04:26Z

    [HELIX-691] Allow users to update InstanceConfig

    In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.

> Allow users to update InstanceConfig
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.
[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig
[ https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431061#comment-16431061 ]

ASF GitHub Bot commented on HELIX-691:

Github user narendly closed the pull request at:

    https://github.com/apache/helix/pull/174

> Allow users to update InstanceConfig
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.
[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig
[ https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430971#comment-16430971 ]

ASF GitHub Bot commented on HELIX-691:

Github user zhan849 commented on a diff in the pull request:

    https://github.com/apache/helix/pull/174#discussion_r180175528

--- Diff: helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/InstanceAccessor.java ---
@@ -223,60 +224,60 @@ public Response updateInstance(@PathParam("clusterId") String clusterId,
       }

       switch (cmd) {
-      case enable:
-        admin.enableInstance(clusterId, instanceName, true);
-        break;
-      case disable:
-        admin.enableInstance(clusterId, instanceName, false);
-        break;
-      case reset:
-        if (!validInstance(node, instanceName)) {
-          return badRequest("Instance names are not match!");
-        }
-        admin.resetPartition(clusterId, instanceName,
-            node.get(InstanceProperties.resource.name()).toString(), (List<String>) OBJECT_MAPPER
-                .readValue(node.get(InstanceProperties.partitions.name()).toString(),
-                    OBJECT_MAPPER.getTypeFactory()
-                        .constructCollectionType(List.class, String.class)));
-        break;
-      case addInstanceTag:
-        if (!validInstance(node, instanceName)) {
-          return badRequest("Instance names are not match!");
-        }
-        for (String tag : (List<String>) OBJECT_MAPPER
-            .readValue(node.get(InstanceProperties.instanceTags.name()).toString(),
-                OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, String.class))) {
-          admin.addInstanceTag(clusterId, instanceName, tag);
-        }
-        break;
-      case removeInstanceTag:
-        if (!validInstance(node, instanceName)) {
-          return badRequest("Instance names are not match!");
-        }
-        for (String tag : (List<String>) OBJECT_MAPPER
-            .readValue(node.get(InstanceProperties.instanceTags.name()).toString(),
-                OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, String.class))) {
-          admin.removeInstanceTag(clusterId, instanceName, tag);
-        }
-        break;
-      case enablePartitions:
-        admin.enablePartition(true, clusterId, instanceName,
-            node.get(InstanceProperties.resource.name()).getTextValue(),
-            (List<String>) OBJECT_MAPPER
-                .readValue(node.get(InstanceProperties.partitions.name()).toString(),
-                    OBJECT_MAPPER.getTypeFactory()
-                        .constructCollectionType(List.class, String.class)));
-        break;
-      case disablePartitions:
-        admin.enablePartition(false, clusterId, instanceName,
-            node.get(InstanceProperties.resource.name()).getTextValue(),
-            (List<String>) OBJECT_MAPPER
-                .readValue(node.get(InstanceProperties.partitions.name()).toString(),
-                    OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, String.class)));
-        break;
-      default:
-        _logger.error("Unsupported command :" + command);
-        return badRequest("Unsupported command :" + command);
+        case enable:
--- End diff --

Helix's formatter does not indent case; could you please revert it? Same for other places.

> Allow users to update InstanceConfig
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.
[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig
[ https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430970#comment-16430970 ]

ASF GitHub Bot commented on HELIX-691:

Github user zhan849 commented on a diff in the pull request:

    https://github.com/apache/helix/pull/174#discussion_r180178181

--- Diff: helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/InstanceAccessor.java ---
@@ -315,26 +316,27 @@ public Response getInstanceConfig(@PathParam("clusterId") String clusterId,
     return notFound();
   }

-  @PUT
+  @POST
--- End diff --

PUT is for "set" and POST is for "patch"; I'd suggest we keep both. @dasahcc thoughts?

> Allow users to update InstanceConfig
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.
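The "set" versus "patch" distinction the reviewer draws above can be sketched on plain maps. This is only an illustration of the two semantics under the reviewer's convention; the method names and config keys are hypothetical and are not taken from helix-rest.

```java
import java.util.HashMap;
import java.util.Map;

final class ConfigUpdateSketch {
  // "set": the stored config becomes exactly what the caller sent.
  static Map<String, String> set(Map<String, String> current, Map<String, String> incoming) {
    return new HashMap<>(incoming);
  }

  // "patch": only the keys the caller sent change; all other keys are kept.
  static Map<String, String> patch(Map<String, String> current, Map<String, String> delta) {
    Map<String, String> merged = new HashMap<>(current);
    merged.putAll(delta);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> current = new HashMap<>();
    current.put("host", "host1");      // hypothetical config keys
    current.put("enabled", "true");

    Map<String, String> request = new HashMap<>();
    request.put("enabled", "false");

    System.out.println("set keeps " + set(current, request).size() + " key(s)");
    System.out.println("patch keeps " + patch(current, request).size() + " key(s)");
  }
}
```

With "set", a caller that omits a key drops it from the stored config; with "patch", omitted keys survive, which is presumably why the reviewer wants both endpoints available.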
[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig
[ https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430969#comment-16430969 ]

ASF GitHub Bot commented on HELIX-691:

Github user zhan849 commented on a diff in the pull request:

    https://github.com/apache/helix/pull/174#discussion_r180174778

--- Diff: helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/ResourceAccessor.java ---
@@ -84,10 +96,66 @@ public Response getResources(@PathParam("clusterId") String clusterId) {
     return JSONRepresentation(root);
   }

+  /**
--- End diff --

Partition health related changes are not part of this PR (allow user to change instance config); can we file a separate issue for them?

> Allow users to update InstanceConfig
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.
[jira] [Commented] (HELIX-692) Use map instead of set in controller's message cleanup logic
[ https://issues.apache.org/jira/browse/HELIX-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429130#comment-16429130 ]

ASF GitHub Bot commented on HELIX-692:

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/175

    [HELIX-692] use map instead of list to avoid deleting redundant message during cleanup

    Currently, in MessageGenerationPhase, we use a list to store the messages to GC. However, pending messages are stored per resource/partition/instance, and under batch message mode the same message is stored once for each partition in the batch, so we end up cleaning up the same message many times. This RB changes the list to a map to avoid redundant cleanup.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/HELIX-692

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/175.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #175

commit ce7d2e9d275e1403375edd94d63afa94ea1a2234
Author: Harry Zhang
Date: 2018-04-06T23:27:02Z

    [HELIX-692] use map instead of list to avoid deleting redundant message during cleanup

> Use map instead of set in controller's message cleanup logic
>
> Key: HELIX-692
> URL: https://issues.apache.org/jira/browse/HELIX-692
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> This is to avoid duplicate cleanups of the same message: under batch message mode, we store the same message under all resources, which causes extra deletion API calls for the same message.
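The list-to-map change described in the pull request above can be sketched as follows. This is a minimal illustration, assuming a simplified `PendingMessage` type with only an ID and a partition; the class and method names are hypothetical and do not come from MessageGenerationPhase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MessageCleanupSketch {
  // Hypothetical, simplified message: an ID plus the partition it was found under.
  static final class PendingMessage {
    final String msgId;
    final String partition;

    PendingMessage(String msgId, String partition) {
      this.msgId = msgId;
      this.partition = partition;
    }
  }

  // Keying by message ID collapses the per-partition copies of one batch message,
  // so each message is scheduled for deletion once.
  static Map<String, PendingMessage> dedupeById(List<PendingMessage> pending) {
    Map<String, PendingMessage> byId = new LinkedHashMap<>();
    for (PendingMessage m : pending) {
      byId.putIfAbsent(m.msgId, m);
    }
    return byId;
  }

  public static void main(String[] args) {
    List<PendingMessage> pending = new ArrayList<>();
    pending.add(new PendingMessage("batch-1", "p0"));
    pending.add(new PendingMessage("batch-1", "p1"));
    pending.add(new PendingMessage("batch-1", "p2"));
    pending.add(new PendingMessage("single-7", "p3"));

    System.out.println(pending.size() + " entries collected, "
        + dedupeById(pending).size() + " deletions needed");
  }
}
```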
[jira] [Commented] (HELIX-691) Allow users to update InstanceConfig
[ https://issues.apache.org/jira/browse/HELIX-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427722#comment-16427722 ]

ASF GitHub Bot commented on HELIX-691:

GitHub user narendly opened a pull request:

    https://github.com/apache/helix/pull/174

    [HELIX-691] Allow users to update InstanceConfig

    In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/narendly/helix instConfig

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/174.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #174

commit 1ebdc45194dddecb675362e812aa795c876c9f6a
Author: narendly
Date: 2018-04-05T23:00:52Z

    [HELIX-691] Allow users to update InstanceConfig

    In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.

> Allow users to update InstanceConfig
>
> Key: HELIX-691
> URL: https://issues.apache.org/jira/browse/HELIX-691
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> In helix-rest, we provide a method in InstanceAccessor, updateInstanceConfig, that updates the instance's config through a POST call.
[jira] [Commented] (HELIX-689) Controller message cleanup is spitting too many logs
[ https://issues.apache.org/jira/browse/HELIX-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424634#comment-16424634 ]

ASF GitHub Bot commented on HELIX-689:

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/173

    [HELIX-689] remove redundant logs from zkclient

    Currently, in controller message cleanup, we print 2 log lines when a message does not exist, which is redundant. In this PR, I removed the warning message from the controller and added an error message in zkclient that is printed only when there is a real error (an exception from below). If we fail to delete a ZNode because the znode does not exist, we no longer print a message, except in debug mode.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/ctl-msg-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/173.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #173

commit d96a40caf19efffed3939b6dd8d9efe40734ec15
Author: Harry Zhang
Date: 2018-04-03T21:22:53Z

    [HELIX-689] remove redundant logs from zkclient

> Controller message cleanup is spitting too many logs
>
> Key: HELIX-689
> URL: https://issues.apache.org/jira/browse/HELIX-689
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> Currently we print out an error log when we fail to remove messages. However, due to a ZK client limitation, we print logs even when the message is already deleted, which should not be regarded as a failure.
> We need to clean up the logging and only print a log when there is a real error.
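The logging rule described in the pull request above (stay quiet when the node is already gone, log only real errors) can be sketched like this. The `Store` interface, `NoNodeException`, and `Outcome` enum are hypothetical stand-ins for illustration and are not the zkclient API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class QuietDeleteSketch {
  enum Outcome { DELETED, ALREADY_GONE, FAILED }

  // Hypothetical exception meaning "the node was not there"; it stands in for
  // whatever "no node" signal the real client gives.
  static final class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("no node: " + path);
    }
  }

  // Hypothetical minimal store interface, not a real ZooKeeper client API.
  interface Store {
    void delete(String path) throws Exception;
  }

  static Outcome delete(Store store, String path, List<String> errorLog) {
    try {
      store.delete(path);
      return Outcome.DELETED;
    } catch (NoNodeException e) {
      // The node is already gone, which is the desired end state: no error log.
      return Outcome.ALREADY_GONE;
    } catch (Exception e) {
      // Anything else is a real error and is the only case that gets logged.
      errorLog.add("Failed to delete " + path + ": " + e.getMessage());
      return Outcome.FAILED;
    }
  }

  public static void main(String[] args) {
    Set<String> nodes = new HashSet<>();
    nodes.add("/messages/m1");
    Store store = path -> {
      if (!nodes.remove(path)) {
        throw new NoNodeException(path);
      }
    };
    List<String> errorLog = new ArrayList<>();
    System.out.println(delete(store, "/messages/m1", errorLog));
    System.out.println(delete(store, "/messages/m1", errorLog));
    System.out.println("error lines: " + errorLog.size());
  }
}
```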
[jira] [Commented] (HELIX-688) Add method that returns start time of the most recent task scheduled
[ https://issues.apache.org/jira/browse/HELIX-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418292#comment-16418292 ]

ASF GitHub Bot commented on HELIX-688:

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/170

> Add method that returns start time of the most recent task scheduled
>
> Key: HELIX-688
> URL: https://issues.apache.org/jira/browse/HELIX-688
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> getLastScheduledTaskTimestamp returns the timestamp for the start time of the task that was scheduled last. Clients of Task Framework may use this API against their time to completion metric to determine if a given workflow/job/task is stuck.
[jira] [Commented] (HELIX-688) Add method that returns start time of the most recent task scheduled
[ https://issues.apache.org/jira/browse/HELIX-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416462#comment-16416462 ]

ASF GitHub Bot commented on HELIX-688:

GitHub user narendly opened a pull request:

    https://github.com/apache/helix/pull/170

    [HELIX-688] Add method that returns start time of the most recent task scheduled

    getLastScheduledTaskTimestamp returns the timestamp for the start time of the task that was scheduled last. Clients of Task Framework may use this API against their time to completion metric to determine if a given workflow/job/task is stuck.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/narendly/helix lasttask

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/170.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #170

commit 13b131f7c2ec4c40e92beecb2552969157bd4882
Author: narendly
Date: 2018-03-27T23:25:22Z

    [HELIX-688] Add method that returns start time of the most recent task scheduled

    getLastScheduledTaskTimestamp returns the timestamp for the start time of the task that was scheduled last. Clients of Task Framework may use this API against their time to completion metric to determine if a given workflow/job/task is stuck.

> Add method that returns start time of the most recent task scheduled
>
> Key: HELIX-688
> URL: https://issues.apache.org/jira/browse/HELIX-688
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: Hunter L
> Priority: Major
>
> getLastScheduledTaskTimestamp returns the timestamp for the start time of the task that was scheduled last. Clients of Task Framework may use this API against their time to completion metric to determine if a given workflow/job/task is stuck.
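How a client might combine such a timestamp with its own time-to-completion expectation can be sketched as below. This is an assumption-laden illustration: the helper name `looksStuck`, the use of epoch milliseconds, and the treatment of a non-positive timestamp as "nothing scheduled yet" are all hypothetical and not taken from the Task Framework API.

```java
import java.time.Duration;

final class StuckWorkflowSketch {
  // Returns true when the most recently scheduled task started longer ago than the
  // client's expected time to completion. Assumes epoch milliseconds and treats a
  // non-positive timestamp as "nothing scheduled yet" (both are assumptions).
  static boolean looksStuck(long lastScheduledTaskTimestampMs, long nowMs,
      Duration expectedTimeToCompletion) {
    if (lastScheduledTaskTimestampMs <= 0) {
      return false;
    }
    return nowMs - lastScheduledTaskTimestampMs > expectedTimeToCompletion.toMillis();
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long twoHoursAgo = now - Duration.ofHours(2).toMillis();
    System.out.println("stuck? " + looksStuck(twoHoursAgo, now, Duration.ofMinutes(30)));
  }
}
```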
[jira] [Commented] (HELIX-683) Clean monitoring cache upon helix controller enable monitoring
[ https://issues.apache.org/jira/browse/HELIX-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414712#comment-16414712 ]

ASF GitHub Bot commented on HELIX-683:

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/162

> Clean monitoring cache upon helix controller enable monitoring
>
> Key: HELIX-683
> URL: https://issues.apache.org/jira/browse/HELIX-683
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> We found a bug in cluster status reporting, specifically in the partition masterless duration. The root cause is that the duration is calculated based on the controller cache, and currently this cache is not cleaned when leadership changes. As a result, if controller A starts a mastership handoff but is interrupted, the start time is kept in the cache until the next mastership handoff on the same partition happens. The later handoff duration is then calculated from the stale start time, which can make it very large.
> To fix it, we might consider cleaning the cache when leadership changes.
[jira] [Commented] (HELIX-683) Clean monitoring cache upon helix controller enable monitoring
[ https://issues.apache.org/jira/browse/HELIX-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414392#comment-16414392 ]

ASF GitHub Bot commented on HELIX-683:

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/162

    [HELIX-683] clean monitoring cache upon helix controller enable monitoring

    In this PR I added methods to clear monitoring records in the cache when we enable cluster status monitoring. I also added tests that reproduce the following situation, which causes a metrics reporting problem: a resource loses its top state, the controller loses leadership, the resource regains its top state, and the controller regains leadership.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/controller-monitor-cache-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/162.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #162

commit 373da77547fa1ea4a39c760e80da75e9d453d4f5
Author: Harry Zhang
Date: 2018-03-26T19:14:07Z

    [HELIX-683] clean monitoring cache upon helix controller enable monitoring

> Clean monitoring cache upon helix controller enable monitoring
>
> Key: HELIX-683
> URL: https://issues.apache.org/jira/browse/HELIX-683
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> We found a bug in cluster status reporting, specifically in the partition masterless duration. The root cause is that the duration is calculated based on the controller cache, and currently this cache is not cleaned when leadership changes. As a result, if controller A starts a mastership handoff but is interrupted, the start time is kept in the cache until the next mastership handoff on the same partition happens. The later handoff duration is then calculated from the stale start time, which can make it very large.
> To fix it, we might consider cleaning the cache when leadership changes.
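The stale start time problem described in this issue, and the effect of clearing the cache when leadership is regained, can be reproduced with a small sketch. The class and method names here are hypothetical; the real controller cache and monitoring code are more involved.

```java
import java.util.HashMap;
import java.util.Map;

final class HandoffCacheSketch {
  // partition -> time (ms) at which the top state was first seen missing
  private final Map<String, Long> missingTopStateStart = new HashMap<>();

  void onTopStateMissing(String partition, long nowMs) {
    // Only the first observation counts as the start of the handoff.
    missingTopStateStart.putIfAbsent(partition, nowMs);
  }

  // Returns the handoff duration in ms, or -1 if no start time was recorded.
  long onTopStateRecovered(String partition, long nowMs) {
    Long start = missingTopStateStart.remove(partition);
    return start == null ? -1 : nowMs - start;
  }

  // The proposed fix: drop records left over from a previous leadership period.
  void onLeadershipAcquired() {
    missingTopStateStart.clear();
  }

  // Scenario from the issue: a handoff starts, leadership is lost before the
  // recovery is observed, and a later handoff on the same partition is measured.
  static long laterHandoffDuration(boolean clearOnLeadership) {
    HandoffCacheSketch cache = new HandoffCacheSketch();
    cache.onTopStateMissing("p0", 1_000L);   // first handoff starts
    // ... leadership lost; the recovery of p0 is never observed here ...
    if (clearOnLeadership) {
      cache.onLeadershipAcquired();          // leadership regained, cache cleared
    }
    cache.onTopStateMissing("p0", 1_000_000L);          // later handoff starts
    return cache.onTopStateRecovered("p0", 1_000_500L); // later handoff ends
  }

  public static void main(String[] args) {
    System.out.println("without cleanup: " + laterHandoffDuration(false) + " ms");
    System.out.println("with cleanup:    " + laterHandoffDuration(true) + " ms");
  }
}
```

Without the cleanup, the later handoff is measured from the stale start time and comes out far larger than the 500 ms it actually took.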
[jira] [Commented] (HELIX-681) Participant should not fail state transition on fail to delete / relay message
[ https://issues.apache.org/jira/browse/HELIX-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412891#comment-16412891 ]

ASF GitHub Bot commented on HELIX-681:

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/152

> Participant should not fail state transition on fail to delete / relay message
>
> Key: HELIX-681
> URL: https://issues.apache.org/jira/browse/HELIX-681
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> Currently we have a general try-catch block in HelixTask and HelixTaskExecutor, which fails the state transition upon any exception thrown from the state transition routine. However, there are at least the following cases in which the state transition should be considered successful:
> * When we fail to delete the message after successfully handling it and updating current state. The state transition is already complete, and current state is consistent between the participant and ZK.
> * When we fail to send out a relay message. Relay messages provide only best-effort delivery, which has nothing to do with the state transition's result. If a relay message fails, the controller will resend the message, which ensures correctness.
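The two cases listed in the issue above suggest separating the steps that decide the transition's result from the follow-up steps that are best effort. A minimal sketch of that split, with hypothetical names (`Step`, `handle`, `runBestEffort`) that are not taken from HelixTask:

```java
import java.util.ArrayList;
import java.util.List;

final class BestEffortCleanupSketch {
  enum Result { COMPLETED, FAILED }

  // Hypothetical unit of work that may throw.
  interface Step {
    void run() throws Exception;
  }

  static Result handle(Step transition, Step updateCurrentState,
      Step deleteMessage, Step relayMessage, List<String> warnings) {
    try {
      // Only these two steps decide whether the state transition succeeded.
      transition.run();
      updateCurrentState.run();
    } catch (Exception e) {
      return Result.FAILED;
    }
    // Follow-up steps are best effort: a failure is recorded, not propagated.
    runBestEffort("delete message", deleteMessage, warnings);
    runBestEffort("relay message", relayMessage, warnings);
    return Result.COMPLETED;
  }

  private static void runBestEffort(String name, Step step, List<String> warnings) {
    try {
      step.run();
    } catch (Exception e) {
      warnings.add("Failed to " + name + ": " + e.getMessage());
    }
  }

  public static void main(String[] args) {
    List<String> warnings = new ArrayList<>();
    Result result = handle(
        () -> { },
        () -> { },
        () -> { throw new Exception("message node busy"); },
        () -> { },
        warnings);
    System.out.println(result + " with " + warnings.size() + " warning(s)");
  }
}
```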