[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111240#comment-14111240 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

GitHub user brandtg opened a pull request:

    https://github.com/apache/helix/pull/2

    [HELIX-470] Netty-based IPC layer

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brandtg/helix master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/2.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2

commit f2475fa9a6123052fea2588cdd4e439ddc7af020
Author: Greg Brandt
Date:   2014-08-26T20:14:36Z

    [HELIX-470] Netty-based IPC layer

> Add performant IPC (Helix actors)
> ---------------------------------
>
>                 Key: HELIX-470
>                 URL: https://issues.apache.org/jira/browse/HELIX-470
>             Project: Apache Helix
>          Issue Type: Improvement
>          Components: helix-core
>    Affects Versions: 0.7.1, 0.6.4
>            Reporter: Greg Brandt
>
> Helix is missing a high-performance way to exchange messages among resource
> partitions, with a user-friendly API.
> Currently, the Helix messaging service relies on creating many nodes in
> ZooKeeper, which can lead to ZooKeeper outages if messages are sent too
> frequently.
> In order to avoid this, high-performance NIO-based {{HelixActors}} should be
> implemented (in rough accordance with the actor model). {{HelixActors}}
> exchange messages asynchronously without waiting for a response, and are
> partition/state-addressable.
> The API would look something like this:
> {code}
> public interface HelixActor<T> {
>   void send(Partition partition, String state, T message);
>   void register(String resource, HelixActorCallback<T> callback);
> }
> public interface HelixActorCallback<T> {
>   void onMessage(Partition partition, State state, T message);
> }
> {code}
> {{#send}} should likely support wildcards for partition number and state, or
> its method signature might need to be massaged a little bit for more
> flexibility. But that's the basic idea.
> Nothing is inferred about the format of the messages - the only metadata we
> need to be able to interpret is (1) partition name and (2) state. The user
> provides a codec to encode / decode messages, so it's nicer to implement
> {{HelixActor#send}} and {{HelixActorCallback#onMessage}}.
> {code}
> public interface HelixActorMessageCodec<T> {
>   byte[] encode(T message);
>   T decode(byte[] message);
> }
> {code}
> Actors should support somewhere around 100k to 1M messages per second. The
> Netty framework is a potential implementation candidate, but should be
> thoroughly evaluated w.r.t. performance.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
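To make the proposed codec and callback shape concrete, here is a minimal sketch in Java. It is not the Netty-based implementation from the pull request. `Utf8StringCodec`, `LoopbackCallback`, and `LoopbackActor` are hypothetical names introduced only for illustration, the partition and state are plain strings rather than Helix types, and the "transport" is an in-process loop that encodes, decodes, and delivers locally.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Codec interface as proposed in the issue, with the type parameter written out.
interface HelixActorMessageCodec<T> {
  byte[] encode(T message);
  T decode(byte[] message);
}

// Hypothetical codec for plain String payloads (not part of Helix).
final class Utf8StringCodec implements HelixActorMessageCodec<String> {
  @Override
  public byte[] encode(String message) {
    return message.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public String decode(byte[] message) {
    return new String(message, StandardCharsets.UTF_8);
  }
}

// Hypothetical callback; uses String for partition and state to stay self-contained.
interface LoopbackCallback<T> {
  void onMessage(String partition, String state, T message);
}

// Hypothetical in-process stand-in for the transport: every message is pushed
// through the codec in both directions before it reaches the callbacks.
final class LoopbackActor<T> {
  private final HelixActorMessageCodec<T> codec;
  private final List<LoopbackCallback<T>> callbacks = new ArrayList<>();

  LoopbackActor(HelixActorMessageCodec<T> codec) {
    this.codec = codec;
  }

  void register(LoopbackCallback<T> callback) {
    callbacks.add(callback);
  }

  void send(String partition, String state, T message) {
    byte[] wire = codec.encode(message);   // what a real transport would write
    T decoded = codec.decode(wire);        // what the receiving side would read
    for (LoopbackCallback<T> callback : callbacks) {
      callback.onMessage(partition, state, decoded);
    }
  }
}
```

With this shape the sender and receiver only ever see `T`; the byte-level format stays behind the codec, which matches the issue's point that only partition name and state need to be interpreted as metadata.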
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111314#comment-14111314 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16741557

    --- Diff: helix-ipc/LICENSE ---
    @@ -0,0 +1,273 @@
    +
    --- End diff --

    We do not need an extra LICENSE file here, since the main LICENSE file will be used when doing the packaging.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113954#comment-14113954 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user kanakb commented on the pull request:

    https://github.com/apache/helix/pull/2#issuecomment-53754601

    If no one has any objections, I plan to merge this for our 0.7.1 beta release so that we can increase collaboration on it and move things forward iteratively.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113974#comment-14113974 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16853293

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    Since we release Helix as per-module binary bundles, personally I think having specific NOTICE files makes more sense -- that way, people can pick and choose which modules to use without needing to worry about dependencies present in other modules.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113983#comment-14113983 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16853492

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    So is this going to be a separate source zip? If this will be under the parent umbrella of the main source, the NOTICE needs to be in the top-level directory [1].

    [1] http://www.apache.org/legal/src-headers.html
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113972#comment-14113972 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16853182

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    Forgot to comment on this one. Please move the content of this NOTICE file to the NOTICE in the top-level Helix directory; then you can remove this one. NOTICE files should be bundled into one when making releases, and just copying this into the main NOTICE will make packaging easier.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113997#comment-14113997 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16854005

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    @kanakb, sorry, it looks like Helix already puts a NOTICE file in each module that is released separately. I guess if we package this separately then the NOTICE for this module should stay, but I think the top-level NOTICE file should have a copy of the content. Is the content of each submodule's NOTICE file not copied to the top-level NOTICE file?
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113999#comment-14113999 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16854152

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    @hsaputra Yes, the top-level NOTICE is currently a superset of all submodule NOTICE files, though this is done manually.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113998#comment-14113998 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16854042

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    Ah, I think we have things somewhat backwards. Right now all of our submodules have LICENSE and NOTICE files, but I guess these only need to be at the top level? Helix has a single source release and per-module binary releases.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114012#comment-14114012 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16854782

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    In any case, we need to figure out what is right in general, and apply it to all our submodules. I have created https://issues.apache.org/jira/browse/HELIX-509 to track this work.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114027#comment-14114027 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16855212

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    Sounds good; for this PR we could keep the NOTICE file, to be in line with the other modules. @brandtg, could you add a copy of this module's NOTICE file to the top-level one, as for the other modules?
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114032#comment-14114032 ]

ASF GitHub Bot commented on HELIX-470:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/2#discussion_r16855377

    --- Diff: helix-ipc/NOTICE ---
    @@ -0,0 +1,33 @@
    +Apache Helix
    --- End diff --

    @hsaputra Greg is on vacation; I can take care of that during the merge.
[jira] [Commented] (HELIX-470) Add performant IPC (Helix actors)
[ https://issues.apache.org/jira/browse/HELIX-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114067#comment-14114067 ] ASF GitHub Bot commented on HELIX-470: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/2 > Add performant IPC (Helix actors) > - > > Key: HELIX-470 > URL: https://issues.apache.org/jira/browse/HELIX-470 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.7.1, 0.6.4 >Reporter: Greg Brandt > > Helix is missing a high-performance way to exchange messages among resource > partitions, with a user-friendly API. > Currently, the Helix messaging service relies on creating many nodes in > ZooKeeper, which can lead to ZooKeeper outages if messages are sent too > frequently. > In order to avoid this, high-performance NIO-based {{HelixActors}} should be > implemented (in rough accordance with the actor model). {{HelixActors}} > exchange messages asynchronously without waiting for a response, and are > partition/state-addressable. > The API would look something like this: > {code} > public interface HelixActor { > void send(Partition partition, String state, T message); > void register(String resource, HelixActorCallback callback); > } > public interface HelixActorCallback { > void onMessage(Partition partition, State state, T message); > } > {code} > {{#send}} should likely support wildcards for partition number and state, or > its method signature might need to be massaged a little bit for more > flexibility. But that's the basic idea. > Nothing is inferred about the format of the messages - the only metadata we > need to be able to interpret is (1) partition name and (2) state. The user > provides a codec to encode / decode messages, so it's nicer to implement > {{HelixActor#send}} and {{HelixActorCallback#onMessage}}. 
> {code} > public interface HelixActorMessageCodec { > byte[] encode(T message); > T decode(byte[] message); > } > {code} > Actors should support somewhere around 100k to 1M messages per second. The > Netty framework is a potential implementation candidate, but should be > thoroughly evaluated w.r.t. performance. -- This message was sent by Atlassian JIRA (v6.2#6252)
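For illustration, the codec idea in the issue description could be sketched as follows. This is a minimal sketch, not the code from the pull request: the type parameter `<T>` is an assumption (the archived text shows `T` without declaring it), and `StringMessageCodec` is a hypothetical implementation that encodes messages as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Generic form of the codec from the issue description. The <T> parameter is
// an assumption; the archived text uses T without declaring it.
interface HelixActorMessageCodec<T> {
  byte[] encode(T message);

  T decode(byte[] message);
}

// Hypothetical codec for plain String messages, encoded as UTF-8.
class StringMessageCodec implements HelixActorMessageCodec<String> {
  @Override
  public byte[] encode(String message) {
    return message.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public String decode(byte[] message) {
    return new String(message, StandardCharsets.UTF_8);
  }
}
```

With a codec like this, the transport only ever sees `byte[]` payloads, which matches the statement that nothing is inferred about the message format beyond partition name and state.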
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155861#comment-14155861 ] ASF GitHub Bot commented on HELIX-524: -- GitHub user dayzzz opened a pull request: https://github.com/apache/helix/pull/6 [HELIX-524] Add a getProgress to the Task interface [HELIX-524] Add a getProgress to the Task interface, this is very helpful for long running tasks, from which we know the status of a task and see if it's blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a task is finished You can merge this pull request into a Git repository by running: $ git pull https://github.com/dayzzz/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/6.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6 commit 5d1b27f81abc4c67301d864f122e82f5a0ce49c3 Author: Hongbo Zeng Date: 2014-10-02T00:19:49Z [HELIX-524] Add a getProgress to the Task interface > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156118#comment-14156118 ] ASF GitHub Bot commented on HELIX-524: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-57588016 I like this in general, but I have a couple questions/comments: * This is an interface change. Is there a way we can provide this functionality without requiring that all existing task implementations be rewritten? * What is the persistence story (if any)? Who calls getProgress? > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156204#comment-14156204 ] ASF GitHub Bot commented on HELIX-524: -- Github user brandtg commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-57596146 For the persistence story, what about putting a new health report: `/STAND_ALONE/INSTANCES/{instanceName}/HEALTHREPORT/taskStatus`? Another thing to consider is that adding `#getProgress` to the task interface requires users to be able to intelligently report task status. I suspect that in practice someone might be annoyed at this extra responsibility, and provide dummy numbers (sounds stupid, but I've seen it before). Maybe a better approach would be to monitor things we know a priori about a task (e.g. lifetime) and provide tools to inspect ones that seem stuck (e.g. task/partition-addressable stack trace)? > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156218#comment-14156218 ] ASF GitHub Bot commented on HELIX-524: -- Github user brandtg commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-57597306 Wait, that stuff's gone now? Ok never mind. Is persistence of the state of a running task completely necessary? (Like would JMX suffice?) > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157419#comment-14157419 ] ASF GitHub Bot commented on HELIX-524: -- Github user dayzzz commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-57728028 * About the interface change, IMHO, this is a natural part of the task interface. If the existing customers want to update, adding the support is encouraged. We can have another interface for monitoring the progress, but that doesn't seem to be a nice design. * The task framework should call the progress interface and expose the status somehow, discussed below. * Persistence story: ZK is a good place if we only want to record the final result. If we want to expose the progress as a task runs, putting these periodic status updates in ZK is not an option due to the heavy traffic; generally, ZK is not a good place for reporting and monitoring service status. I also discussed this with Jason, and we thought about inGraph (which is not an option for open source), Kafka or Riemann. (Greg, JMX sounds like a good idea.) Without a conclusion on where to put these status stats, I agree that the progress interface is not of much value. For the first step, it would be good enough if we can monitor the progress. What do you guys think? * For the bogus progress number, it's the customers themselves who need to track the progress; if they want to see the bogus value, I'm fine with that :). The controller should be set not to act on the bogus value by the customers. > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked.
The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157483#comment-14157483 ] ASF GitHub Bot commented on HELIX-524: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-57730714 Why not have an abstract ProgressReportingTask that implements Task and includes all the JMX persistence, and then have tasks extend it and implement getProgress()? That avoids the interface change, but also allows "smarter" tasks to let their progress be known. > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
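The abstract-class proposal above might look roughly like the sketch below. Several things here are assumptions made to keep it self-contained: `Task` is a one-method stand-in for the real Helix interface, `CountingTask` is a hypothetical example task, and the JMX publishing step is swapped for a plain `DoubleConsumer` so the sketch runs without an MBean server.

```java
import java.util.function.DoubleConsumer;

// Stand-in for org.apache.helix.task.Task, reduced to what this sketch needs.
interface Task {
  void run();
}

// Hypothetical base class: tasks that can report progress extend it, while
// existing Task implementations stay untouched.
abstract class ProgressReportingTask implements Task {
  // Stand-in for the JMX publishing step mentioned in the comment above.
  private final DoubleConsumer reporter;

  protected ProgressReportingTask(DoubleConsumer reporter) {
    this.reporter = reporter;
  }

  // Fraction complete in [0.0, 1.0]; 1.0 means the task is finished.
  protected abstract double getProgress();

  // Subclasses call this at checkpoints; the value is clamped before publishing.
  protected final void publishProgress() {
    double clamped = Math.max(0.0, Math.min(1.0, getProgress()));
    reporter.accept(clamped);
  }
}

// Hypothetical task that processes a fixed number of items.
class CountingTask extends ProgressReportingTask {
  private final int total;
  private int done;

  CountingTask(int total, DoubleConsumer reporter) {
    super(reporter);
    this.total = total;
  }

  @Override
  protected double getProgress() {
    return total == 0 ? 1.0 : (double) done / total;
  }

  @Override
  public void run() {
    while (done < total) {
      done++;            // pretend one item of work was completed
      publishProgress(); // report after each item
    }
  }
}
```

In this shape the runner only ever calls `run()`, so it would not need to know whether a given task reports progress.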
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164230#comment-14164230 ] ASF GitHub Bot commented on HELIX-524: -- Github user dayzzz commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-58433144 Sorry for the delay. Are there any customers outside the Espresso team? If we are the only customer right now, it would not be too costly to update the implementation with the interface. Putting getProgress into the interface has the advantage that it naturally fits into TaskRunner: we don't need to do something like an "instanceof ProgressReportingTask" check before calling getProgress, or add a subclass of TaskRunner that is specific to ProgressReportingTask and calls getProgress. > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164469#comment-14164469 ] ASF GitHub Bot commented on HELIX-524: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-58448716 Since this is an open source project, the real answer to this question is "I don't know." We've made two public releases with this change, and so if there is anyone who has integrated or is planning to integrate with the task framework, they would need to make a change. If `ProgressReportingTask` handles the JMX reporting itself (it's an abstract class, not an interface), then `TaskRunner` wouldn't need to care about progress at all. And back to the "bogus" return value discussion, I think customers who don't have a good sense of progress shouldn't be required to change their integration just to implement `getProgress()`. > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164660#comment-14164660 ] ASF GitHub Bot commented on HELIX-524: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-58459371 Email from Kishore: ``` Apologies for throwing another idea into the mix. Why not have additional methods in the implementation that the agent can invoke on demand based on annotations? This extends the way we invoke methods on the state model. For example, we have @Transition(TO=MASTER, FROM=SLAVE) void fromSlaveTOMaster(Message m, NotificationContext ctx){ } Can we have something similar, such as @Method(name="getProgress") Response getProgress(Message m, NotificationContext ctx){ } Helix provides a generic way to invoke a method on a partition. I think this is more powerful, does not disturb any interfaces, and can be extended to do custom stuff. Also, these methods will not be invoked via ZK; instead, the controller can directly invoke the method on the participant. Feedback? ``` > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
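The annotation-based idea in the email above can be sketched with plain Java reflection. Everything named here is hypothetical: `RemoteMethod` stands in for the proposed `@Method` annotation (renamed to avoid clashing with `java.lang.reflect.Method`), `SampleTask` and `MethodInvoker` are illustrative, and the `Message` and `NotificationContext` parameters from the email are omitted to keep the sketch free of Helix dependencies.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical annotation mirroring the @Method(name="getProgress") idea.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface RemoteMethod {
  String name();
}

// Hypothetical task exposing one invocable method without implementing
// any progress-specific interface.
class SampleTask {
  private final int done = 3;
  private final int total = 4;

  @RemoteMethod(name = "getProgress")
  public double progress() {
    return (double) done / total;
  }
}

// Looks up a no-argument method by its annotated name and invokes it.
final class MethodInvoker {
  private MethodInvoker() {
  }

  static Object invoke(Object target, String name) throws ReflectiveOperationException {
    for (Method m : target.getClass().getDeclaredMethods()) {
      RemoteMethod annotation = m.getAnnotation(RemoteMethod.class);
      if (annotation != null && annotation.name().equals(name) && m.getParameterCount() == 0) {
        m.setAccessible(true);
        return m.invoke(target);
      }
    }
    throw new NoSuchMethodException("no @RemoteMethod named " + name);
  }
}
```

The lookup is by annotated name rather than Java method name, so new capabilities could be exposed later without touching any shared interface, which is the property the email argues for.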
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164661#comment-14164661 ] ASF GitHub Bot commented on HELIX-524: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-58459614 I think having a `@ProgressReporter` could be an elegant way to allow users to plug in additional functionality. Progress could be just the first thing a task could expose, and in the future if we think of others, we just need to add a new annotation. > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165886#comment-14165886 ] ASF GitHub Bot commented on HELIX-524: -- Github user zzhang5 commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-58587902 If we are not using ZK to invoke these methods, are we opening some kind of end-point e.g. via Netty or JMX on each participant? > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-524) add getProgress() to Task interface
[ https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167167#comment-14167167 ] ASF GitHub Bot commented on HELIX-524: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/6#issuecomment-58689390 Kishore from JIRA comments: ``` Yes, that's the idea. With helix-ipc this will be possible, right? This can be extended to write rebalancers that talk to nodes to get the high water mark to decide the new master. What do you think? ``` Jason from JIRA comments: ``` Yes. Helix-IPC will do the job. We can also extend the idea to get the high water mark, etc. The Helix task framework exposes the Task interface, so users are not implementing StateModel directly. How can we make this available to Task also? ``` > add getProgress() to Task interface > --- > > Key: HELIX-524 > URL: https://issues.apache.org/jira/browse/HELIX-524 > Project: Apache Helix > Issue Type: Improvement > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hongbo Zeng > Fix For: 0.6.5 > > > Add a getProgress to the Task interface, this is very helpful for long > running tasks, from which we know the status of a task and see if it's > blocked. The return value is a double, ranging from 0 to 1.0, 1.0 indicates a > task is finished -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-525) Drop a partition from resource ideal-state shall bring partition to initial state and then DROPPED state
[ https://issues.apache.org/jira/browse/HELIX-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177505#comment-14177505 ] ASF GitHub Bot commented on HELIX-525: -- GitHub user lei-xia opened a pull request: https://github.com/apache/helix/pull/7 [HELIX-525] Add integration tests to verify that dropping a partition from resource ... Add integration tests to verify that dropping a partition from resource ideal-state should bring partition to initial state and then DROPPED state (for AUTO, SEMI_AUTO, and CUSTOM modes). You can merge this pull request into a Git repository by running: $ git pull https://github.com/lei-xia/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/7.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7 commit 51b757b71693ec2f774035eac441dd51956e5862 Author: Lei Xia Date: 2014-10-20T21:09:16Z Add integration tests to verify that dropping a partition from resource ideal-state should bring partition to initial state and then DROPPED state (for AUTO, SEMI_AUTO, and CUSTOM modes). > Drop a partition from resource ideal-state shall bring partition to initial > state and then DROPPED state > > > Key: HELIX-525 > URL: https://issues.apache.org/jira/browse/HELIX-525 > Project: Apache Helix > Issue Type: Bug >Reporter: Zhen Zhang > Original Estimate: 24h > Remaining Estimate: 24h > > If we manually remove a partition from ideal-state, Helix should bring the > partition to initial state (e.g. OFFLINE) on all hosts and then to DROPPED > state. > - Add an integration test to verify this (for AUTO, SEMI_AUTO, and CUSTOM > modes) > - Fix it if not behave in the expected way -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-525) Drop a partition from resource ideal-state shall bring partition to initial state and then DROPPED state
[ https://issues.apache.org/jira/browse/HELIX-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178000#comment-14178000 ] ASF GitHub Bot commented on HELIX-525: -- Github user kanakb commented on a diff in the pull request: https://github.com/apache/helix/pull/7#discussion_r19130517 --- Diff: helix-core/src/test/java/org/apache/helix/integration/TestDrop.java --- @@ -481,11 +507,27 @@ public void testDropSinglePartitionSemiAuto() throws Exception { 4, // partitions per resource n, // number of nodes 2, // replicas -"MasterSlave", true); // do rebalance +"MasterSlave", mode, (IdealState.RebalanceMode.FULL_AUTO.equals(mode) || IdealState.RebalanceMode.SEMI_AUTO +.equals(mode))); // do rebalance only when it is in AUTO or SEMI-AUTO mode --- End diff -- Why is the default rebalance behavior insufficient for CUSTOMIZED mode? > Drop a partition from resource ideal-state shall bring partition to initial > state and then DROPPED state > > > Key: HELIX-525 > URL: https://issues.apache.org/jira/browse/HELIX-525 > Project: Apache Helix > Issue Type: Bug >Reporter: Zhen Zhang > Original Estimate: 24h > Remaining Estimate: 24h > > If we manually remove a partition from ideal-state, Helix should bring the > partition to initial state (e.g. OFFLINE) on all hosts and then to DROPPED > state. > - Add an integration test to verify this (for AUTO, SEMI_AUTO, and CUSTOM > modes) > - Fix it if not behave in the expected way -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-525) Drop a partition from resource ideal-state shall bring partition to initial state and then DROPPED state
[ https://issues.apache.org/jira/browse/HELIX-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179299#comment-14179299 ] ASF GitHub Bot commented on HELIX-525: -- Github user lei-xia commented on the pull request: https://github.com/apache/helix/pull/7#issuecomment-60014784 It is actually not necessary, thanks for pointing that out. I updated it, and also used a TestNG data provider to supply RebalanceMode to avoid writing one test method for each mode. I will apply the same strategy to many of our existing tests to reduce test code redundancy in the following check-ins if you guys are happy with it. > Drop a partition from resource ideal-state shall bring partition to initial > state and then DROPPED state > > > Key: HELIX-525 > URL: https://issues.apache.org/jira/browse/HELIX-525 > Project: Apache Helix > Issue Type: Bug >Reporter: Zhen Zhang > Original Estimate: 24h > Remaining Estimate: 24h > > If we manually remove a partition from ideal-state, Helix should bring the > partition to initial state (e.g. OFFLINE) on all hosts and then to DROPPED > state. > - Add an integration test to verify this (for AUTO, SEMI_AUTO, and CUSTOM > modes) > - Fix it if not behave in the expected way -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-525) Drop a partition from resource ideal-state shall bring partition to initial state and then DROPPED state
[ https://issues.apache.org/jira/browse/HELIX-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179519#comment-14179519 ] ASF GitHub Bot commented on HELIX-525: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/7#issuecomment-60031395 LGTM > Drop a partition from resource ideal-state shall bring partition to initial > state and then DROPPED state > > > Key: HELIX-525 > URL: https://issues.apache.org/jira/browse/HELIX-525 > Project: Apache Helix > Issue Type: Bug >Reporter: Zhen Zhang > Original Estimate: 24h > Remaining Estimate: 24h > > If we manually remove a partition from ideal-state, Helix should bring the > partition to initial state (e.g. OFFLINE) on all hosts and then to DROPPED > state. > - Add an integration test to verify this (for AUTO, SEMI_AUTO, and CUSTOM > modes) > - Fix it if not behave in the expected way -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-525) Drop a partition from resource ideal-state shall bring partition to initial state and then DROPPED state
[ https://issues.apache.org/jira/browse/HELIX-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179526#comment-14179526 ] ASF GitHub Bot commented on HELIX-525: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/7 > Drop a partition from resource ideal-state shall bring partition to initial > state and then DROPPED state > > > Key: HELIX-525 > URL: https://issues.apache.org/jira/browse/HELIX-525 > Project: Apache Helix > Issue Type: Bug >Reporter: Zhen Zhang > Original Estimate: 24h > Remaining Estimate: 24h > > If we manually remove a partition from ideal-state, Helix should bring the > partition to initial state (e.g. OFFLINE) on all hosts and then to DROPPED > state. > - Add an integration test to verify this (for AUTO, SEMI_AUTO, and CUSTOM > modes) > - Fix it if not behave in the expected way -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-537) org.apache.helix.task.TaskStateModel should have a shutdown method.
[ https://issues.apache.org/jira/browse/HELIX-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206903#comment-14206903 ] ASF GitHub Bot commented on HELIX-537: -- GitHub user atcurtis opened a pull request: https://github.com/apache/helix/pull/8 [HELIX-537] Shutdown executors https://issues.apache.org/jira/browse/HELIX-537 You can merge this pull request into a Git repository by running: $ git pull https://github.com/atcurtis/helix HELIX-537 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/8.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8 commit d6fbcf17e01439a290a0cc273933660274841663 Author: Antony T Curtis Date: 2014-11-11T19:17:43Z [HELIX-537] Shutdown executors > org.apache.helix.task.TaskStateModel should have a shutdown method. > --- > > Key: HELIX-537 > URL: https://issues.apache.org/jira/browse/HELIX-537 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > There should be a shutdown method to terminate the Timer and Executor which > the org.apache.helix.task.TaskStateModel class creates. > ie. > {noformat} > public boolean shutdown(long timeout, TimeUnit unit) > throws InterruptedException > { > reset(); > _taskExecutor.shutdown(); > _timer.cancel(); > return _taskExecutor.awaitTermination(timeout, unit); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-537) org.apache.helix.task.TaskStateModel should have a shutdown method.
[ https://issues.apache.org/jira/browse/HELIX-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207673#comment-14207673 ] ASF GitHub Bot commented on HELIX-537: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/8#issuecomment-62669496 ```[INFO] - [ERROR] COMPILATION ERROR : [INFO] - [ERROR] /Users/kanak/Developer/incubator-helix/helix-core/src/main/java/org/apache/helix/task/TaskStateModel.java:[70,2] missing return statement [INFO] 1 error [INFO] - [INFO] [INFO] BUILD FAILURE``` `TaskStateModel#shutdown` has return type `boolean`, but returns nothing. > org.apache.helix.task.TaskStateModel should have a shutdown method. > --- > > Key: HELIX-537 > URL: https://issues.apache.org/jira/browse/HELIX-537 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > There should be a shutdown method to terminate the Timer and Executor which > the org.apache.helix.task.TaskStateModel class creates. > ie. > {noformat} > public boolean shutdown(long timeout, TimeUnit unit) > throws InterruptedException > { > reset(); > _taskExecutor.shutdown(); > _timer.cancel(); > return _taskExecutor.awaitTermination(timeout, unit); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-537) org.apache.helix.task.TaskStateModel should have a shutdown method.
[ https://issues.apache.org/jira/browse/HELIX-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207674#comment-14207674 ] ASF GitHub Bot commented on HELIX-537: -- Github user atcurtis commented on the pull request: https://github.com/apache/helix/pull/8#issuecomment-62669668 Oops, I pushed the wrong commit to github. I’ll force push the correct one to my repo. On Nov 11, 2014, at 8:42 PM, Kanak Biscuitwala wrote: > [INFO] - > [ERROR] COMPILATION ERROR : > [INFO] - > [ERROR] /Users/kanak/Developer/incubator-helix/helix-core/src/main/java/org/apache/helix/task/TaskStateModel.java:[70,2] missing return statement > [INFO] 1 error > [INFO] - > [INFO] > [INFO] BUILD FAILURE > > TaskStateModel#shutdown has return type boolean, but returns nothing. > > org.apache.helix.task.TaskStateModel should have a shutdown method. > --- > > Key: HELIX-537 > URL: https://issues.apache.org/jira/browse/HELIX-537 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > There should be a shutdown method to terminate the Timer and Executor which > the org.apache.helix.task.TaskStateModel class creates. > ie. > {noformat} > public boolean shutdown(long timeout, TimeUnit unit) > throws InterruptedException > { > reset(); > _taskExecutor.shutdown(); > _timer.cancel(); > return _taskExecutor.awaitTermination(timeout, unit); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-537) org.apache.helix.task.TaskStateModel should have a shutdown method.
[ https://issues.apache.org/jira/browse/HELIX-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207681#comment-14207681 ] ASF GitHub Bot commented on HELIX-537: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/8#issuecomment-62670141 LGTM, tests pass, will merge. > org.apache.helix.task.TaskStateModel should have a shutdown method. > --- > > Key: HELIX-537 > URL: https://issues.apache.org/jira/browse/HELIX-537 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > There should be a shutdown method to terminate the Timer and Executor which > the org.apache.helix.task.TaskStateModel class creates. > ie. > {noformat} > public boolean shutdown(long timeout, TimeUnit unit) > throws InterruptedException > { > reset(); > _taskExecutor.shutdown(); > _timer.cancel(); > return _taskExecutor.awaitTermination(timeout, unit); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-537) org.apache.helix.task.TaskStateModel should have a shutdown method.
[ https://issues.apache.org/jira/browse/HELIX-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207745#comment-14207745 ] ASF GitHub Bot commented on HELIX-537: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/8 > org.apache.helix.task.TaskStateModel should have a shutdown method. > --- > > Key: HELIX-537 > URL: https://issues.apache.org/jira/browse/HELIX-537 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > There should be a shutdown method to terminate the Timer and Executor which > the org.apache.helix.task.TaskStateModel class creates. > ie. > {noformat} > public boolean shutdown(long timeout, TimeUnit unit) > throws InterruptedException > { > reset(); > _taskExecutor.shutdown(); > _timer.cancel(); > return _taskExecutor.awaitTermination(timeout, unit); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
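As a self-contained illustration of the shutdown method quoted in the issue, the sketch below applies the same sequence (stop the executor, cancel the timer, wait for termination) to a stand-in class. `ShutdownSketch` and its `submit` helper are hypothetical, the field names drop the original underscores, and the `reset()` call from the issue is omitted because its behavior is not shown in the thread.

```java
import java.util.Timer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-in for TaskStateModel, keeping only the two resources named in the issue.
class ShutdownSketch {
  private final ExecutorService taskExecutor = Executors.newSingleThreadExecutor();
  private final Timer timer = new Timer("sketch-timer", true);

  // Hypothetical helper so the sketch has some work to wait for.
  void submit(Runnable work) {
    taskExecutor.submit(work);
  }

  // Returns true if all submitted work finished within the timeout.
  public boolean shutdown(long timeout, TimeUnit unit) throws InterruptedException {
    taskExecutor.shutdown(); // stop accepting new work; queued work still runs
    timer.cancel();          // discard scheduled timer tasks and end the timer thread
    return taskExecutor.awaitTermination(timeout, unit);
  }
}
```

Returning the result of `awaitTermination` is what gives the method its `boolean` return value, which is the statement the first pushed commit was missing according to the build error above.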
[jira] [Commented] (HELIX-549) Discarding Throwable exceptions makes threads unkillable.
[ https://issues.apache.org/jira/browse/HELIX-549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215180#comment-14215180 ] ASF GitHub Bot commented on HELIX-549: -- GitHub user atcurtis opened a pull request: https://github.com/apache/helix/pull/9 [HELIX-549] Rethrow ThreadDeath instead of discarding. https://issues.apache.org/jira/browse/HELIX-549 You can merge this pull request into a Git repository by running: $ git pull https://github.com/atcurtis/helix HELIX-549 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/9.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9 > Discarding Throwable exceptions makes threads unkillable. > - > > Key: HELIX-549 > URL: https://issues.apache.org/jira/browse/HELIX-549 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > Threads in loops which catch and discard Throwable end up discarding > ThreadDeath exceptions. This causes Thread.stop() to be effectively ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
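The idea behind the HELIX-549 pull request title, "Rethrow ThreadDeath instead of discarding", can be illustrated with a small sketch. This is not the code from the pull request; `RethrowThreadDeathSketch` and `runAll` are hypothetical names. Note that `ThreadDeath` and `Thread.stop()` are deprecated in recent JDKs (as far as I know `Thread.stop()` throws `UnsupportedOperationException` from JDK 20 on), so the sketch only demonstrates the catch ordering.

```java
/** Hypothetical worker loop that survives ordinary failures but not ThreadDeath. */
final class RethrowThreadDeathSketch {
  /**
   * Runs every task and returns how many failed with something other than
   * ThreadDeath. A ThreadDeath is rethrown so the thread can actually die.
   */
  @SuppressWarnings({"deprecation", "removal"})
  static int runAll(Iterable<Runnable> tasks) {
    int failures = 0;
    for (Runnable task : tasks) {
      try {
        task.run();
      } catch (ThreadDeath death) {
        throw death; // never swallow: propagates out of the loop
      } catch (Throwable t) {
        failures++; // ordinary failure: count it and keep looping
      }
    }
    return failures;
  }
}
```

The more specific `catch (ThreadDeath ...)` clause has to come before `catch (Throwable ...)`, otherwise the compiler rejects it as unreachable.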
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215282#comment-14215282 ] ASF GitHub Bot commented on HELIX-550: -- GitHub user atcurtis opened a pull request: https://github.com/apache/helix/pull/10 [HELIX-550] ZKHelixManager should shutdown GenericHelixController. https://issues.apache.org/jira/browse/HELIX-550 You can merge this pull request into a Git repository by running: $ git pull https://github.com/atcurtis/helix HELIX-550 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/10.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10 commit 073adf5e9da0d9508cfbc42df4fde460055db714 Author: Antony T Curtis Date: 2014-11-17T22:04:44Z [HELIX-550] ZKHelixManager should shutdown GenericHelixController. > ZKHelixManager does not shutdown GenericHelixController threads. > > > Key: HELIX-550 > URL: https://issues.apache.org/jira/browse/HELIX-550 > Project: Apache Helix > Issue Type: Bug >Reporter: Antony T Curtis >Priority: Critical > > ZKHelixManager does not shutdown GenericHelixController threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215426#comment-14215426 ] ASF GitHub Bot commented on HELIX-550: -- Github user atcurtis closed the pull request at: https://github.com/apache/helix/pull/10 > ZKHelixManager does not shutdown GenericHelixController threads. > > > Key: HELIX-550 > URL: https://issues.apache.org/jira/browse/HELIX-550 > Project: Apache Helix > Issue Type: Bug >Reporter: Antony T Curtis >Priority: Critical > > ZKHelixManager does not shutdown GenericHelixController threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215498#comment-14215498 ] ASF GitHub Bot commented on HELIX-550: -- GitHub user atcurtis opened a pull request: https://github.com/apache/helix/pull/11 [HELIX-550] ZKHelixManager should shutdown GenericHelixController. https://issues.apache.org/jira/browse/HELIX-550 tested with mvn test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/atcurtis/helix HELIX-550 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/11.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11 commit af882ea025b1daf821f9f17f969a587ca7ec3e17 Author: Antony T Curtis Date: 2014-11-17T22:04:44Z [HELIX-550] ZKHelixManager should shutdown GenericHelixController. > ZKHelixManager does not shutdown GenericHelixController threads. > > > Key: HELIX-550 > URL: https://issues.apache.org/jira/browse/HELIX-550 > Project: Apache Helix > Issue Type: Bug >Reporter: Antony T Curtis >Priority: Critical > > ZKHelixManager does not shutdown GenericHelixController threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-549) Discarding Throwable exceptions makes threads unkillable.
[ https://issues.apache.org/jira/browse/HELIX-549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217380#comment-14217380 ] ASF GitHub Bot commented on HELIX-549: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/9#issuecomment-63589932 LGTM, merged > Discarding Throwable exceptions makes threads unkillable. > - > > Key: HELIX-549 > URL: https://issues.apache.org/jira/browse/HELIX-549 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > Threads in loops which catch and discard Throwable end up discarding > ThreadDeath exceptions. This causes Thread.stop() to be effectively ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217391#comment-14217391 ]

ASF GitHub Bot commented on HELIX-550:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/11#discussion_r20557176

    --- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
    @@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
         }
       }
    +  public void shutdown() throws InterruptedException {
    +    stopRebalancingTimer();
    +    while (_eventThread.isAlive())
    +    {
    +      _eventThread.interrupt();
    +      _eventThread.join(1000);
    --- End diff --

    Can you change this to a constant variable?

> ZKHelixManager does not shutdown GenericHelixController threads.
> ----------------------------------------------------------------
>
>                 Key: HELIX-550
>                 URL: https://issues.apache.org/jira/browse/HELIX-550
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Antony T Curtis
>            Priority: Critical
>
> ZKHelixManager does not shutdown GenericHelixController threads.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217393#comment-14217393 ]

ASF GitHub Bot commented on HELIX-550:
--------------------------------------

Github user atcurtis commented on a diff in the pull request:

    https://github.com/apache/helix/pull/11#discussion_r20557205

    --- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
    @@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
         }
       }
    +  public void shutdown() throws InterruptedException {
    +    stopRebalancingTimer();
    +    while (_eventThread.isAlive())
    +    {
    +      _eventThread.interrupt();
    +      _eventThread.join(1000);
    --- End diff --

    Sure. Any preference for the constant name?

> ZKHelixManager does not shutdown GenericHelixController threads.
> ----------------------------------------------------------------
>
>                 Key: HELIX-550
>                 URL: https://issues.apache.org/jira/browse/HELIX-550
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Antony T Curtis
>            Priority: Critical
>
> ZKHelixManager does not shutdown GenericHelixController threads.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217396#comment-14217396 ]

ASF GitHub Bot commented on HELIX-550:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/11#discussion_r20557259

    --- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
    @@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
         }
       }
    +  public void shutdown() throws InterruptedException {
    +    stopRebalancingTimer();
    +    while (_eventThread.isAlive())
    +    {
    +      _eventThread.interrupt();
    +      _eventThread.join(1000);
    --- End diff --

    Maybe something like `EVENT_THREAD_JOIN_TIMEOUT`?

> ZKHelixManager does not shutdown GenericHelixController threads.
> ----------------------------------------------------------------
>
>                 Key: HELIX-550
>                 URL: https://issues.apache.org/jira/browse/HELIX-550
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Antony T Curtis
>            Priority: Critical
>
> ZKHelixManager does not shutdown GenericHelixController threads.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217397#comment-14217397 ]

ASF GitHub Bot commented on HELIX-550:
--------------------------------------

Github user kanakb commented on a diff in the pull request:

    https://github.com/apache/helix/pull/11#discussion_r20557289

    --- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java ---
    @@ -543,6 +554,19 @@ public void disconnect() {
           _zkclient.close();
           _zkclient = null;
           LOG.info("Cluster manager: " + _instanceName + " disconnected");
    +
    +      if (_controller != null) {
    +        try {
    +          _controller.shutdown();
    +        }
    --- End diff --

    nit: can you make the `catch` start on the same line as the close brace of the `try`?

> ZKHelixManager does not shutdown GenericHelixController threads.
> ----------------------------------------------------------------
>
>                 Key: HELIX-550
>                 URL: https://issues.apache.org/jira/browse/HELIX-550
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Antony T Curtis
>            Priority: Critical
>
> ZKHelixManager does not shutdown GenericHelixController threads.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217400#comment-14217400 ]

ASF GitHub Bot commented on HELIX-550:
--------------------------------------

Github user atcurtis commented on a diff in the pull request:

    https://github.com/apache/helix/pull/11#discussion_r20557401

    --- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java ---
    @@ -543,6 +554,19 @@ public void disconnect() {
           _zkclient.close();
           _zkclient = null;
           LOG.info("Cluster manager: " + _instanceName + " disconnected");
    +
    +      if (_controller != null) {
    +        try {
    +          _controller.shutdown();
    +        }
    --- End diff --

    np.

> ZKHelixManager does not shutdown GenericHelixController threads.
> ----------------------------------------------------------------
>
>                 Key: HELIX-550
>                 URL: https://issues.apache.org/jira/browse/HELIX-550
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Antony T Curtis
>            Priority: Critical
>
> ZKHelixManager does not shutdown GenericHelixController threads.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217412#comment-14217412 ] ASF GitHub Bot commented on HELIX-550: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/11#issuecomment-63591519 Merged, thanks! > ZKHelixManager does not shutdown GenericHelixController threads. > > > Key: HELIX-550 > URL: https://issues.apache.org/jira/browse/HELIX-550 > Project: Apache Helix > Issue Type: Bug >Reporter: Antony T Curtis >Priority: Critical > > ZKHelixManager does not shutdown GenericHelixController threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
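The interrupt-and-join loop discussed in the HELIX-550 review, with the literal `1000` pulled into a named constant as the reviewer asked, might look roughly like the sketch below. Only the constant name `EVENT_THREAD_JOIN_TIMEOUT` comes from the thread; the class `EventLoopShutdownSketch`, its constructor, and the daemon flag are assumptions made for illustration, and the merged Helix code may differ.

```java
/** Hypothetical owner of a long-running event thread. */
final class EventLoopShutdownSketch {
  /** Name suggested in review; Thread.join(long) takes milliseconds. */
  static final long EVENT_THREAD_JOIN_TIMEOUT = 1000L;

  private final Thread eventThread;

  EventLoopShutdownSketch(Runnable loop) {
    eventThread = new Thread(loop, "event-thread-sketch");
    eventThread.setDaemon(true); // assumption: do not block JVM exit in this demo
    eventThread.start();
  }

  /** Keeps interrupting the event thread until it has actually exited. */
  void shutdown() throws InterruptedException {
    while (eventThread.isAlive()) {
      eventThread.interrupt();
      eventThread.join(EVENT_THREAD_JOIN_TIMEOUT);
    }
  }

  boolean isAlive() {
    return eventThread.isAlive();
  }
}
```

The loop only ends once the thread honours the interrupt, which is why event loops that catch and discard everything (the HELIX-549 problem) would keep such a shutdown spinning.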
[jira] [Commented] (HELIX-549) Discarding Throwable exceptions makes threads unkillable.
[ https://issues.apache.org/jira/browse/HELIX-549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218074#comment-14218074 ] ASF GitHub Bot commented on HELIX-549: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/9 > Discarding Throwable exceptions makes threads unkillable. > - > > Key: HELIX-549 > URL: https://issues.apache.org/jira/browse/HELIX-549 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > Threads in loops which catch and discard Throwable end up discarding > ThreadDeath exceptions. This causes Thread.stop() to be effectively ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218073#comment-14218073 ] ASF GitHub Bot commented on HELIX-550: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/11 > ZKHelixManager does not shutdown GenericHelixController threads. > > > Key: HELIX-550 > URL: https://issues.apache.org/jira/browse/HELIX-550 > Project: Apache Helix > Issue Type: Bug >Reporter: Antony T Curtis >Priority: Critical > > ZKHelixManager does not shutdown GenericHelixController threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-555) ClusterStateVerifier leaks ZkClients.
[ https://issues.apache.org/jira/browse/HELIX-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220509#comment-14220509 ] ASF GitHub Bot commented on HELIX-555: -- GitHub user atcurtis opened a pull request: https://github.com/apache/helix/pull/12 [HELIX-555] Fix deficiency in ClusterStateVerifier api https://issues.apache.org/jira/browse/HELIX-555 You can merge this pull request into a Git repository by running: $ git pull https://github.com/atcurtis/helix HELIX-555 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/12.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12 commit b2794e744e945da966c7ac6ae6408636951281d3 Author: Antony T Curtis Date: 2014-11-21T04:36:01Z [HELIX-555] Fix deficiency in ClusterStateVerifier api > ClusterStateVerifier leaks ZkClients. > - > > Key: HELIX-555 > URL: https://issues.apache.org/jira/browse/HELIX-555 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > The classes in ClusterStateVerifier tend to leak ZkClients because there is > no way to provide an already constructed client to the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-555) ClusterStateVerifier leaks ZkClients.
[ https://issues.apache.org/jira/browse/HELIX-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220525#comment-14220525 ] ASF GitHub Bot commented on HELIX-555: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/12#issuecomment-63924711 It looks like the original code already used `ZkClientPool`. How does this change improve the situation? > ClusterStateVerifier leaks ZkClients. > - > > Key: HELIX-555 > URL: https://issues.apache.org/jira/browse/HELIX-555 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > The classes in ClusterStateVerifier tend to leak ZkClients because there is > no way to provide an already constructed client to the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-555) ClusterStateVerifier leaks ZkClients.
[ https://issues.apache.org/jira/browse/HELIX-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220527#comment-14220527 ] ASF GitHub Bot commented on HELIX-555: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/12#issuecomment-63924907 Or are there a bunch of tests/tools that create their own ZK client in addition to the one created by `ClusterStateVerifier`? > ClusterStateVerifier leaks ZkClients. > - > > Key: HELIX-555 > URL: https://issues.apache.org/jira/browse/HELIX-555 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > The classes in ClusterStateVerifier tend to leak ZkClients because there is > no way to provide an already constructed client to the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-555) ClusterStateVerifier leaks ZkClients.
[ https://issues.apache.org/jira/browse/HELIX-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220556#comment-14220556 ] ASF GitHub Bot commented on HELIX-555: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/12#issuecomment-63927164 OK, in that case, merged. > ClusterStateVerifier leaks ZkClients. > - > > Key: HELIX-555 > URL: https://issues.apache.org/jira/browse/HELIX-555 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > The classes in ClusterStateVerifier tend to leak ZkClients because there is > no way to provide an already constructed client to the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-555) ClusterStateVerifier leaks ZkClients.
[ https://issues.apache.org/jira/browse/HELIX-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221098#comment-14221098 ] ASF GitHub Bot commented on HELIX-555: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/12 > ClusterStateVerifier leaks ZkClients. > - > > Key: HELIX-555 > URL: https://issues.apache.org/jira/browse/HELIX-555 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3 >Reporter: Antony T Curtis >Priority: Blocker > > The classes in ClusterStateVerifier tend to leak ZkClients because there is > no way to provide an already constructed client to the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
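The leak described in HELIX-555 comes from a class that can only build its own client. A rough sketch of the alternative, letting the caller pass in an already constructed client that the caller also closes, is shown below. `FakeClient` and `Verifier` are hypothetical stand-ins and do not mirror the real `ZkClient` or `ClusterStateVerifier` APIs.

```java
/** Hypothetical illustration of injecting a shared client instead of creating one per use. */
final class InjectedClientSketch {
  /** Stand-in for a ZooKeeper client; counts how many instances are currently open. */
  static final class FakeClient implements AutoCloseable {
    static int openCount = 0;

    FakeClient() {
      openCount++;
    }

    boolean exists(String path) {
      return path.startsWith("/");
    }

    @Override
    public void close() {
      openCount--;
    }
  }

  /** Stand-in verifier that borrows the caller's client and never closes it. */
  static final class Verifier {
    private final FakeClient client;

    Verifier(FakeClient client) {
      this.client = client;
    }

    boolean verify(String clusterPath) {
      return client.exists(clusterPath);
    }
  }
}
```

With this shape, ownership is explicit: whoever constructs the client closes it, and any number of verifiers can share it.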
[jira] [Commented] (HELIX-569) Update website docs to correctly pass the rebalance mode to addResource in helix-admin.sh
[ https://issues.apache.org/jira/browse/HELIX-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323126#comment-14323126 ]

ASF GitHub Bot commented on HELIX-569:
--------------------------------------

GitHub user acmcelwee opened a pull request:

    https://github.com/apache/helix/pull/16

    [HELIX-569] - Update docs to correctly pass rebalance mode during helix-admin.sh resource creation

    I stumbled upon this the other day in #apachehelix irc. I couldn't get the distributed locks example to work in full auto rebalance mode, and it turns out the docs just needed an update. Since the default rebalance mode since 0.6.2 is SEMI_AUTO, my testing halfway worked and exhibited very confusing behavior. This updates the docs to keep future new users headed in the right direction.

    https://issues.apache.org/jira/browse/HELIX-569

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/acmcelwee/helix helix-569

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/16.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16

commit 2525ecd8f37892201fc427450b85c342db55070c
Author: Adam McElwee
Date:   2015-02-16T02:09:25Z

    [HELIX-569] - Update docs to correctly pass rebalance mode during helix-admin.sh resource creation

> Update website docs to correctly pass the rebalance mode to addResource in
> helix-admin.sh
> -----------------------------------------------------------------------------------------
>
>                 Key: HELIX-569
>                 URL: https://issues.apache.org/jira/browse/HELIX-569
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Adam McElwee
>            Priority: Trivial
>             Fix For: 0.6.1-incubating, 0.6.2-incubating, 0.7.0-incubating,
> 0.7.1, 0.6.3, 0.6.4
>
>         Attachments: fix-rebalance-mode-arg.patch
>
>
> I stumbled upon this the other day in #apachehelix irc. I couldn't get the
> distributed locks example to work in full auto rebalance mode, and it turns
> out the docs just needed an update. Patch incoming.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-569) Update website docs to correctly pass the rebalance mode to addResource in helix-admin.sh
[ https://issues.apache.org/jira/browse/HELIX-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327006#comment-14327006 ] ASF GitHub Bot commented on HELIX-569: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/16#issuecomment-75000395 Thanks! > Update website docs to correctly pass the rebalance mode to addResource in > helix-admin.sh > - > > Key: HELIX-569 > URL: https://issues.apache.org/jira/browse/HELIX-569 > Project: Apache Helix > Issue Type: Bug >Reporter: Adam McElwee >Priority: Trivial > Fix For: 0.6.1-incubating, 0.6.2-incubating, 0.7.0-incubating, > 0.7.1, 0.6.3, 0.6.4 > > Attachments: fix-rebalance-mode-arg.patch > > > I stumbled upon this the other day in #apachehelix irc. I couldn't get the > distributed locks example to work in full auto rebalance mode, and it turns > out the docs just needed an update. Patch incoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-569) Update website docs to correctly pass the rebalance mode to addResource in helix-admin.sh
[ https://issues.apache.org/jira/browse/HELIX-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327008#comment-14327008 ] ASF GitHub Bot commented on HELIX-569: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/16 > Update website docs to correctly pass the rebalance mode to addResource in > helix-admin.sh > - > > Key: HELIX-569 > URL: https://issues.apache.org/jira/browse/HELIX-569 > Project: Apache Helix > Issue Type: Bug >Reporter: Adam McElwee >Priority: Trivial > Fix For: 0.6.1-incubating, 0.6.2-incubating, 0.7.0-incubating, > 0.7.1, 0.6.3, 0.6.4 > > Attachments: fix-rebalance-mode-arg.patch > > > I stumbled upon this the other day in #apachehelix irc. I couldn't get the > distributed locks example to work in full auto rebalance mode, and it turns > out the docs just needed an update. Patch incoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-578) NPE while deleting a job from a recurrent job queue
[ https://issues.apache.org/jira/browse/HELIX-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368037#comment-14368037 ]

ASF GitHub Bot commented on HELIX-578:
--------------------------------------

GitHub user lei-xia opened a pull request:

    https://github.com/apache/helix/pull/18

    [HELIX-578] NPE while deleting a job from a recurrent job queue.

    This is to fix the deletion job operation when trying to delete a job from a recurrent job queue. New unit test added. mvn install package passed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lei-xia/helix helix-0.6.x

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/18.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18

commit 944b16387b3ae5cf622b9d785dba17863341c084
Author: Lei Xia
Date:   2015-03-18T06:13:55Z

    [HELIX-578] NPE while deleting a job from a recurrent job queue.

> NPE while deleting a job from a recurrent job queue
> ---------------------------------------------------
>
>                 Key: HELIX-578
>                 URL: https://issues.apache.org/jira/browse/HELIX-578
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Karthiek
>            Assignee: Lei Xia
>            Priority: Critical
>
> Helix throws an NPE when we try to delete a job from recurrent job queue.
> Partial stacktrace:
> java.lang.NullPointerException
>   at org.apache.helix.task.TaskDriver.deleteJob(TaskDriver.java:295)
> Helix is looking for workflow context's current state.
> WorkflowContext wCtx = TaskUtil.getWorkflowContext(_propertyStore, queueName);
> String workflowState =
>     (wCtx != null) ? wCtx.getWorkflowState().name() :
>         TaskState.NOT_STARTED.name();
> But for a recurring workflow, there is no "state" in the parent workflow's
> context. Only the scheduled workflows will have a "state". Hence the NPE.
> To ensure that queue is stopped, Helix should look at the context of
> last-scheduled-workflow instead of the parent workflow.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-578) NPE while deleting a job from a recurrent job queue
[ https://issues.apache.org/jira/browse/HELIX-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370043#comment-14370043 ] ASF GitHub Bot commented on HELIX-578: -- Github user lei-xia closed the pull request at: https://github.com/apache/helix/pull/18 > NPE while deleting a job from a recurrent job queue > --- > > Key: HELIX-578 > URL: https://issues.apache.org/jira/browse/HELIX-578 > Project: Apache Helix > Issue Type: Bug >Reporter: Karthiek >Assignee: Lei Xia >Priority: Critical > > Helix throws an NPE when we try to delete a job from recurrent job queue. > Partial stacktrace: > java.lang.NullPointerException > at org.apache.helix.task.TaskDriver.deleteJob(TaskDriver.java:295) > Helix is looking for workflow context's current state. > WorkflowContext wCtx = TaskUtil.getWorkflowContext(_propertyStore, queueName); > String workflowState = > (wCtx != null) ? wCtx.getWorkflowState().name() : > TaskState.NOT_STARTED.name(); > But for a recurring workflow, there is no "state" in the parent workflow's > context. Only the scheduled workflows will have a "state". Hence the NPE. > To ensure that queue is stopped, Helix should look at the context of > last-scheduled-workflow instead of the parent workflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-578) NPE while deleting a job from a recurrent job queue
[ https://issues.apache.org/jira/browse/HELIX-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370123#comment-14370123 ] ASF GitHub Bot commented on HELIX-578: -- GitHub user lei-xia opened a pull request: https://github.com/apache/helix/pull/19 [HELIX-578] NPE while deleting a job from a recurrent job queue. Updated one with fix as Zhen Zhang suggested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lei-xia/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/19.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19 commit 49ceac0e9449940546e37628780e5098ac4e8678 Author: Lei Xia Date: 2015-03-19T20:54:05Z [HELIX-578] NPE while deleting a job from a recurrent job queue. > NPE while deleting a job from a recurrent job queue > --- > > Key: HELIX-578 > URL: https://issues.apache.org/jira/browse/HELIX-578 > Project: Apache Helix > Issue Type: Bug >Reporter: Karthiek >Assignee: Lei Xia >Priority: Critical > > Helix throws an NPE when we try to delete a job from recurrent job queue. > Partial stacktrace: > java.lang.NullPointerException > at org.apache.helix.task.TaskDriver.deleteJob(TaskDriver.java:295) > Helix is looking for workflow context's current state. > WorkflowContext wCtx = TaskUtil.getWorkflowContext(_propertyStore, queueName); > String workflowState = > (wCtx != null) ? wCtx.getWorkflowState().name() : > TaskState.NOT_STARTED.name(); > But for a recurring workflow, there is no "state" in the parent workflow's > context. Only the scheduled workflows will have a "state". Hence the NPE. > To ensure that queue is stopped, Helix should look at the context of > last-scheduled-workflow instead of the parent workflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-578) NPE while deleting a job from a recurrent job queue
[ https://issues.apache.org/jira/browse/HELIX-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370663#comment-14370663 ] ASF GitHub Bot commented on HELIX-578: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/19 > NPE while deleting a job from a recurrent job queue > --- > > Key: HELIX-578 > URL: https://issues.apache.org/jira/browse/HELIX-578 > Project: Apache Helix > Issue Type: Bug >Reporter: Karthiek >Assignee: Lei Xia >Priority: Critical > > Helix throws an NPE when we try to delete a job from recurrent job queue. > Partial stacktrace: > java.lang.NullPointerException > at org.apache.helix.task.TaskDriver.deleteJob(TaskDriver.java:295) > Helix is looking for workflow context's current state. > WorkflowContext wCtx = TaskUtil.getWorkflowContext(_propertyStore, queueName); > String workflowState = > (wCtx != null) ? wCtx.getWorkflowState().name() : > TaskState.NOT_STARTED.name(); > But for a recurring workflow, there is no "state" in the parent workflow's > context. Only the scheduled workflows will have a "state". Hence the NPE. > To ensure that queue is stopped, Helix should look at the context of > last-scheduled-workflow instead of the parent workflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
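The failure mode in HELIX-578 is a context object that exists but has no state, so the `wCtx != null` check alone is not enough. The sketch below shows one null-safe way to resolve the state, preferring the last scheduled workflow as the issue suggests. All types here (`State`, `Context`, `effectiveState`) are hypothetical; this is not the fix that was merged in pull request #19.

```java
/** Hypothetical model of resolving a workflow state without dereferencing null. */
final class WorkflowStateSketch {
  enum State { NOT_STARTED, IN_PROGRESS, STOPPED }

  /** Stand-in for a workflow context; a recurring parent may have no state of its own. */
  static final class Context {
    final State state;           // may be null for a recurring parent workflow
    final Context lastScheduled; // may be null if nothing has been scheduled yet

    Context(State state, Context lastScheduled) {
      this.state = state;
      this.lastScheduled = lastScheduled;
    }
  }

  /** Prefers the last scheduled workflow's state and falls back to NOT_STARTED. */
  static String effectiveState(Context ctx) {
    if (ctx == null) {
      return State.NOT_STARTED.name();
    }
    Context source = (ctx.lastScheduled != null) ? ctx.lastScheduled : ctx;
    return (source.state != null) ? source.state.name() : State.NOT_STARTED.name();
  }
}
```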
[jira] [Commented] (HELIX-584) SimpleDateFormat should not be used as singleton due to its race conditions
[ https://issues.apache.org/jira/browse/HELIX-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372283#comment-14372283 ] ASF GitHub Bot commented on HELIX-584: -- GitHub user lei-xia opened a pull request: https://github.com/apache/helix/pull/20 [HELIX-584] SimpleDateFormat should not be used as singleton due to its race conditions You can merge this pull request into a Git repository by running: $ git pull https://github.com/lei-xia/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/20.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20 commit e924a4c4ee1f1c52dcf6b478bbc88d3050e9d0f8 Author: Lei Xia Date: 2015-03-20T22:55:24Z [HELIX-584] SimpleDateFormat should not be used as singleton due to its race conditions. > SimpleDateFormat should not be used as singleton due to its race conditions > --- > > Key: HELIX-584 > URL: https://issues.apache.org/jira/browse/HELIX-584 > Project: Apache Helix > Issue Type: Bug >Reporter: Lei Xia >Assignee: Lei Xia > > SimpleDateFormat is used in workflowConfig as a singleton. But since it is > not thread-safe (refer here: > http://www.hpenterprisesecurity.com/vulncat/en/vulncat/java/race_condition_format_flaw.html), > it will mess up the output date format sometime due to race condition. 
> An example trace stack for such failure: > Message: > For input string: "2003.E2003E22" > Full Stacktrace: > java.lang.NumberFormatException: For input string: "2003.E2003E22" > at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222) > at java.lang.Double.parseDouble(Double.java:510) > at java.text.DigitList.getDouble(DigitList.java:151) > at java.text.DecimalFormat.parse(DecimalFormat.java:1302) > at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1935) > at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1311) > at java.text.DateFormat.parse(DateFormat.java:335) > at > org.apache.helix.task.TaskUtil.parseScheduleFromConfigMap(TaskUtil.java:365) > at > org.apache.helix.task.WorkflowConfig$Builder.fromMap(WorkflowConfig.java:173) > at org.apache.helix.task.TaskUtil.getWorkflowCfg(TaskUtil.java:113) > at org.apache.helix.task.TaskUtil.getWorkflowCfg(TaskUtil.java:126) > at > org.apache.helix.integration.task.TestUtil.pollForJobState(TestUtil.java:61) > at > org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopDeleteJobAndResumeRecurrentQueue(TestTaskRebalancerStopResume.java:420) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76) > at org.testng.internal.Invoker.invokeMethod(Invoker.java:673) > at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:846) > at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1170) > at > org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125) > at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109) > at org.testng.TestRunner.runWorkers(TestRunner.java:1147) > at 
org.testng.TestRunner.privateRun(TestRunner.java:749) > at org.testng.TestRunner.run(TestRunner.java:600) > at org.testng.SuiteRunner.runTest(SuiteRunner.java:317) > at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:312) > at org.testng.SuiteRunner.privateRun(SuiteRunner.java:274) > at org.testng.SuiteRunner.run(SuiteRunner.java:223) > at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) > at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) > at org.testng.TestNG.runSuitesSequentially(TestNG.java:1039) > at org.testng.TestNG.runSuitesLocally(TestNG.java:964) > at org.testng.TestNG.run(TestNG.java:900) > at > org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:178) > at > org.apache.maven.surefire.testng.TestNGXmlTestSuite.execute(TestNGXmlTestSuite.java:92) > at > org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:96) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.jav
[jira] [Commented] (HELIX-584) SimpleDateFormat should not be used as singleton due to its race conditions
[ https://issues.apache.org/jira/browse/HELIX-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372320#comment-14372320 ] ASF GitHub Bot commented on HELIX-584: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/20 > SimpleDateFormat should not be used as singleton due to its race conditions > --- > > Key: HELIX-584 > URL: https://issues.apache.org/jira/browse/HELIX-584 > Project: Apache Helix > Issue Type: Bug >Reporter: Lei Xia >Assignee: Lei Xia > > SimpleDateFormat is used in workflowConfig as a singleton. But since it is > not thread-safe (refer here: > http://www.hpenterprisesecurity.com/vulncat/en/vulncat/java/race_condition_format_flaw.html), > it will mess up the output date format sometime due to race condition. > An example trace stack for such failure: > Message: > For input string: "2003.E2003E22" > Full Stacktrace: > java.lang.NumberFormatException: For input string: "2003.E2003E22" > at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222) > at java.lang.Double.parseDouble(Double.java:510) > at java.text.DigitList.getDouble(DigitList.java:151) > at java.text.DecimalFormat.parse(DecimalFormat.java:1302) > at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1935) > at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1311) > at java.text.DateFormat.parse(DateFormat.java:335) > at > org.apache.helix.task.TaskUtil.parseScheduleFromConfigMap(TaskUtil.java:365) > at > org.apache.helix.task.WorkflowConfig$Builder.fromMap(WorkflowConfig.java:173) > at org.apache.helix.task.TaskUtil.getWorkflowCfg(TaskUtil.java:113) > at org.apache.helix.task.TaskUtil.getWorkflowCfg(TaskUtil.java:126) > at > org.apache.helix.integration.task.TestUtil.pollForJobState(TestUtil.java:61) > at > org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopDeleteJobAndResumeRecurrentQueue(TestTaskRebalancerStopResume.java:420) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76) > at org.testng.internal.Invoker.invokeMethod(Invoker.java:673) > at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:846) > at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1170) > at > org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125) > at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109) > at org.testng.TestRunner.runWorkers(TestRunner.java:1147) > at org.testng.TestRunner.privateRun(TestRunner.java:749) > at org.testng.TestRunner.run(TestRunner.java:600) > at org.testng.SuiteRunner.runTest(SuiteRunner.java:317) > at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:312) > at org.testng.SuiteRunner.privateRun(SuiteRunner.java:274) > at org.testng.SuiteRunner.run(SuiteRunner.java:223) > at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) > at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) > at org.testng.TestNG.runSuitesSequentially(TestNG.java:1039) > at org.testng.TestNG.runSuitesLocally(TestNG.java:964) > at org.testng.TestNG.run(TestNG.java:900) > at > org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:178) > at > org.apache.maven.surefire.testng.TestNGXmlTestSuite.execute(TestNGXmlTestSuite.java:92) > at > org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:96) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at 
java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:208) > at > org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:158) > at > org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) > at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:95) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
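A minimal sketch of two common ways to avoid the shared-`SimpleDateFormat` race described in this issue: give each thread its own formatter, or use the immutable `java.time.format.DateTimeFormatter`. This is not the change made in the pull request; the class name `ScheduleDates`, its methods, and the pattern `MM-dd-yyyy HH:mm:ss` are assumptions for illustration only.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper, not taken from the Helix code base.
final class ScheduleDates {
  // Assumed pattern, chosen only for this example.
  private static final String PATTERN = "MM-dd-yyyy HH:mm:ss";

  // Option 1: one SimpleDateFormat per thread instead of one shared singleton.
  private static final ThreadLocal<SimpleDateFormat> PER_THREAD =
      ThreadLocal.withInitial(() -> new SimpleDateFormat(PATTERN));

  // Option 2: DateTimeFormatter is immutable, so a single instance can be shared.
  private static final DateTimeFormatter SHARED = DateTimeFormatter.ofPattern(PATTERN);

  static Date parseLegacy(String text) {
    try {
      return PER_THREAD.get().parse(text);
    } catch (ParseException e) {
      // Surface bad input as an unchecked exception to keep callers simple.
      throw new IllegalArgumentException("Unparseable date: " + text, e);
    }
  }

  static String formatLegacy(Date date) {
    return PER_THREAD.get().format(date);
  }

  static LocalDateTime parseModern(String text) {
    return LocalDateTime.parse(text, SHARED);
  }
}
```

Either option removes the shared mutable state that can produce inputs such as "2003.E2003E22" in the trace above. Creating a new `SimpleDateFormat` on every call would also work, at the cost of extra allocation.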
[jira] [Commented] (HELIX-589) Delete job API throws NPE if the job does not exist in last scheduled workflow
[ https://issues.apache.org/jira/browse/HELIX-589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393628#comment-14393628 ] ASF GitHub Bot commented on HELIX-589: -- GitHub user jicongrui opened a pull request: https://github.com/apache/helix/pull/22 [HELIX-589] Delete job API throws NPE if the job does not exist in last ... [HELIX-589] Delete job API throws NPE if the job does not exist in last scheduled workflow You can merge this pull request into a Git repository by running: $ git pull https://github.com/jicongrui/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/22.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22 commit e94a9f5f90099a248181d6dc50314aec0e8d9512 Author: Congrui Ji Date: 2015-04-02T22:40:09Z [HELIX-589] Delete job API throws NPE if the job does not exist in last scheduled workflow > Delete job API throws NPE if the job does not exist in last scheduled workflow > -- > > Key: HELIX-589 > URL: https://issues.apache.org/jira/browse/HELIX-589 > Project: Apache Helix > Issue Type: Bug >Reporter: Karthiek > > When trying to delete a job from a recurrent job queue, Helix throws NPE if > the job does not exist in last scheduled workflow. > java.lang.IllegalArgumentException: Could not remove job BackupJob_MailboxDB > from DAG of queue BackupJobQueue_ESPRESSO_CHO_1_SCHEDULED_0 > at > org.apache.helix.task.TaskDriver.removeJobFromDag(TaskDriver.java:411) > at > org.apache.helix.task.TaskDriver.deleteJobFromScheduledQueue(TaskDriver.java:345) > at org.apache.helix.task.TaskDriver.deleteJob(TaskDriver.java:303) > It is possible for a user to add and immediately delete the job before the > next workflow is scheduled. Helix should accommodate this case and check if > the job exists in last scheduled workflow before trying to delete it. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
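The guard requested in this issue (check that the job exists before trying to remove it) can be sketched as follows. `JobGraph` and `removeJobIfPresent` are hypothetical stand-ins for illustration; they are not Helix's `TaskDriver` or DAG API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a job DAG: each job maps to the set of its child jobs.
final class JobGraph {
  private final Map<String, Set<String>> children = new HashMap<>();

  void addJob(String job) {
    children.putIfAbsent(job, new HashSet<>());
  }

  boolean contains(String job) {
    return children.containsKey(job);
  }

  // Returns true if the job was removed, false if it was never in the graph.
  boolean removeJobIfPresent(String job) {
    if (!contains(job)) {
      return false; // nothing to delete, e.g. the job was added after the last schedule
    }
    children.remove(job);
    for (Set<String> childSet : children.values()) {
      childSet.remove(job); // drop edges that pointed at the removed job
    }
    return true;
  }
}
```

Returning a boolean lets the caller decide whether a missing job is an error or an expected case, such as a job that was added and deleted before the next workflow was scheduled.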
[jira] [Commented] (HELIX-589) Delete job API throws NPE if the job does not exist in last scheduled workflow
[ https://issues.apache.org/jira/browse/HELIX-589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396380#comment-14396380 ] ASF GitHub Bot commented on HELIX-589: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/22
[jira] [Commented] (HELIX-589) Delete job API throws NPE if the job does not exist in last scheduled workflow
[ https://issues.apache.org/jira/browse/HELIX-589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396383#comment-14396383 ] ASF GitHub Bot commented on HELIX-589: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/22#issuecomment-89847798 Merged, thanks.
[jira] [Commented] (HELIX-584) SimpleDateFormat should not be used as singleton due to its race conditions
[ https://issues.apache.org/jira/browse/HELIX-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513347#comment-14513347 ] ASF GitHub Bot commented on HELIX-584: -- GitHub user lei-xia opened a pull request: https://github.com/apache/helix/pull/24 [HELIX-584] SimpleDateFormat should not be used as singleton due to its race conditions [HELIX-584] SimpleDateFormat should not be used as singleton due to its race conditions. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lei-xia/helix master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/24.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #24 commit a29ac9fb35365d1fe8cf12ef95c58643a5fea36b Author: Lei Xia Date: 2015-04-27T00:25:33Z [HELIX-584] SimpleDateFormat should not be used as singleton due to its race conditions.
[jira] [Commented] (HELIX-584) SimpleDateFormat should not be used as singleton due to its race conditions
[ https://issues.apache.org/jira/browse/HELIX-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513557#comment-14513557 ] ASF GitHub Bot commented on HELIX-584: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/24#issuecomment-96513218 thanks!
[jira] [Commented] (HELIX-584) SimpleDateFormat should not be used as singleton due to its race conditions
[ https://issues.apache.org/jira/browse/HELIX-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513556#comment-14513556 ] ASF GitHub Bot commented on HELIX-584: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/24
[jira] [Commented] (HELIX-591) Provide getResourcesWithTag in HelixAdmin to retrieve all resources with a group tag
[ https://issues.apache.org/jira/browse/HELIX-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515148#comment-14515148 ] ASF GitHub Bot commented on HELIX-591: -- GitHub user lei-xia opened a pull request: https://github.com/apache/helix/pull/25 [HELIX-591] Provide getResourcesWithTag in HelixAdmin to retrieve all resources with a group tag. [HELIX-591] Provide getResourcesWithTag in HelixAdmin to retrieve all resources with a group tag. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lei-xia/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/25.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #25 commit 8a279a366d4e6c43366a1c5867a02f26768e5627 Author: Lei Xia Date: 2015-04-27T20:36:37Z [HELIX-591] Provide getResourcesWithTag in HelixAdmin to retrieve all resources with a group tag. > Provide getResourcesWithTag in HelixAdmin to retrieve all resources with a > group tag > > > Key: HELIX-591 > URL: https://issues.apache.org/jira/browse/HELIX-591 > Project: Apache Helix > Issue Type: Bug > Components: helix-core, helix-webapp-admin >Reporter: Lei Xia >Assignee: Lei Xia > > We need to retrieve resources with a given group tag. HelixAdmin should > provide a getResourcesWithTag method that retrieves all resources with a > given tag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
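To illustrate the kind of lookup this issue asks for, here is a small sketch that filters resources by tag. It works on a plain in-memory map and does not reflect how the pull request implements `getResourcesWithTag`; the class `ResourceTags`, the method `resourcesWithTag`, and the map layout are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: maps a resource name to its group tag (null if untagged).
final class ResourceTags {
  // Returns the names of all resources whose tag equals the given tag, sorted by name.
  static List<String> resourcesWithTag(Map<String, String> tagByResource, String tag) {
    return tagByResource.entrySet().stream()
        .filter(e -> tag.equals(e.getValue())) // untagged (null) entries never match
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }
}
```

Sorting the result is a choice made here to keep the output deterministic; a real implementation may return resources in any order.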
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520496#comment-14520496 ] ASF GitHub Bot commented on HELIX-592: -- GitHub user jicongrui opened a pull request: https://github.com/apache/helix/pull/26 [HELIX-592] addCluster should respect overwriteExisitng when adding stat... ...eModelDefinations There are some tests expecting exceptions when creating an existing cluster, and I changed the result. So the question is what the business logic for creating an existing cluster should be. If we allow that and overwrite is false, should we throw an exception or do nothing? You can merge this pull request into a Git repository by running: $ git pull https://github.com/jicongrui/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/26.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #26 commit 69cd1f27065710f6de157b82742673ab8baf5d11 Author: Congrui Ji Date: 2015-04-29T23:13:11Z [HELIX-592] addCluster should respect overwriteExisitng when adding stateModelDefinations There are some tests expecting exceptions when creating an existing cluster, and I changed the result. So the question is what the business logic for creating an existing cluster should be. If we allow that and overwrite is false, should we throw an exception or do nothing? > addCluster should respect overwriteExisitng when adding stateModelDefinations > - > > Key: HELIX-592 > URL: https://issues.apache.org/jira/browse/HELIX-592 > Project: Apache Helix > Issue Type: Bug >Reporter: Congrui Ji > > Currently addCluster in ClusterSetup.java ignores the overwriteExisitng > parameter while adding stateModelDefinations. This causes an exception: > StateModelDef already exist. Please help fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525454#comment-14525454 ] ASF GitHub Bot commented on HELIX-592: -- Github user kanakb commented on a diff in the pull request: https://github.com/apache/helix/pull/26#discussion_r29550320 --- Diff: helix-core/src/main/java/org/apache/helix/tools/ClusterSetup.java --- @@ -329,8 +329,9 @@ public HelixAdmin getClusterManagementTool() { return _admin; } - public void addStateModelDef(String clusterName, String stateModelDef, StateModelDefinition record) { --- End diff -- Please do not remove public methods. Instead, call the new method from the old one with a default value.
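The review comment above asks that the old public method stay in place and forward to the new one with a default value. A minimal sketch of that pattern follows; `ModelRegistry`, `addModel`, and the `overwriteExisting` flag are hypothetical names chosen for illustration, not the actual `ClusterSetup` signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a setup class with a public API that must stay compatible.
final class ModelRegistry {
  private final Map<String, String> models = new HashMap<>();

  // Old public signature is kept so existing callers still compile;
  // it forwards to the new overload with the previous default behaviour.
  public void addModel(String name, String definition) {
    addModel(name, definition, false);
  }

  // New overload carrying the extra flag.
  public void addModel(String name, String definition, boolean overwriteExisting) {
    if (models.containsKey(name) && !overwriteExisting) {
      throw new IllegalStateException("Model " + name + " already exists");
    }
    models.put(name, definition);
  }

  public String get(String name) {
    return models.get(name);
  }
}
```

Because the two-argument method passes `false`, callers that never knew about the flag see the same behaviour as before the change.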
[jira] [Commented] (HELIX-591) Provide getResourcesWithTag in HelixAdmin to retrieve all resources with a group tag
[ https://issues.apache.org/jira/browse/HELIX-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525452#comment-14525452 ] ASF GitHub Bot commented on HELIX-591: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/25
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538450#comment-14538450 ] ASF GitHub Bot commented on HELIX-592: -- GitHub user lei-xia opened a pull request: https://github.com/apache/helix/pull/27 [HELIX-592] addCluster should respect overwriteExisitng when adding stateModel Definations. Congrui Ji has sent this fix before, and got some comments. But he is on vacation, and we (LinkedIn) do need the fix ASAP, so I am sending this again. Thanks You can merge this pull request into a Git repository by running: $ git pull https://github.com/lei-xia/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/27.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #27 commit 57a5324034b689c8ae1fab855b75e7fc7e7517ef Author: Lei Xia Date: 2015-05-11T18:35:20Z [HELIX-592] addCluster should respect overwriteExisitng when adding stateModel Definations.
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539227#comment-14539227 ] ASF GitHub Bot commented on HELIX-592: -- Github user kanakb commented on a diff in the pull request: https://github.com/apache/helix/pull/27#discussion_r30104417 --- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java --- @@ -715,14 +715,26 @@ public ExternalView getResourceExternalView(String clusterName, String resourceN @Override public void addStateModelDef(String clusterName, String stateModelDef, StateModelDefinition stateModel) { +addStateModelDef(clusterName, stateModelDef, stateModel, false); + } + + @Override + public void addStateModelDef(String clusterName, String stateModelDef, + StateModelDefinition stateModel, boolean recreateIfExists) { if (!ZKUtil.isClusterSetup(clusterName, _zkClient)) { throw new HelixException("cluster " + clusterName + " is not setup yet"); } String stateModelDefPath = HelixUtil.getStateModelDefinitionPath(clusterName); String stateModelPath = stateModelDefPath + "/" + stateModelDef; if (_zkClient.exists(stateModelPath)) { - logger.warn("Skip the operation.State Model directory exists:" + stateModelPath); - throw new HelixException("State model path " + stateModelPath + " already exists."); + if (recreateIfExists) { +logger.warn("Operation.State Model directory exists:" + stateModelPath + +", remove and recreate."); +_zkClient.deleteRecursive(stateModelPath); + } else { +logger.warn("Skip the operation.State Model directory exists:" + stateModelPath); +return; --- End diff -- Why was this changed to no longer throw an exception? It's better to fail loudly in methods that return `void`. 
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539230#comment-14539230 ] ASF GitHub Bot commented on HELIX-592: -- Github user kanakb commented on a diff in the pull request: https://github.com/apache/helix/pull/27#discussion_r30104425 --- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java --- @@ -715,14 +715,26 @@ public ExternalView getResourceExternalView(String clusterName, String resourceN @Override public void addStateModelDef(String clusterName, String stateModelDef, StateModelDefinition stateModel) { +addStateModelDef(clusterName, stateModelDef, stateModel, false); + } + + @Override + public void addStateModelDef(String clusterName, String stateModelDef, + StateModelDefinition stateModel, boolean recreateIfExists) { if (!ZKUtil.isClusterSetup(clusterName, _zkClient)) { throw new HelixException("cluster " + clusterName + " is not setup yet"); } String stateModelDefPath = HelixUtil.getStateModelDefinitionPath(clusterName); String stateModelPath = stateModelDefPath + "/" + stateModelDef; if (_zkClient.exists(stateModelPath)) { - logger.warn("Skip the operation.State Model directory exists:" + stateModelPath); - throw new HelixException("State model path " + stateModelPath + " already exists."); + if (recreateIfExists) { +logger.warn("Operation.State Model directory exists:" + stateModelPath + --- End diff -- Change to `info` log level.
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540673#comment-14540673 ] ASF GitHub Bot commented on HELIX-592: -- Github user lei-xia commented on a diff in the pull request: https://github.com/apache/helix/pull/27#discussion_r30178346 --- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java --- @@ -715,14 +715,26 @@ public ExternalView getResourceExternalView(String clusterName, String resourceN @Override public void addStateModelDef(String clusterName, String stateModelDef, StateModelDefinition stateModel) { +addStateModelDef(clusterName, stateModelDef, stateModel, false); + } + + @Override + public void addStateModelDef(String clusterName, String stateModelDef, + StateModelDefinition stateModel, boolean recreateIfExists) { if (!ZKUtil.isClusterSetup(clusterName, _zkClient)) { throw new HelixException("cluster " + clusterName + " is not setup yet"); } String stateModelDefPath = HelixUtil.getStateModelDefinitionPath(clusterName); String stateModelPath = stateModelDefPath + "/" + stateModelDef; if (_zkClient.exists(stateModelPath)) { - logger.warn("Skip the operation.State Model directory exists:" + stateModelPath); - throw new HelixException("State model path " + stateModelPath + " already exists."); + if (recreateIfExists) { +logger.warn("Operation.State Model directory exists:" + stateModelPath + +", remove and recreate."); +_zkClient.deleteRecursive(stateModelPath); + } else { +logger.warn("Skip the operation.State Model directory exists:" + stateModelPath); +return; --- End diff -- This is to align with the behavior of addCluster, which return success if the cluster exists and overwrite flag is false. 
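The behaviour debated in this review thread can be illustrated with a small in-memory model. This is a minimal sketch and not the Helix implementation: `StateModelDefStoreSketch` is a hypothetical class, a `HashMap` stands in for the ZooKeeper paths, and the method returns a `boolean` (an assumption made here so that a skipped write is observable, whereas the method in the diff returns `void`).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the state model definition paths in ZooKeeper.
final class StateModelDefStoreSketch {
  private final Map<String, String> store = new HashMap<>();

  /**
   * Adds a definition under the given name.
   * Returns true if the definition was written, false if an existing one was kept.
   */
  boolean addStateModelDef(String name, String definition, boolean recreateIfExists) {
    if (store.containsKey(name)) {
      if (!recreateIfExists) {
        // Mirrors the "skip the operation" branch of the diff: no exception is thrown.
        return false;
      }
      // Stands in for removing the existing path before recreating it.
      store.remove(name);
    }
    store.put(name, definition);
    return true;
  }

  String get(String name) {
    return store.get(name);
  }

  public static void main(String[] args) {
    StateModelDefStoreSketch admin = new StateModelDefStoreSketch();
    System.out.println(admin.addStateModelDef("MasterSlave", "v1", false)); // written
    System.out.println(admin.addStateModelDef("MasterSlave", "v2", false)); // skipped
    System.out.println(admin.addStateModelDef("MasterSlave", "v2", true));  // recreated
  }
}
```

Returning a status is one way to reconcile the two positions in the thread: the call does not fail when the definition already exists, yet the caller can still tell that nothing was written.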
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540675#comment-14540675 ] ASF GitHub Bot commented on HELIX-592: -- Github user lei-xia commented on a diff in the pull request: https://github.com/apache/helix/pull/27#discussion_r30178394 --- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java --- @@ -715,14 +715,26 @@ public ExternalView getResourceExternalView(String clusterName, String resourceN @Override public void addStateModelDef(String clusterName, String stateModelDef, StateModelDefinition stateModel) { +addStateModelDef(clusterName, stateModelDef, stateModel, false); + } + + @Override + public void addStateModelDef(String clusterName, String stateModelDef, + StateModelDefinition stateModel, boolean recreateIfExists) { if (!ZKUtil.isClusterSetup(clusterName, _zkClient)) { throw new HelixException("cluster " + clusterName + " is not setup yet"); } String stateModelDefPath = HelixUtil.getStateModelDefinitionPath(clusterName); String stateModelPath = stateModelDefPath + "/" + stateModelDef; if (_zkClient.exists(stateModelPath)) { - logger.warn("Skip the operation.State Model directory exists:" + stateModelPath); - throw new HelixException("State model path " + stateModelPath + " already exists."); + if (recreateIfExists) { +logger.warn("Operation.State Model directory exists:" + stateModelPath + --- End diff -- Fixed in new diff. Thanks!
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541244#comment-14541244 ] ASF GitHub Bot commented on HELIX-592: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/27
[jira] [Commented] (HELIX-592) addCluster should respect overwriteExisitng when adding stateModelDefinations
[ https://issues.apache.org/jira/browse/HELIX-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541246#comment-14541246 ] ASF GitHub Bot commented on HELIX-592: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/27#issuecomment-101492973 This closes #26.
[jira] [Commented] (HELIX-596) Message throttling of controller behavior unexpectedly, throttled messages still take the constraint quota
[ https://issues.apache.org/jira/browse/HELIX-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553697#comment-14553697 ] ASF GitHub Bot commented on HELIX-596: -- GitHub user hangqi opened a pull request: https://github.com/apache/helix/pull/28 [HELIX-596] fix throttled messages still take constraints' quota Corresponding review request: https://reviews.apache.org/r/34345/ Main changes in this pull request: perMessageThrottleQuotaMap records all matched constraints quota for this message, and update the overall throttleMap iff the message has not been throttled. Originally not matter the message will be sent out or not, it will always take the quota of the matched constraints. @zzhang5 You can merge this pull request into a Git repository by running: $ git pull https://github.com/hangqi/helix fix_constrain_quota Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/28.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #28 commit 9ddbefcacff6b8e229e6413299d53d89f1cbcd43 Author: Hang Qi Date: 2015-05-18T06:06:20Z [HELIX-596] fix throttled messages still take constraints' quota > Message throttling of controller behavior unexpectedly, throttled messages > still take the constraint quota > -- > > Key: HELIX-596 > URL: https://issues.apache.org/jira/browse/HELIX-596 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hang Qi > Fix For: master > > > We found a very strange behavior on message throttling of controller when > there is multiple constraints. Here is our setup ( we are using helix-0.6.4, > only one resource ) > - constraint 1: per node constraint, we only allow 3 state transitions > happens on one node concurrently. 
> - constraint 2: per partition constraint, we define the state transition > priorities in the state model, and only allow one state transition happens on > one single partition concurrently. > We are using MasterSlave state model, suppose we have two nodes A, B, each > has 8 partitions (p0-p7) respectively, and initially both A and B are > shutdown, and now we start them at the same time (say A is slightly earlier > than B). > The expected behavior might be > - p0, p1, p2 on A starts from Offline -> Slave; p3, p4, p5 on B starts from > Offline -> Slave > But the real result is: > - p0, p1, p2 on A starts from Offline -> Slave, nothing happens on B > - until p0, p1, p2 all transited to Master state, p3, p4, p5 on A starts > from Offline -> Slave; p0, p1, p2 on B starts from Offline -> Slave > As step Offline -> Slave might take long time, this behavior result in very > long time to bring up these two nodes (long down time result in long catch up > time as well), though ideally we should not let both nodes down at the same > time. > Looked at the controller code, I like the stage and pipeline based > implementation, it is well design, very easy to understand and to reason > about. > The logic of MessageThrottleStage#throttle, > - it goes through each messages selected by MessageSelectionStage, > - for each message, it goes through all selected matched constraints, and > decrease the quota of each constraints > - if any constraint's quota is less than 0, this message will be marked > as throttled. > > I think there is something wrong here, the message will take the quota of > constraints even it is not going to be sent out (throttled). That explains > our case, > - all the messages have been generated by the beginning, (p0, A, > Offline->Slave), ... 
(p7, A, Offline->Slave), (p0, B, Offline->Slave), ..., > (p7, B, Offline->Slave) > - in MessageThrottleStage#throttle > - (p0, A, Offline->Slave), (p1, A, Offline->Slave), (p2, A, > Offline->Slave) are good, and constraint 1 on A reaches 0; constraint 2 on > p0, p1, p2 reaches 0 as well > - (p3, A, Offline->Slave), ... (p7, A, Offline->Slave) are throttled by > constraint 1 on A, but still take the quota of constraint 2 on those > partitions. > - (p0, B, Offline->Slave), ... (p7, B, Offline->Slave) are throttled by > constraint 2 > - thus only (p0, A, Offline->Slave), (p1, A, Offline->Slave), (p2, A, > Offline->Slave) have been sent out by the controller. > Does that make sense, or is there anything else you can think of that could > cause this unexpected behavior? And is there any workaround for it? One thing > that comes to mind is to update constraint 2 so that only one state transition is > allowed of single partition on c
[jira] [Commented] (HELIX-596) Message throttling of controller behavior unexpectedly, throttled messages still take the constraint quota
[ https://issues.apache.org/jira/browse/HELIX-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554866#comment-14554866 ] ASF GitHub Bot commented on HELIX-596: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/28 > Message throttling of controller behavior unexpectedly, throttled messages > still take the constraint quota > -- > > Key: HELIX-596 > URL: https://issues.apache.org/jira/browse/HELIX-596 > Project: Apache Helix > Issue Type: Bug > Components: helix-core >Affects Versions: 0.6.4 >Reporter: Hang Qi > Fix For: master > > > We found a very strange behavior on message throttling of controller when > there is multiple constraints. Here is our setup ( we are using helix-0.6.4, > only one resource ) > - constraint 1: per node constraint, we only allow 3 state transitions > happens on one node concurrently. > - constraint 2: per partition constraint, we define the state transition > priorities in the state model, and only allow one state transition happens on > one single partition concurrently. > We are using MasterSlave state model, suppose we have two nodes A, B, each > has 8 partitions (p0-p7) respectively, and initially both A and B are > shutdown, and now we start them at the same time (say A is slightly earlier > than B). > The expected behavior might be > - p0, p1, p2 on A starts from Offline -> Slave; p3, p4, p5 on B starts from > Offline -> Slave > But the real result is: > - p0, p1, p2 on A starts from Offline -> Slave, nothing happens on B > - until p0, p1, p2 all transited to Master state, p3, p4, p5 on A starts > from Offline -> Slave; p0, p1, p2 on B starts from Offline -> Slave > As step Offline -> Slave might take long time, this behavior result in very > long time to bring up these two nodes (long down time result in long catch up > time as well), though ideally we should not let both nodes down at the same > time. 
> Looking at the controller code, I like the stage- and pipeline-based > implementation; it is well designed and very easy to understand and to reason > about. > The logic of MessageThrottleStage#throttle: > - it goes through each message selected by MessageSelectionStage, > - for each message, it goes through all matched constraints and > decreases the quota of each constraint > - if any constraint's quota is less than 0, the message is marked > as throttled. > > I think there is something wrong here: a message takes the quota of > constraints even if it is not going to be sent out (throttled). That explains > our case: > - all the messages have been generated at the beginning: (p0, A, > Offline->Slave), ... (p7, A, Offline->Slave), (p0, B, Offline->Slave), ..., > (p7, B, Offline->Slave) > - in MessageThrottleStage#throttle > - (p0, A, Offline->Slave), (p1, A, Offline->Slave), (p2, A, > Offline->Slave) are good, and constraint 1 on A reaches 0; constraint 2 on > p0, p1, p2 reaches 0 as well > - (p3, A, Offline->Slave), ... (p7, A, Offline->Slave) are throttled by > constraint 1 on A, but still take the quota of constraint 2 on those > partitions. > - (p0, B, Offline->Slave), ... (p7, B, Offline->Slave) are throttled by > constraint 2 > - thus only (p0, A, Offline->Slave), (p1, A, Offline->Slave), (p2, A, > Offline->Slave) have been sent out by the controller. > Does that make sense, or is there anything else you can think of that could > cause this unexpected behavior? And is there any workaround for it? One thing > that comes to mind is to update constraint 2 so that only one state transition is > allowed on a single partition for certain state transitions. > Thanks very much. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
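The quota accounting described in this report can be reproduced with a small simulation. The following is a minimal sketch and not Helix code: `ThrottleSketch`, `Msg`, and the `chargeThrottled` flag are hypothetical names, and quotas are modelled as plain counters keyed by instance and by partition. With `chargeThrottled` set to `true`, throttled messages still consume quota (the reported behaviour); with `false`, quota is only charged for messages that are actually sent, which is the idea behind pull request #28.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ThrottleSketch {
  /** Hypothetical message: one state transition for a partition on an instance. */
  static final class Msg {
    final String partition;
    final String instance;

    Msg(String partition, String instance) {
      this.partition = partition;
      this.instance = instance;
    }

    @Override
    public String toString() {
      return "(" + partition + ", " + instance + ")";
    }
  }

  /** Returns the messages that pass throttling, in input order. */
  static List<Msg> throttle(List<Msg> msgs, int perInstance, int perPartition,
      boolean chargeThrottled) {
    Map<String, Integer> quota = new HashMap<>();
    List<Msg> sent = new ArrayList<>();
    for (Msg m : msgs) {
      String instanceKey = "instance:" + m.instance;
      String partitionKey = "partition:" + m.partition;
      int instanceLeft = quota.getOrDefault(instanceKey, perInstance);
      int partitionLeft = quota.getOrDefault(partitionKey, perPartition);
      // A message is throttled if any matched constraint has no quota left.
      boolean throttled = instanceLeft <= 0 || partitionLeft <= 0;
      // Reported behaviour: quota is charged even when the message is throttled.
      // Sketch of the fix: charge quota only for messages that will be sent.
      if (!throttled || chargeThrottled) {
        quota.put(instanceKey, instanceLeft - 1);
        quota.put(partitionKey, partitionLeft - 1);
      }
      if (!throttled) {
        sent.add(m);
      }
    }
    return sent;
  }

  /** The scenario from the report: p0..p7 on node A, then p0..p7 on node B. */
  static List<Msg> startupMessages() {
    List<Msg> msgs = new ArrayList<>();
    for (String instance : new String[] {"A", "B"}) {
      for (int p = 0; p < 8; p++) {
        msgs.add(new Msg("p" + p, instance));
      }
    }
    return msgs;
  }

  public static void main(String[] args) {
    System.out.println(throttle(startupMessages(), 3, 1, true));
    System.out.println(throttle(startupMessages(), 3, 1, false));
  }
}
```

With the setup from the report (3 transitions per node, 1 per partition), the first call selects only p0, p1, p2 on A, while the second also selects p3, p4, p5 on B, matching the expected behaviour stated in the issue.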
[jira] [Commented] (HELIX-600) Task scheduler fails to schedule a recurring workflow if the startTime is set to a future timestamp
[ https://issues.apache.org/jira/browse/HELIX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579731#comment-14579731 ] ASF GitHub Bot commented on HELIX-600: -- GitHub user lei-xia opened a pull request: https://github.com/apache/helix/pull/29 [HELIX-600] Task scheduler fails to schedule a recurring workflow if the startTime is set to a future timestamp. Ticket: https://issues.apache.org/jira/browse/HELIX-600 mvn test passed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lei-xia/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/29.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #29 commit a84c9b1f55cb5c01f2c39fb437c6d4effcee3874 Author: Lei Xia Date: 2015-06-09T21:40:32Z [HELIX-600] Task scheduler fails to schedule a recurring workflow if the startTime is set to a future timestamp. > Task scheduler fails to schedule a recurring workflow if the startTime is set > to a future timestamp > --- > > Key: HELIX-600 > URL: https://issues.apache.org/jira/browse/HELIX-600 > Project: Apache Helix > Issue Type: Bug >Affects Versions: 0.6.3, 0.6.4 >Reporter: Karthiek >Assignee: Lei Xia > > If we define a recurrent job queue with start-time value in the future (say > current time + 5 minutes), Helix does not schedule the queue event after > start-time timestamp elapses. Helix should schedule jobs once the recurrence > timestamp is hit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
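The expected behaviour in HELIX-600 comes down to a small piece of timestamp arithmetic. The following is a minimal sketch, assuming a fixed recurrence period in milliseconds; `RecurrenceSketch` and `nextStartMillis` are hypothetical names, and this is not the code from the pull request.

```java
final class RecurrenceSketch {
  /**
   * Returns the next time a recurring workflow should fire.
   * If the configured start time is still in the future, that start time is returned
   * unchanged; otherwise the first multiple of the period at or after "now" is returned.
   */
  static long nextStartMillis(long startMillis, long periodMillis, long nowMillis) {
    if (periodMillis <= 0) {
      throw new IllegalArgumentException("period must be positive");
    }
    if (nowMillis <= startMillis) {
      return startMillis; // future (or current) start time: wait for it, do not skip it
    }
    long elapsed = nowMillis - startMillis;
    long periods = (elapsed + periodMillis - 1) / periodMillis; // ceiling division
    return startMillis + periods * periodMillis;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long start = now + 5 * 60 * 1000L; // "current time + 5 minutes", as in the report
    System.out.println(nextStartMillis(start, 60 * 1000L, now) == start);
  }
}
```

The point of the first branch is that a start time in the future is a valid schedule target rather than a reason to drop the workflow.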
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593786#comment-14593786 ] ASF GitHub Bot commented on HELIX-601: -- GitHub user jicongrui opened a pull request: https://github.com/apache/helix/pull/30 [HELIX-601] Allow work flow to schedule dependency jobs in parallel Currently, Helix won't schedule dependency jobs in a same work flow. For example, if Job2 depends on Job1, Job2 won't be scheduled until every partition of Job1 is completed. However, if some participant is very slow, then all dependency jobs is waiting for that single participant. Helix should be able to schedule multiple jobs according to a parameter. A.C. 1. Introduce parallel count parameter in work flow and job queue. 2. Dependency jobs can be scheduled according to the parameter (Now the parameter is always 1, so no parallel) 3. If Job2 depends on Job1, Job1 is scheduled before Job2. 4. No parallel jobs on the same instance. If a instance is running Job1, it won't run Job2 until Job1 is finished. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jicongrui/helix helix-0.6.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/helix/pull/30.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #30 commit 8819220738b18c54652e4b32b9677ea78d585da2 Author: Congrui Ji Date: 2015-06-19T18:51:19Z [HELIX-601] Allow work flow to schedule dependency jobs in parallel Currently, Helix won't schedule dependency jobs in a same work flow. For example, if Job2 depends on Job1, Job2 won't be scheduled until every partition of Job1 is completed. However, if some participant is very slow, then all dependency jobs is waiting for that single participant. Helix should be able to schedule multiple jobs according to a parameter. A.C. 1. 
Introduce parallel count parameter in work flow and job queue. 2. Dependency jobs can be scheduled according to the parameter (Now the parameter is always 1, so no parallel) 3. If Job2 depends on Job1, Job1 is scheduled before Job2. 4. No parallel jobs on the same instance. If an instance is running Job1, it won't run Job2 until Job1 is finished. > Allow work flow to schedule dependency jobs in parallel > --- > > Key: HELIX-601 > URL: https://issues.apache.org/jira/browse/HELIX-601 > Project: Apache Helix > Issue Type: New Feature >Reporter: Congrui Ji > > Currently, Helix won't schedule dependent jobs of the same work flow in > parallel. For example, if Job2 depends on Job1, Job2 won't be scheduled until > every partition of Job1 is completed. > However, if some participant is very slow, then all dependent jobs wait > for that single participant. > Helix should be able to schedule multiple jobs according to a parameter. > A.C. > 1. Introduce a parallel count parameter in work flow and job queue. > 2. Dependent jobs can be scheduled according to the parameter (for now the > parameter is always 1, so no parallelism) > 3. If Job2 depends on Job1, Job1 is scheduled before Job2. > 4. No parallel jobs on the same instance. If an instance is running Job1, it > won't run Job2 until Job1 is finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
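The scheduling rule in the acceptance criteria above can be sketched as a check over the states of a job's ancestors. This is a minimal illustration under stated assumptions, not the code from the pull request: `ParallelJobsSketch`, its `JobState` enum, and `canSchedule` are hypothetical, and the rule assumed here is that a job may start only when every ancestor has at least started and fewer than `parallelJobs` ancestors are still in progress.

```java
import java.util.Arrays;
import java.util.List;

final class ParallelJobsSketch {
  /** Hypothetical job states; only the distinctions needed for the check are modelled. */
  enum JobState { NOT_STARTED, IN_PROGRESS, COMPLETED }

  /**
   * Decides whether a job may be scheduled given the states of all its ancestors.
   * With parallelJobs == 1 this reduces to "all ancestors are completed".
   */
  static boolean canSchedule(List<JobState> ancestorStates, int parallelJobs) {
    int notStartedCount = 0;
    int incompleteCount = 0;
    for (JobState state : ancestorStates) {
      if (state == JobState.NOT_STARTED) {
        notStartedCount++;
      } else if (state == JobState.IN_PROGRESS) {
        incompleteCount++;
      }
    }
    // Ancestors must be scheduled first (criterion 3), and the number of jobs
    // still running must stay below the parallel count (criteria 1 and 2).
    return notStartedCount == 0 && incompleteCount < parallelJobs;
  }

  public static void main(String[] args) {
    List<JobState> job1Running = Arrays.asList(JobState.IN_PROGRESS);
    System.out.println(canSchedule(job1Running, 1)); // Job2 waits for Job1
    System.out.println(canSchedule(job1Running, 2)); // Job2 may start alongside Job1
  }
}
```

Criterion 4 (no two jobs on the same instance at once) is a separate, per-instance check and is deliberately left out of this sketch.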
[jira] [Commented] (HELIX-600) Task scheduler fails to schedule a recurring workflow if the startTime is set to a future timestamp
[ https://issues.apache.org/jira/browse/HELIX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595340#comment-14595340 ] ASF GitHub Bot commented on HELIX-600: -- Github user asfgit closed the pull request at: https://github.com/apache/helix/pull/29
[jira] [Commented] (HELIX-600) Task scheduler fails to schedule a recurring workflow if the startTime is set to a future timestamp
[ https://issues.apache.org/jira/browse/HELIX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595339#comment-14595339 ] ASF GitHub Bot commented on HELIX-600: -- Github user kanakb commented on the pull request: https://github.com/apache/helix/pull/29#issuecomment-113996929 Merged -- thanks!
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595344#comment-14595344 ] ASF GitHub Bot commented on HELIX-601: -- Github user kanakb commented on a diff in the pull request: https://github.com/apache/helix/pull/30#discussion_r32904755 --- Diff: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerParallel.java --- @@ -0,0 +1,195 @@ +package org.apache.helix.integration.task; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.helix.AccessOption; +import org.apache.helix.HelixDataAccessor; +import org.apache.helix.HelixManager; +import org.apache.helix.HelixManagerFactory; +import org.apache.helix.InstanceType; +import org.apache.helix.PropertyKey; +import org.apache.helix.TestHelper; +import org.apache.helix.integration.ZkIntegrationTestBase; +import org.apache.helix.integration.manager.ClusterControllerManager; +import org.apache.helix.integration.manager.MockParticipantManager; +import org.apache.helix.participant.StateMachineEngine; +import org.apache.helix.task.JobConfig; +import org.apache.helix.task.JobContext; +import org.apache.helix.task.JobQueue; +import org.apache.helix.task.Task; +import org.apache.helix.task.TaskCallbackContext; +import org.apache.helix.task.TaskConstants; +import org.apache.helix.task.TaskDriver; +import org.apache.helix.task.TaskFactory; +import org.apache.helix.task.TaskPartitionState; +import org.apache.helix.task.TaskResult; +import org.apache.helix.task.TaskState; +import org.apache.helix.task.TaskStateModelFactory; +import org.apache.helix.task.TaskUtil; +import org.apache.helix.task.Workflow; +import org.apache.helix.tools.ClusterSetup; +import org.apache.helix.tools.ClusterStateVerifier; +import org.testng.Assert; +import org.testng.annotations.AfterClass; +import 
org.testng.annotations.BeforeClass; +import org.testng.annotations.Test; + +import com.google.common.base.Joiner; +import com.google.common.collect.ImmutableMap; + +public class TestTaskRebalancerParallel extends ZkIntegrationTestBase { --- End diff -- Apache license header is missing
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595347#comment-14595347 ] ASF GitHub Bot commented on HELIX-601: -- Github user kanakb commented on a diff in the pull request: https://github.com/apache/helix/pull/30#discussion_r32904781 --- Diff: helix-core/src/main/java/org/apache/helix/task/TaskRebalancer.java --- @@ -134,14 +134,22 @@ public ResourceAssignment computeBestPossiblePartitionState(ClusterDataCache clu workflowCtx.setStartTime(System.currentTimeMillis()); } -// Check parent dependencies -for (String parent : workflowCfg.getJobDag().getDirectParents(resourceName)) { - if (workflowCtx.getJobState(parent) == null - || !workflowCtx.getJobState(parent).equals(TaskState.COMPLETED)) { -return emptyAssignment(resourceName, currStateOutput); +// check ancestor job status +int unStartCount = 0; --- End diff -- Please rename to `notStartedCount` and `incompleteCount`
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595348#comment-14595348 ]

ASF GitHub Bot commented on HELIX-601:

Github user kanakb commented on a diff in the pull request:
https://github.com/apache/helix/pull/30#discussion_r32904867

--- Diff: helix-core/src/main/java/org/apache/helix/task/TaskRebalancer.java ---
@@ -219,6 +227,32 @@ public ResourceAssignment computeBestPossiblePartitionState(ClusterDataCache clu
     return newAssignment;
   }
 
+  private Set getWorkflowAssignedInstances(String currentJobName,
--- End diff --

Method name should indicate that the returned value does not consider the current job.
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595359#comment-14595359 ]

ASF GitHub Bot commented on HELIX-601:

Github user kanakb commented on the pull request:
https://github.com/apache/helix/pull/30#issuecomment-113998619

This feels like a hack. If A depends on B, then A should never run before B. If it is acceptable for A and B to run in parallel, then A should not depend on B.
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596247#comment-14596247 ]

ASF GitHub Bot commented on HELIX-601:

Github user jicongrui commented on the pull request:
https://github.com/apache/helix/pull/30#issuecomment-114181050

This is kind of hacky, but Helix has no better way to handle it. The request falls somewhere between no ordering at all (a workflow with no dependencies) and total ordering (a jobDag, where job2 can't run until job1 has finished). The request is for job1 to be scheduled before job2, while job2 can still be scheduled even if some participants get stuck on job1.
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596328#comment-14596328 ]

ASF GitHub Bot commented on HELIX-601:

Github user jicongrui commented on the pull request:
https://github.com/apache/helix/pull/30#issuecomment-114198676

Updated the pull request per the review comments.
[jira] [Commented] (HELIX-599) Support creating/maintaining/routing resources with same names in different instance groups
[ https://issues.apache.org/jira/browse/HELIX-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596405#comment-14596405 ]

ASF GitHub Bot commented on HELIX-599:

GitHub user lei-xia opened a pull request:

    https://github.com/apache/helix/pull/31

    [HELIX-599] Support creating/maintaining/routing resources with same names in different instance groups.

    More details on the problems and our proposed solution are in the JIRA description: https://issues.apache.org/jira/browse/HELIX-599

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lei-xia/helix helix-0.6.x

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/31.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #31

commit 2f88e070fb698c1420873c1bffa63640638de1ba
Author: Lei Xia
Date: 2015-05-11T17:54:27Z

    [HELIX-599] Support creating/maintaining/routing resources with same names in different instance groups.

> Support creating/maintaining/routing resources with same names in different instance groups
> -------------------------------------------------------------------------------------------
>
>                 Key: HELIX-599
>                 URL: https://issues.apache.org/jira/browse/HELIX-599
>             Project: Apache Helix
>          Issue Type: New Feature
>          Components: helix-core, helix-webapp-admin
>            Reporter: Lei Xia
>            Assignee: Lei Xia
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> At LinkedIn, we have a new use scenario in which multiple databases
> sit in the same Helix cluster with the same name, but on different
> instance groups. What we need is:
> 1) Allow resources (databases) with the same name; these resources are on
> different instance groups (with different tags).
> 2) The routing table (Spectator) is able to aggregate and return all instances
> (from multiple instance groups) that hold the database with the given name.
>
> Our proposed solution is:
> 1) Add a "Resource Group" field in IdealState for the databases with the
> same names from different instance groups.
> 2) Use Instance Group Tag (or new "Resource Tag") to differentiate databases
> (with same name) from different instance groups.
> 3) Use name mangling for IdealState; for example, with database TestDB in
> instance group "testGroup", the IdealState and ExternalView id would be
> "TestDB$testGroup".
> 4) Change the Helix Routing Table to be able to aggregate databases from the
> same resource group.
>
> Four new APIs are going to be added to RoutingTableProvider:
> public class RoutingTableProvider {
>
>   /**
>    * returns the instances that contain the given partition in a specific state
>    * from all resources with given resource name
>    */
>   public List getInstances(String resource, String partition, String state);
>
>   /**
>    * returns the instances that contain the given partition in a specific state
>    * from selected resources with given name and tags
>    */
>   public List getInstances(String resource, String partition, String state,
>       List resourceTags);
>
>   /**
>    * returns instances that contain given resource that are in a specific state
>    */
>   public Set getInstances(String resource, String state);
>
>   /**
>    * returns instances that contain given resource with tags that are in a
>    * specific state
>    */
>   public Set getInstances(String resource, String state, List groupTags);
> }

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
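
The name mangling and routing-table aggregation proposed in the HELIX-599 description above can be illustrated with a minimal Java sketch. Everything here is hypothetical and is not code from the pull request: the class `ResourceGroupSketch`, its method names, and the simplification of recovering the group by splitting on `$` (the proposal itself stores the group in an IdealState field) are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the Helix code base. */
final class ResourceGroupSketch {

  /** Mangles a resource name with its instance group tag, e.g. "TestDB$testGroup". */
  static String mangledName(String resourceName, String instanceGroupTag) {
    return resourceName + "$" + instanceGroupTag;
  }

  /**
   * Recovers the resource group from a mangled or plain id.
   * Assumption: '$' never occurs inside a plain resource name.
   */
  static String resourceGroupOf(String id) {
    int idx = id.indexOf('$');
    return idx < 0 ? id : id.substring(0, idx);
  }

  /**
   * Aggregates instances across every view that belongs to the given group.
   * The input maps an ExternalView id to the instances serving it.
   */
  static List<String> instancesForGroup(Map<String, List<String>> instancesByView,
      String group) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, List<String>> e : instancesByView.entrySet()) {
      if (resourceGroupOf(e.getKey()).equals(group)) {
        result.addAll(e.getValue());
      }
    }
    return result;
  }
}
```

With views `TestDB$groupA` and `TestDB$groupB`, a lookup for group `TestDB` returns the instances of both, which is the aggregation behaviour the description asks of the routing table.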
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609556#comment-14609556 ]

ASF GitHub Bot commented on HELIX-601:

Github user kanakb commented on the pull request:
https://github.com/apache/helix/pull/30#issuecomment-117433383

This should fail fast if you try to set parallelism on workflows that do not have a target resource. Also, what if the target resource has its partitions assigned to other instances? I'm not fully convinced that this is safe except in the case where the task has a target resource and that target resource is assigned to a fixed set of instances.
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609577#comment-14609577 ]

ASF GitHub Bot commented on HELIX-601:

Github user jicongrui commented on the pull request:
https://github.com/apache/helix/pull/30#issuecomment-117438877

The assignment follows the same logic as before and can be treated as a black box whose input is a job and whose output is a task assignment. This diff only inspects that output, the task assignment, and removes busy instances from it. For example, if the target resource moves to a different set of instances, the task assignment would contain no busy instances, so job2 can be executed on any of the new instances.
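
The post-processing step described in the comment above (take the computed task assignment and drop the instances that are already busy) might look roughly like the following Java sketch. This is an illustration only, not the code in the pull request: `BusyInstanceFilterSketch`, `removeBusyInstances`, and the representation of an assignment as a map from task partition number to instance name are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of filtering busy instances out of a task assignment. */
final class BusyInstanceFilterSketch {

  /**
   * Returns a copy of the assignment (task partition -> instance) without the
   * entries whose instance is already running another job of the same workflow.
   * The assignment itself is treated as a black box and is not modified.
   */
  static Map<Integer, String> removeBusyInstances(Map<Integer, String> assignment,
      Set<String> busyInstances) {
    Map<Integer, String> filtered = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> e : assignment.entrySet()) {
      if (!busyInstances.contains(e.getValue())) {
        filtered.put(e.getKey(), e.getValue());
      }
    }
    return filtered;
  }
}
```

Partitions that are filtered out are simply not scheduled in this pass; under this sketch they would be picked up in a later pass once the instance is no longer busy.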
[jira] [Commented] (HELIX-601) Allow work flow to schedule dependency jobs in parallel
[ https://issues.apache.org/jira/browse/HELIX-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613456#comment-14613456 ]

ASF GitHub Bot commented on HELIX-601:

Github user kanakb commented on the pull request:
https://github.com/apache/helix/pull/30#issuecomment-118425221

1. Let's say we have a target resource with 2 partitions and 0 replicas, with one partition assigned to node A, and one partition assigned to node B. Job 0 runs on nodes A and B, it finishes on node A, and then Job 1 starts on node A. Then imagine node B fails, and both partitions are now on node A. Job 1 is running on node A, but Job 0 did not finish for the partition that was reassigned to node A. We have a dependency inversion, and that's why this is unsafe.

2. If the job does not have a target resource, this change doesn't make sense. An exception should be thrown if you attempt to submit an untargeted workflow that has parallelism set.
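
The fail-fast check requested in point 2 above could be sketched as follows. This is a hypothetical illustration, not Helix code: `WorkflowValidationSketch`, `JobSpec`, and the `parallelJobs` parameter name are assumptions, and a parallel count of 1 is taken to mean "no parallelism", as in the issue description.

```java
import java.util.List;

/** Hypothetical sketch of rejecting parallelism on untargeted workflows. */
final class WorkflowValidationSketch {

  /** Minimal stand-in for a job's configuration. */
  static final class JobSpec {
    final String name;
    final String targetResource; // null or empty when the job is untargeted

    JobSpec(String name, String targetResource) {
      this.name = name;
      this.targetResource = targetResource;
    }
  }

  /**
   * Throws at submission time if parallelism is requested but some job has no
   * target resource, instead of letting the workflow misbehave later.
   */
  static void validate(int parallelJobs, List<JobSpec> jobs) {
    if (parallelJobs <= 1) {
      return; // sequential scheduling is always allowed
    }
    for (JobSpec job : jobs) {
      if (job.targetResource == null || job.targetResource.isEmpty()) {
        throw new IllegalArgumentException("Job " + job.name
            + " has no target resource; parallel job scheduling is not supported");
      }
    }
  }
}
```

Note that this check only covers the untargeted case; it does not address the dependency inversion from point 1, which arises when partitions of a targeted resource move between instances.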
[jira] [Commented] (HELIX-599) Support creating/maintaining/routing resources with same names in different instance groups
[ https://issues.apache.org/jira/browse/HELIX-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613459#comment-14613459 ]

ASF GitHub Bot commented on HELIX-599:

Github user kishoreg commented on a diff in the pull request:
https://github.com/apache/helix/pull/31#discussion_r33881022

--- Diff: helix-core/src/main/java/org/apache/helix/controller/stages/MessageGenerationPhase.java ---
@@ -127,7 +127,9 @@ public void process(ClusterEvent event) throws Exception {
           Message message =
               createMessage(manager, resourceName, partition.getPartitionName(), instanceName,
                   currentState, nextState, sessionIdMap.get(instanceName), stateModelDef.getId(),
-                  resource.getStateModelFactoryname(), bucketSize);
--- End diff --

Can we simply pass in the entire resource, so that createMessage can fetch the required attributes from it?
[jira] [Commented] (HELIX-599) Support creating/maintaining/routing resources with same names in different instance groups
[ https://issues.apache.org/jira/browse/HELIX-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613460#comment-14613460 ]

ASF GitHub Bot commented on HELIX-599:

Github user kishoreg commented on a diff in the pull request:
https://github.com/apache/helix/pull/31#discussion_r33881054

--- Diff: helix-core/src/main/java/org/apache/helix/controller/stages/MessageGenerationPhase.java ---
@@ -190,7 +192,8 @@ public void process(ClusterEvent event) throws Exception {
   private Message createMessage(HelixManager manager, String resourceName, String partitionName,
       String instanceName, String currentState, String nextState, String sessionId,
-      String stateModelDefName, String stateModelFactoryName, int bucketSize) {
--- End diff --

This method has too many parameters. We should either pass in the resource or introduce a message builder class.
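
The message builder suggested in the review comment on `createMessage` above could take roughly the following shape. This is a sketch under stated assumptions, not Helix's `Message` class: `MessageBuilderSketch`, `TransitionMessage`, and all builder method names are hypothetical, and only a subset of the parameters from the diff is modelled.

```java
/** Hypothetical sketch of a builder replacing a long parameter list. */
final class MessageBuilderSketch {

  /** Immutable stand-in for a state transition message. */
  static final class TransitionMessage {
    final String resourceName;
    final String partitionName;
    final String instanceName;
    final String fromState;
    final String toState;
    final int bucketSize;

    private TransitionMessage(Builder b) {
      this.resourceName = b.resourceName;
      this.partitionName = b.partitionName;
      this.instanceName = b.instanceName;
      this.fromState = b.fromState;
      this.toState = b.toState;
      this.bucketSize = b.bucketSize;
    }
  }

  static final class Builder {
    private String resourceName;
    private String partitionName;
    private String instanceName;
    private String fromState;
    private String toState;
    private int bucketSize; // assumption: 0 means "no bucketing"

    /** Groups the attributes that would otherwise be read off the resource. */
    Builder resource(String name, int bucketSize) {
      this.resourceName = name;
      this.bucketSize = bucketSize;
      return this;
    }

    Builder partition(String name) {
      this.partitionName = name;
      return this;
    }

    Builder instance(String name) {
      this.instanceName = name;
      return this;
    }

    Builder transition(String from, String to) {
      this.fromState = from;
      this.toState = to;
      return this;
    }

    /** Fails early if a required attribute was never set. */
    TransitionMessage build() {
      if (resourceName == null || partitionName == null || instanceName == null
          || fromState == null || toState == null) {
        throw new IllegalStateException("resource, partition, instance and transition are required");
      }
      return new TransitionMessage(this);
    }
  }
}
```

A call site would then read as a chain such as `new MessageBuilderSketch.Builder().resource("TestDB", 0).partition("TestDB_0").instance("host1").transition("OFFLINE", "SLAVE").build()`, so adding a new resource attribute no longer changes every caller's argument list.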
[jira] [Commented] (HELIX-599) Support creating/maintaining/routing resources with same names in different instance groups
[ https://issues.apache.org/jira/browse/HELIX-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613461#comment-14613461 ]

ASF GitHub Bot commented on HELIX-599:

Github user kishoreg commented on a diff in the pull request:
https://github.com/apache/helix/pull/31#discussion_r33881099

--- Diff: helix-core/src/main/java/org/apache/helix/model/IdealState.java ---
@@ -55,7 +55,9 @@
     MAX_PARTITIONS_PER_INSTANCE,
     INSTANCE_GROUP_TAG,
     REBALANCER_CLASS_NAME,
-    HELIX_ENABLED
+    HELIX_ENABLED,
+    RESOURCE_GROUP_NAME,
+    RESOURCE_GROUP_ENABLED
--- End diff --

Why do we need the ResourceGroupEnabled flag? Would things work as expected if there were only a resourceGroupName, which by default we set to the resourceName?
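
The alternative raised in the review comment on `RESOURCE_GROUP_ENABLED` above (no separate flag, with the group name defaulting to the resource name) can be sketched in a few lines of Java. The class and method names are hypothetical and do not appear in the pull request.

```java
/** Hypothetical sketch: derive the group name instead of storing an enabled flag. */
final class ResourceGroupNameSketch {

  /**
   * Returns the configured group name, or the resource's own name when none is
   * configured, so that an ungrouped resource forms a group of one.
   */
  static String effectiveGroupName(String resourceName, String configuredGroupName) {
    if (configuredGroupName == null || configuredGroupName.isEmpty()) {
      return resourceName;
    }
    return configuredGroupName;
  }
}
```

Under this convention, "grouping is enabled" would be equivalent to the effective group name differing from the resource name, which is why the extra flag may be redundant.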
[jira] [Commented] (HELIX-599) Support creating/maintaining/routing resources with same names in different instance groups
[ https://issues.apache.org/jira/browse/HELIX-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613462#comment-14613462 ]

ASF GitHub Bot commented on HELIX-599:

Github user kishoreg commented on a diff in the pull request:
https://github.com/apache/helix/pull/31#discussion_r33881174

--- Diff: helix-core/src/main/java/org/apache/helix/spectator/RoutingTableProvider.java ---
@@ -73,6 +75,73 @@ public RoutingTableProvider() {
   }
 
   /**
+   * returns the instances for {resource,partition} pair that are in a specific {state} if
+   * aggregateGrouping is turned on, find all resources belongs to the given resourceGroupName and
+   * aggregate all partition states from all these resources.
+   *
+   * @param resourceName
+   * @param partitionName
+   * @param state
+   * @param groupingEnabled
+   *
+   * @return empty list if there is no instance in a given state
+   */
+  public List getInstances(String resourceName, String partitionName, String state,
--- End diff --

Having a boolean here does not make sense. We should probably have getInstancesForResource and getInstancesForResourceGroup.
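
The API split suggested in the review comment above (two explicit methods instead of one method with a boolean switch) could be sketched as follows. The method names `getInstancesForResource` and `getInstancesForResourceGroup` come from the comment; everything else, including the class `RoutingTableSketch`, the `add` method, and the omission of partition and state from the lookup key, is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; partition and state are left out to keep it short. */
final class RoutingTableSketch {
  private final Map<String, List<String>> instancesByResource = new LinkedHashMap<>();
  private final Map<String, Set<String>> resourcesByGroup = new LinkedHashMap<>();

  /** Records that an instance serves a resource belonging to a resource group. */
  void add(String resourceGroup, String resource, String instance) {
    instancesByResource.computeIfAbsent(resource, k -> new ArrayList<>()).add(instance);
    resourcesByGroup.computeIfAbsent(resourceGroup, k -> new LinkedHashSet<>()).add(resource);
  }

  /** Looks up exactly one resource; returns an empty list if it is unknown. */
  List<String> getInstancesForResource(String resource) {
    return instancesByResource.getOrDefault(resource, Collections.emptyList());
  }

  /** Aggregates over every resource registered under the given group. */
  List<String> getInstancesForResourceGroup(String resourceGroup) {
    List<String> result = new ArrayList<>();
    for (String resource : resourcesByGroup.getOrDefault(resourceGroup, Collections.emptySet())) {
      result.addAll(getInstancesForResource(resource));
    }
    return result;
  }
}
```

Separate methods make the caller's intent visible at the call site, whereas a trailing `true` or `false` argument has to be looked up to be understood.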
[jira] [Commented] (HELIX-599) Support creating/maintaining/routing resources with same names in different instance groups
[ https://issues.apache.org/jira/browse/HELIX-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613486#comment-14613486 ]

ASF GitHub Bot commented on HELIX-599:

Github user kishoreg commented on a diff in the pull request:
https://github.com/apache/helix/pull/31#discussion_r33882177

--- Diff: helix-core/src/main/java/org/apache/helix/model/IdealState.java ---
@@ -536,4 +574,16 @@ public boolean isEnabled() {
   public void enable(boolean enabled) {
     _record.setSimpleField(IdealStateProperty.HELIX_ENABLED.name(), Boolean.toString(enabled));
   }
+
+  /**
+   * Get the mangled IdealState name if resourceGroup is enable.
+   *
+   * @param resourceName
+   * @param resourceTag
+   *
+   * @return
+   */
+  public static String getIdealStateName(String resourceName, String resourceTag) {
--- End diff --

Why do we need this method? This convention can be handled completely on the client side, right?