[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/2203 @ptgoetz Please review the change in HeartSaVioR/storm@16bfb84 and cherry-pick the commit if you think it's worth it. I'd really appreciate it if you could also spend some time running the performance test on your side and share the numbers, so we can see whether we are happy with the patch. (1.x-branch vs metrics_v2 vs metrics_v2_experimental_no_concurrentmap) @revans2 Could you review this again, including from a performance perspective? ---
[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/2203 I'm trying to optimize further, since the numbers don't make me happy yet. I've applied pre-populating metrics (as @revans2 suggested) to avoid using ConcurrentHashMap. https://github.com/HeartSaVioR/storm/commits/metrics_v2_experimental_no_concurrentmap This commit (https://github.com/HeartSaVioR/storm/commit/16bfb84ebab98991fe4a8bd284dae9fe2bfe437d) shows the diff against 7947a07. Performance test results from the same environment (the numbers may fluctuate, as we know):

> 1.2.0-SNAPSHOT (9fbe283) - NOTE: ran earlier, just copied from the result above

>> 1st

uptime | acked | acked/sec | failed | 99% | 99.9% | min | max | mean | stddev | user | sys | gc | mem
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
30 | 552,480 | 18,416.00 | 0 | 4,114,612,223 | 4,731,174,911 | 38,928,384 | 5,448,400,895 | 1,954,055,956.00 | 966,653,713.81 | 156,170 | 5,290 | 0 | 686.05
60 | 1,578,480 | 52,616.00 | 0 | 1,695,547,391 | 2,191,523,839 | 6,631,424 | 2,736,783,359 | 144,862,518.23 | 373,282,085.72 | 192,290 | 27,900 | 5,208 | 824.31
91 | 1,505,000 | 48,548.39 | 0 | 36,732,927 | 63,275,007 | 6,549,504 | 92,340,223 | 14,052,122.48 | 5,284,329.78 | 172,290 | 30,260 | 2,240 | 794.81
121 | 1,504,400 | 50,146.67 | 0 | 33,505,279 | 55,902,207 | 6,701,056 | 107,020,287 | 13,687,527.65 | 4,682,294.46 | 169,990 | 30,770 | 2,248 | 818.05
151 | 1,503,640 | 50,121.33 | 0 | 30,801,919 | 42,827,775 | 6,594,560 | 62,357,503 | 13,450,356.75 | 4,049,872.91 | 171,380 | 30,030 | 2,217 | 891.86
181 | 1,504,220 | 50,140.67 | 0 | 31,604,735 | 46,366,719 | 6,414,336 | 72,417,279 | 13,549,466.35 | 4,224,069.48 | 173,460 | 30,120 | 2,214 | 817.73
211 | 1,503,580 | 50,119.33 | 0 | 33,816,575 | 46,858,239 | 6,545,408 | 68,681,727 | 13,536,288.26 | 4,442,740.15 | 169,950 | 30,330 | 2,346 | 759.54
241 | 1,503,360 | 50,112.00 | 0 | 33,947,647 | 46,792,703 | 6,488,064 | 68,812,799 | 13,633,138.42 | 4,510,106.94 | 171,560 | 30,490 | 2,456 | 791.93
271 | 1,052,500 | 35,083.33 | 0 | 28,016,639 | 39,452,671 | 6,586,368 | 58,949,631 | 13,142,065.09 | 3,481,304.94 | 170,870 | 29,260 | 2,113 | 795.18
301 | 1,504,320 | 50,144.00 | 0 | 35,454,975 | 60,129,279 | 6,696,960 | 82,575,359 | 13,781,332.77 | 4,965,559.58 | 171,430 | 30,730 | 2,372 | 826.15

>> 2nd

uptime | acked | acked/sec | failed | 99% | 99.9% | min | max | mean | stddev | user | sys | gc | mem
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
31 | 547,120 | 17,649.03 | 0 | 4,546,625,535 | 5,049,942,015 | 139,984,896 | 5,536,481,279 | 2,700,418,416.88 | 836,528,892.27 | 157,850 | 4,930 | 0 | 845.24
61 | 1,584,360 | 52,812.00 | 0 | 3,078,619,135 | 3,839,885,311 | 6,590,464 | 4,284,481,535 | 248,409,981.64 | 629,496,874.97 | 195,980 | 27,490 | 5,697 | 874.53
91 | 1,504,680 | 50,156.00 | 0 | 35,258,367 | 53,706,751 | 5,910,528 | 91,488,255 | 13,589,168.15 | 4,802,833.49 | 174,400 | 29,550 | 2,183 | 871.71
121 | 1,503,900 | 50,130.00 | 0 | 32,538,623 | 52,789,247 | 6,533,120 | 100,466,687 | 13,482,307.07 | 4,543,434.30 | 186,560 | 32,360 | 2,524 | 802.76
151 | 1,503,860 | 50,128.67 | 0 | 32,735,231 | 52,854,783 | 5,849,088 | 97,976,319 | 13,396,304.62 | 4,506,368.26 | 188,430 | 31,780 | 2,514 | 785.88
181 | 1,503,340 | 50,111.33 | 0 | 29,802,495 | 41,156,607 | 6,737,920 | 63,209,471 | 13,106,089.43 | 3,755,495.86 | 172,070 | 29,540 | 2,151 | 959.27
211 | 1,503,740 | 50,124.67 | 0 | 33,423,359 | 52,625,407 | 6,549,504 | 85,000,191 | 13,426,039.55 | 4,508,334.58 | 187,360 | 32,380 | 2,525 | 869.87
241 | 1,578,440 | 52,614.67 | 0 | 32,129,023 | 50,102,271 | 6,549,504 | 88,276,991 | 13,265,778.42 | 4,265,565.26 | 172,610 | 29,340 | 2,308 | 928.26
271 | 1,503,060 | 50,102.00 | 0 | 30,457,855 | 45,678,591 | 6,565,888 | 73,662,463 | 13,237,281.21 | 4,002,588.03 | 173,180 | 30,550 | 2,291 | 870.36
301 | 1,541,260 | 51,375.33 | 0 | 30,932,991 | 44,498,943 | 6,569,984 | 67,010,559 | 13,223,793.77 | 4,002,853.30 | 172,220 | 29,830 | 2,302 | 1,032.96

>> 3rd

uptime | acked | acked/sec | failed | 99% | 99.9% | min | max | mean | stddev | user | sys | gc | mem
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
30 | 548,680 | 18,289.33 | 0 | 4,353,687,551 | 5,096,079,359 | 6,729,728 | 5,742,002,175 | 1,426,311,406.48 | 1,230,114,275.09 | 232,920 | 14,020 | 4,188 | 999.57
60 | 1,586,360 | 52,878.67 | 0 | 49,643,519 | 85,590,015 | 6,541,312 | 142,082,047 | 15,200,355.35 | 7,698,991.41 | 180,110 | 33,300 | 2,347 | 1,068.73
90 | 1,504,920 | 50,164.00 | 0 | 35,487,743 | 54,067,199 | 6,561,792 | 95,354,879 | 13,937,343.35 | 4,915,338.14 | 177,210 | 33,140 | 2,256 | 929.59
120 | 1,504,980 | 50,166.00 | 0 | 34,340,863 | 54,558,719 |
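The "pre-populate metrics" idea referenced above can be sketched roughly as follows. This is a toy illustration under assumptions, not the actual commit: `LongAdder` stands in for codahale's `Counter`, and all names are made up. The point is that when the set of output streams is known at task init, a plain `HashMap` can be filled once and then read lock-free on the hot path, instead of hitting a `ConcurrentHashMap` per tuple.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: build all per-stream counters up front so the
// hot-path lookup is a read on an effectively-immutable HashMap.
public class PrepopulatedTaskMetrics {
    private final Map<String, LongAdder> ackedByStream = new HashMap<>();

    public PrepopulatedTaskMetrics(Iterable<String> outputStreams) {
        for (String stream : outputStreams) {
            ackedByStream.put(stream, new LongAdder());
        }
        // the map is never mutated after construction, so unsynchronized
        // reads from the executor thread are safe
    }

    public LongAdder getAcked(String streamId) {
        return ackedByStream.get(streamId); // plain read, no CAS or locking
    }
}
```

Safe publication of the map to other threads (e.g. a reporter thread) would still need a final field or similar barrier, as in the sketch above.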
[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API
Github user ptgoetz commented on a diff in the pull request: https://github.com/apache/storm/pull/2203#discussion_r161139613

--- Diff: storm-core/src/jvm/org/apache/storm/metrics2/TaskMetrics.java ---

@@ -0,0 +1,81 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.metrics2;
+
+import com.codahale.metrics.Counter;
+import org.apache.storm.task.WorkerTopologyContext;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+
+public class TaskMetrics {
+    ConcurrentMap<String, Counter> ackedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> failedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> emittedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> transferredByStream = new ConcurrentHashMap<>();
+
+    private String topologyId;
+    private String componentId;
+    private Integer taskId;
+    private Integer workerPort;
+
+    public TaskMetrics(WorkerTopologyContext context, String componentId, Integer taskid) {
+        this.topologyId = context.getStormId();
+        this.componentId = componentId;
+        this.taskId = taskid;
+        this.workerPort = context.getThisWorkerPort();
+    }
+
+    public Counter getAcked(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);
+        if (c == null) {
+            c = StormMetricRegistry.counter("acked", this.topologyId, this.componentId, this.taskId, this.workerPort, streamId);
+            this.ackedByStream.put(streamId, c);
+        }
+        return c;
+    }
+
+    public Counter getFailed(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);

--- End diff --

Good catch. Thanks for the fix.

---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user JessicaLHartog commented on the issue: https://github.com/apache/storm/pull/2433 @danny0405 @revans2 Thanks for keeping the Storm on Mesos project in mind. We (@erikdw, @srishtyagrawal, and myself) appreciate it! Right now, Storm on Mesos doesn't work for Storm versions 1.1.x and above, because we rely on slot-centric scheduling instead of the now-supervisor-centric scheduling (for more details see [STORM-2126](https://issues.apache.org/jira/browse/STORM-2126)). We talked to @revans2 about this in our issue [#222](https://github.com/mesos/storm/issues/222#issuecomment-352514556)... so that's a bit of context that may be necessary to understand the problem a little better. With respect to the ask not to further break Storm on Mesos, we think it would be best for us to be able to specify (at runtime) to the supervisor which port it needs to listen on for these heartbeat messages. The way we see it, while the Supervisor being able to pick from a range for the heartbeat port sounds like it may work, the Supervisor should _also_ be able to accept an assigned heartbeat port. Namely, once we adapt to the changes suggested [here](https://github.com/mesos/storm/issues/222#issuecomment-352530608), we can start each Supervisor using one of the ports Mesos offers to us. ---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user danny0405 commented on the issue: https://github.com/apache/storm/pull/2433 @HeartSaVioR @revans2 Based on your comments, I think there are mainly 3 TODOs for this patch:

1. Supervisor container-ability improvements, like supporting multiple supervisor instances on one machine.
2. Assignment security should be guaranteed.
3. Backwards compatibility with old-version Storm workers.

For TODO 1, I can make the supervisor thrift port be picked from a range on each machine node, and make nimbus aware of the port info through SupervisorHeartbeats. I will also pass the port as a startup argument so that workers know their parent supervisor's port. For TODO 2, I understand that I should wait for @revans2's token authentication, right? For TODO 3, I don't know how much work it would take; if @HeartSaVioR can help contribute it, I would very much appreciate it. Also, the supervisor's local-disk worker heartbeats could actually be removed as well, and that is easy to do in this patch; should I also do this? ---
[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/storm/pull/2203#discussion_r161135637

--- Diff: storm-core/src/jvm/org/apache/storm/metrics2/TaskMetrics.java ---

@@ -0,0 +1,81 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.metrics2;
+
+import com.codahale.metrics.Counter;
+import org.apache.storm.task.WorkerTopologyContext;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+
+public class TaskMetrics {
+    ConcurrentMap<String, Counter> ackedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> failedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> emittedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> transferredByStream = new ConcurrentHashMap<>();
+
+    private String topologyId;
+    private String componentId;
+    private Integer taskId;
+    private Integer workerPort;
+
+    public TaskMetrics(WorkerTopologyContext context, String componentId, Integer taskid) {
+        this.topologyId = context.getStormId();
+        this.componentId = componentId;
+        this.taskId = taskid;
+        this.workerPort = context.getThisWorkerPort();
+    }
+
+    public Counter getAcked(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);
+        if (c == null) {
+            c = StormMetricRegistry.counter("acked", this.topologyId, this.componentId, this.taskId, this.workerPort, streamId);
+            this.ackedByStream.put(streamId, c);
+        }
+        return c;
+    }
+
+    public Counter getFailed(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);

--- End diff --

missed a spot: it should be `failedByStream`

---
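For context, here is a minimal self-contained sketch of the fix being called out (toy types: `LongAdder` replaces codahale's `Counter`, and this is not the actual Storm class). Each getter must read its own map; as a bonus, `computeIfAbsent` avoids the racy get-then-put pattern in the quoted diff.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the copy/paste fix: each map backs its own getter.
public class TaskCounters {
    final ConcurrentMap<String, LongAdder> ackedByStream = new ConcurrentHashMap<>();
    final ConcurrentMap<String, LongAdder> failedByStream = new ConcurrentHashMap<>();

    public LongAdder getAcked(String streamId) {
        // computeIfAbsent does the check-then-create atomically
        return ackedByStream.computeIfAbsent(streamId, s -> new LongAdder());
    }

    public LongAdder getFailed(String streamId) {
        // the bug was reading ackedByStream here; it must be failedByStream
        return failedByStream.computeIfAbsent(streamId, s -> new LongAdder());
    }
}
```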
[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/2203 Yes, my report mainly shows that the critical perf issue Bobby observed has been resolved, but we may still need more runs if we want to do a detailed comparison. ---
[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user roshannaik commented on the issue: https://github.com/apache/storm/pull/2203 @ptgoetz, I suggest measuring the perf impact with ACKers=0 as well (with workers=1). Having ACKing enabled or inter-worker communication going on hides many perf issues. In addition to TVL, it is important to measure perf with other topos like [ConstSpoutNullBoltTopo](https://github.com/apache/storm/blob/master/examples/storm-perf/src/main/java/org/apache/storm/perf/ConstSpoutNullBoltTopo.java) and [ConstSpoutIdBoltNullBoltTopo](https://github.com/apache/storm/blob/master/examples/storm-perf/src/main/java/org/apache/storm/perf/ConstSpoutIdBoltNullBoltTopo.java). The ConstSpout* topos can expose perf issues that TVL might not... and vice versa. ---
[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/2203 I'm running the performance test on Ubuntu 17.10, kernel 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:14:41 UTC 2018. I applied the latest security patches today, which I guess include the patches for Meltdown and Spectre. Maybe this machine will be less affected than others, since its CPU is a Ryzen, as far as I follow the discussion on the Linux kernel list (only for Meltdown... I'm not following the mailing list, but I read the relevant thread since it was an interesting one). Btw, both builds will run in the same environment, so there shouldn't be much difference between the two sets of results, I think. ---
[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user ptgoetz commented on the issue: https://github.com/apache/storm/pull/2203 When reporting benchmark results, we should include OS patch level. The recent wave of patches will likely mess with benchmarks. ---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2433 @HeartSaVioR I just posted a high-level design for delegation tokens [here](https://issues.apache.org/jira/browse/STORM-2898?focusedCommentId=16323160=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16323160) Please take a look. If it looks good I will start throwing together a patch based on it. Then we can get into some of the specifics associated with it. ---
[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API
Github user ptgoetz commented on a diff in the pull request: https://github.com/apache/storm/pull/2203#discussion_r161087710 --- Diff: storm-core/src/clj/org/apache/storm/daemon/executor.clj --- @@ -265,6 +265,7 @@ :stats (mk-executor-stats <> (sampling-rate storm-conf)) :interval->task->metric-registry (HashMap.) :task->component (:task->component worker) + :task-metrics (TaskMetrics/taskMetricsMap (first task-ids) (last task-ids) worker-context component-id) --- End diff -- @HeartSaVioR Nice, thanks! ---
[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/2203 It would need another performance test, but besides that I'm +1 on the change. ---
[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/storm/pull/2203#discussion_r161086016 --- Diff: storm-core/src/clj/org/apache/storm/daemon/executor.clj --- @@ -265,6 +265,7 @@ :stats (mk-executor-stats <> (sampling-rate storm-conf)) :interval->task->metric-registry (HashMap.) :task->component (:task->component worker) + :task-metrics (TaskMetrics/taskMetricsMap (first task-ids) (last task-ids) worker-context component-id) --- End diff -- Never mind. Just fixed and pushed new commit. ---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/2433 @revans2 Thanks for double-checking. I think you've also suggested removing the behavior of leveraging the disk to communicate between worker and supervisor (so that the way of communication is consistent), but you don't consider it mandatory. Got it. ---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2433 @HeartSaVioR Perhaps we should have a call on the Metrics V2 work and what its status is for Storm 2.0, because I don't totally know myself. What I think is mandatory for this patch:

1. Wait for the delegation token work to go in so we can authenticate connections between the worker and nimbus/supervisors.
2. Add some form of authorization to the newly added APIs:
- getSupervisorAssignments should only be allowed from a supervisor.
- sendSupervisorWorkerHeartbeats should only be allowed from a supervisor.
- sendSupervisorWorkerHeartbeat should verify that it came from the owner of the topology the heartbeat is for.
- sendSupervisorAssignments needs to verify that it came from nimbus.
- getLocalAssignmentForStorm needs to verify that it came from the owner of the topology.
3. The supervisor needs to pick a port from a configured allowable range of ports and get that information to everyone who is going to need it.

If someone wants to drop the old heartbeat mechanisms for talking to nimbus and the supervisors, that is fine with me. However, if we do drop it, I really would love to have a way to maintain backwards compatibility, because otherwise I will have to add it back in myself. If we don't drop it now, I would like to see a follow-on JIRA to remove it, but for that to work as part of a rolling upgrade we would need to support both mechanisms at the same time. ---
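The authorization rules sketched in that list could look roughly like the following self-contained toy (hypothetical class and method names; Storm's real security plumbing via thrift transports and pluggable authorizers is different): each new RPC validates who the caller is before serving the request.

```java
import java.util.Set;

// Hypothetical sketch, not Storm's actual security API: per-RPC caller checks.
public class RpcAuthorizer {
    private final Set<String> supervisorPrincipals;

    public RpcAuthorizer(Set<String> supervisorPrincipals) {
        this.supervisorPrincipals = supervisorPrincipals;
    }

    // e.g. getSupervisorAssignments / sendSupervisorWorkerHeartbeats:
    // only a supervisor principal may call these.
    public void checkSupervisor(String caller) {
        if (!supervisorPrincipals.contains(caller)) {
            throw new SecurityException("not a supervisor: " + caller);
        }
    }

    // e.g. getLocalAssignmentForStorm / sendSupervisorWorkerHeartbeat:
    // only the owner of the topology may call these.
    public void checkTopologyOwner(String caller, String topologyOwner) {
        if (!caller.equals(topologyOwner)) {
            throw new SecurityException(caller + " does not own this topology");
        }
    }
}
```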
[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/storm/pull/2203#discussion_r161080512 --- Diff: storm-core/src/clj/org/apache/storm/daemon/executor.clj --- @@ -265,6 +265,7 @@ :stats (mk-executor-stats <> (sampling-rate storm-conf)) :interval->task->metric-registry (HashMap.) :task->component (:task->component worker) + :task-metrics (TaskMetrics/taskMetricsMap (first task-ids) (last task-ids) worker-context component-id) --- End diff -- We can create the TaskMetrics instance in task-data and even get rid of the map lookup. I guess TaskMetrics doesn't have any executor-specific metrics. ---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/2433 @revans2 Thanks again for your detailed explanation, and for taking up the hard security work that has kept this patch stuck. I see you are thinking about deprecating pacemaker after addressing the metrics load in ZK, and then you are totally right that we may want to avoid making it the default. At least I think you are now OK with going ahead with this patch plus some follow-up work, and the metrics load in ZK would be removed once we finish all the necessary work and merge this in, which is OK for me. TBH I'm still unsure about the migration plan from Metrics V1 to V2 (including built-in metrics), since it also opens the possibility for metrics to be transferred in a different way (maybe via a metrics reporter), but that might be off-topic for this patch, and we could discuss it after merging Metrics V2 to both 1.x and master. Please double-check my understanding. If it is right, you're suggesting the following changes:

1. (a little unsure whether this is also mandatory for non-container deployments) Modify the supervisor to pick a free port (the range should be configurable) for its thrift server instead of a configured static value.
2. Change the supervisor to include the picked port in its heartbeat, and change nimbus to use that port information instead of a configured static value when communicating via thrift RPC.
3. Change the worker to report its heartbeat to the supervisor via RPC, not via local state (disk), which can be problematic for storm-mesos or other container solutions.
4. Add a heartbeat fallback mechanism for old-version workers (reading from ZK).

@danny0405 Could you check the comments from @revans2 and provide input? Please also let me know if you would rather not deal with supporting old-version workers; I'll see if I can address it on top of your patch. ---
[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API
Github user ptgoetz commented on the issue: https://github.com/apache/storm/pull/2203 @revans2 Could you take another look when you have a chance? ---
[GitHub] storm issue #2492: [STORM-2883] Fix storm-kafka trident API bug
Github user srdo commented on the issue: https://github.com/apache/storm/pull/2492 @dujiashu Happy to help. If you agree that we shouldn't make this change, would you mind closing this PR? Thanks. ---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2433 I was also just thinking that it would be really great if, in the future, we could drop even more access from the workers to zookeeper. If assignments are gone and heartbeats go away, all that is left is credentials, backpressure, and possibly a few other small things. But backpressure is going to go away too, so that would just be some small odds and ends. If we could add a return object to the heartbeats, we could in the future add support for fetching changed credentials, etc., and not have to worry about zookeeper except in client-side code. So all I am asking is that, for the heartbeat thrift calls, we make the result an empty object and not void. That way we have a clean way to expand it in the future. ---
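The "empty object instead of void" suggestion could look like this hypothetical Thrift IDL fragment (a sketch only: struct names, field ids, and the service shape are illustrative, not the actual storm.thrift definitions). Returning an empty struct lets optional fields be added later without changing the RPC signature:

```thrift
// Hypothetical sketch, not the real storm.thrift.
struct SupervisorWorkerHeartbeat {
  1: required string stormId;
  2: required i32 port;
}

struct SupervisorWorkerHeartbeatResponse {
  // intentionally empty for now; e.g. an optional field carrying refreshed
  // credentials could be added here later in a wire-compatible way
}

service Supervisor {
  SupervisorWorkerHeartbeatResponse sendSupervisorWorkerHeartbeat(1: SupervisorWorkerHeartbeat heartbeat)
}
```

Because Thrift readers skip unknown optional fields, old clients keep working when new fields are added to the response struct.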
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2433 So I just spoke with management here and did some quick back-of-the-envelope planning, and I think that to unblock this we should support delegation-token-like functionality in Storm. This could potentially make life a lot simpler for all kinds of things, here and with DRPC, etc. I am willing to commit my team to make this happen, so I will file some JIRAs and try to put together a plan/architecture that hopefully others can review. Once we have delegation tokens working, the only real issue is going to be containerized supervisors. I think we can support that by having the supervisor pick a free port in a configured range, and then include that port in its heartbeat to nimbus. We would also need a way to tell the workers what port to use to communicate with the supervisor. For me, I really would like to maintain the ability to run 0.10.x and 1.x topologies on a 2.x cluster. I think this would only require still checking for heartbeats from zookeeper before scheduling, which I don't think has been removed yet, so I am hopeful that it will work with the current patch. ---
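The "pick a free port in a configured range" idea can be sketched like this (an illustrative helper under assumptions, not actual Storm code; the range bounds would come from configuration):

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch: try to bind each port in [lo, hi] and keep the
// first one that succeeds; the caller would then advertise that port
// (e.g. in the supervisor's heartbeat to nimbus).
public class PortPicker {
    public static int pickFreePort(int lo, int hi) {
        for (int port = lo; port <= hi; port++) {
            try (ServerSocket s = new ServerSocket(port)) {
                return port; // bind succeeded, so the port was free
            } catch (IOException bindFailed) {
                // port already in use, try the next one
            }
        }
        throw new IllegalStateException("no free port in [" + lo + ", " + hi + "]");
    }
}
```

Note the probe-then-use gap: another process could grab the port between the probe and the real bind, so a production version would bind the actual thrift server socket directly inside the loop.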
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2433 @HeartSaVioR I didn't really want to make pacemaker the default because not everyone needs it, standing up yet another server is a pain for customers, and I was hoping it would be temporary. Once we have the metrics out of the heartbeat, which, yes, I hope to do conditionally in 2.x and completely remove in 3.x, we can leave simple heartbeats in ZK, or move to pushing the heartbeats directly to nimbus; either is fine with me. Then there would be no need for pacemaker and it could be deprecated. Hopefully I answered most of your other questions in my response to @danny0405 [above](https://github.com/apache/storm/pull/2433#issuecomment-356965437) ---
[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2433 @danny0405

1. The issues we saw with stability were around a pacemaker node going down: it caused exceptions in the clients that we were not handling properly, resulting in workers restarting. We have not seen any issues with heartbeat packets being discarded. If this does not cover the issues you were seeing, I really would love to have a JIRA so I can fix it. We are not running 2.x, or even 1.x, in production, so I cannot say if there is some oddness happening with what we have pushed back, or perhaps its interactions with HA. We are on 0.10.2, and we pulled back a lot from 1.x. This is why we really want to get to 2.x: so we can be aligned with the community again and hopefully not have these kinds of issues. There may be bugs we don't realize right now.

2. With pacemaker HA, if you have 2+ pacemaker servers, each client will randomly select one of the servers to send heartbeats to. If the one it tries to write to is down, at the beginning or in the middle, the heartbeats should then start going to a different, random server. This should hopefully keep the load even between the pacemaker servers. Nimbus is supposed to work all of this out by reading from all the servers, and if it finds more than one heartbeat for a worker it will pick the one with the newest timestamp. This does not scale well on the nimbus side, and it can take more than 2 minutes to download all of the heartbeats, so we have plans to parallelize the download. The metrics don't go to the supervisor, as it does not need/use them currently. It only cares whether the worker is up and still alive, so it knows if it needs to restart it.

3. I totally believe you that this can support a large cluster. Like I said, this is a much better solution long term, and I would love to go this route. We just need to fix the security issues and find a way to support containerized supervisors for me to give it a +1. Both should be doable.

4. There is no security between the workers and pacemaker. There is security between nimbus and pacemaker. This means that only nimbus can see the heartbeats. The worst you can do by faking heartbeats is confuse someone with bad metrics (not ideal) or trick nimbus into thinking a worker is still alive when it is not: bad, but not horrible. It is the assignment portion that is scary to me, because it says what to run. If we pull the assignment portion out I would be OK with that, although it would be best to truly fix it, because we don't have a way to selectively turn off authorization in thrift, so to make that work we would need a separate thrift server on nimbus, which I would rather not do. I would love to see the ability to do delegation tokens in Storm for authentication. This is no small task. It would take a lot of work, especially with HA, which is why I haven't done it. ---
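The newest-timestamp resolution described in point 2, where nimbus keeps only the freshest heartbeat per worker when several pacemaker servers report the same worker, can be sketched as follows (toy types, not nimbus's actual code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge heartbeats read from all pacemaker servers,
// keeping the newest timestamp per worker.
public class HeartbeatMerger {
    public static final class Heartbeat {
        public final String workerId;
        public final long timestampSecs;

        public Heartbeat(String workerId, long timestampSecs) {
            this.workerId = workerId;
            this.timestampSecs = timestampSecs;
        }
    }

    public static Map<String, Heartbeat> newestPerWorker(List<Heartbeat> fromAllServers) {
        Map<String, Heartbeat> newest = new HashMap<>();
        for (Heartbeat hb : fromAllServers) {
            // on a duplicate workerId, keep whichever heartbeat is newer
            newest.merge(hb.workerId, hb,
                (a, b) -> a.timestampSecs >= b.timestampSecs ? a : b);
        }
        return newest;
    }
}
```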
Call for Presentations FOSS Backstage open
Hi,

As announced at Berlin Buzzwords, we (that is, Isabel Drost-Fromm, Stefan Rudnitzki, as well as the eventing team over at newthinking communications GmbH) are working on a new conference this summer in Berlin. The name of this new conference will be "FOSS Backstage". Backstage comprises all things FOSS governance, open collaboration, and how to build and manage communities within the open source space.

Submission URL: https://foss-backstage.de/call-papers

The event will comprise presentations on all things FOSS governance, decentralised decision making, and open collaboration. We invite you to submit talks on the topics: FOSS project governance, collaboration, community management; asynchronous/decentralised decision making; vendor neutrality in FOSS; sustainable FOSS; cross-team collaboration; dealing with poisonous people; project growth and hand-over; trademarks; strategic licensing. While it's primarily targeted at contributions from FOSS people, we would love to also learn more about how typical FOSS collaboration models work well within enterprises. Closely related topics not explicitly listed above are welcome.

Important dates (all dates in GMT+2):
Submission deadline: February 18th, 2018.
Conference: June 13th/14th, 2018

High-quality talks are called for, ranging from principles to practice. We are looking for real-world case studies, background on the social architecture of specific projects, and a deep dive into cross-community collaboration. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio, and email, the title of the talk, and a brief abstract in English.

We have drafted the submission form to allow for regular talks, each 45 min in length. However, you are free to submit your own ideas on how to support the event: if you would like to take our attendees out to show them your favourite bar in Berlin, please submit this offer through the CfP form. If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees, as well as an after-show party), please contact us.

Schedule and further updates on the event will be published soon on the event web page. Please re-distribute this CfP to people who might be interested.

Contact us at:
newthinking communications GmbH
Schoenhauser Allee 6/7
10119 Berlin, Germany
i...@foss-backstage.de

Looking forward to meeting you all in person in summer :)

I would love to see all those tracks filled with lots of valuable talks on the Apache Way, on how we work, on how the incubator works, on how being a 501(c)(3) influences how people get involved and projects are being run, on how being a member-run organisation is different, on merit for life, on growing communities, on things gone great - and things gone entirely wrong in the ASF's history, on how to interact with Apache projects as a corporation, and everything else you can think of.

Isabel

--
Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most likely involving some kind of mobile connection only.)