[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2203
  
@ptgoetz 
Please review the change in HeartSaVioR/storm@16bfb84 and cherry-pick the 
commit if you think it's worth it. I'd really appreciate it if you could also 
spend time running the performance test on your side and share the numbers, so 
we can see whether we are happy with the patch. (1.x-branch vs metrics_v2 vs 
metrics_v2_experimental_no_concurrentmap)

@revans2 
Could you review this again, including performance perspective?


---


[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2203
  
I'm trying to optimize further since the numbers don't make me happy. 

I've applied pre-populated metrics (as @revans2 suggested) to avoid 
using ConcurrentHashMap.

https://github.com/HeartSaVioR/storm/commits/metrics_v2_experimental_no_concurrentmap
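The idea, as a rough sketch: the set of output streams is known when a task starts, so per-stream counters can be created up front in a plain, effectively immutable HashMap, and the hot path never touches a ConcurrentHashMap. The class and method names below are hypothetical (not the branch's actual code), with `LongAdder` standing in for the codahale `Counter` so the example is self-contained:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of pre-populated per-stream metrics: the map is
// built once at task setup and never mutated afterwards, so lookups on
// the hot path need no synchronization at all.
public class PrepopulatedMetrics {
    private final Map<String, LongAdder> ackedByStream;

    public PrepopulatedMetrics(Set<String> streamIds) {
        Map<String, LongAdder> m = new HashMap<>();
        for (String id : streamIds) {
            m.put(id, new LongAdder());
        }
        // Effectively immutable after construction, hence safe for
        // concurrent reads from multiple threads.
        this.ackedByStream = Collections.unmodifiableMap(m);
    }

    public LongAdder getAcked(String streamId) {
        return ackedByStream.get(streamId); // plain HashMap read
    }
}
```

The trade-off is that all streams must be known up front; a stream appearing later would need a fallback path.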

This commit 
(https://github.com/HeartSaVioR/storm/commit/16bfb84ebab98991fe4a8bd284dae9fe2bfe437d)
 shows the difference from 7947a07.

And here is the performance test result from the same environment (the numbers 
may fluctuate, as we know):

> 1.2.0-SNAPSHOT (9fbe283) - NOTE: ran earlier, just copied from the above result

>> 1st

uptime | acked | acked/sec | failed | 99% | 99.9% | min | max | mean | stddev | user | sys | gc | mem
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
30 | 552,480 | 18,416.00 | 0 | 4,114,612,223 | 4,731,174,911 | 38,928,384 | 5,448,400,895 | 1,954,055,956.00 | 966,653,713.81 | 156,170 | 5,290 | 0 | 686.05
60 | 1,578,480 | 52,616.00 | 0 | 1,695,547,391 | 2,191,523,839 | 6,631,424 | 2,736,783,359 | 144,862,518.23 | 373,282,085.72 | 192,290 | 27,900 | 5,208 | 824.31
91 | 1,505,000 | 48,548.39 | 0 | 36,732,927 | 63,275,007 | 6,549,504 | 92,340,223 | 14,052,122.48 | 5,284,329.78 | 172,290 | 30,260 | 2,240 | 794.81
121 | 1,504,400 | 50,146.67 | 0 | 33,505,279 | 55,902,207 | 6,701,056 | 107,020,287 | 13,687,527.65 | 4,682,294.46 | 169,990 | 30,770 | 2,248 | 818.05
151 | 1,503,640 | 50,121.33 | 0 | 30,801,919 | 42,827,775 | 6,594,560 | 62,357,503 | 13,450,356.75 | 4,049,872.91 | 171,380 | 30,030 | 2,217 | 891.86
181 | 1,504,220 | 50,140.67 | 0 | 31,604,735 | 46,366,719 | 6,414,336 | 72,417,279 | 13,549,466.35 | 4,224,069.48 | 173,460 | 30,120 | 2,214 | 817.73
211 | 1,503,580 | 50,119.33 | 0 | 33,816,575 | 46,858,239 | 6,545,408 | 68,681,727 | 13,536,288.26 | 4,442,740.15 | 169,950 | 30,330 | 2,346 | 759.54
241 | 1,503,360 | 50,112.00 | 0 | 33,947,647 | 46,792,703 | 6,488,064 | 68,812,799 | 13,633,138.42 | 4,510,106.94 | 171,560 | 30,490 | 2,456 | 791.93
271 | 1,052,500 | 35,083.33 | 0 | 28,016,639 | 39,452,671 | 6,586,368 | 58,949,631 | 13,142,065.09 | 3,481,304.94 | 170,870 | 29,260 | 2,113 | 795.18
301 | 1,504,320 | 50,144.00 | 0 | 35,454,975 | 60,129,279 | 6,696,960 | 82,575,359 | 13,781,332.77 | 4,965,559.58 | 171,430 | 30,730 | 2,372 | 826.15

>> 2nd

uptime | acked | acked/sec | failed | 99% | 99.9% | min | max | mean | stddev | user | sys | gc | mem
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
31 | 547,120 | 17,649.03 | 0 | 4,546,625,535 | 5,049,942,015 | 139,984,896 | 5,536,481,279 | 2,700,418,416.88 | 836,528,892.27 | 157,850 | 4,930 | 0 | 845.24
61 | 1,584,360 | 52,812.00 | 0 | 3,078,619,135 | 3,839,885,311 | 6,590,464 | 4,284,481,535 | 248,409,981.64 | 629,496,874.97 | 195,980 | 27,490 | 5,697 | 874.53
91 | 1,504,680 | 50,156.00 | 0 | 35,258,367 | 53,706,751 | 5,910,528 | 91,488,255 | 13,589,168.15 | 4,802,833.49 | 174,400 | 29,550 | 2,183 | 871.71
121 | 1,503,900 | 50,130.00 | 0 | 32,538,623 | 52,789,247 | 6,533,120 | 100,466,687 | 13,482,307.07 | 4,543,434.30 | 186,560 | 32,360 | 2,524 | 802.76
151 | 1,503,860 | 50,128.67 | 0 | 32,735,231 | 52,854,783 | 5,849,088 | 97,976,319 | 13,396,304.62 | 4,506,368.26 | 188,430 | 31,780 | 2,514 | 785.88
181 | 1,503,340 | 50,111.33 | 0 | 29,802,495 | 41,156,607 | 6,737,920 | 63,209,471 | 13,106,089.43 | 3,755,495.86 | 172,070 | 29,540 | 2,151 | 959.27
211 | 1,503,740 | 50,124.67 | 0 | 33,423,359 | 52,625,407 | 6,549,504 | 85,000,191 | 13,426,039.55 | 4,508,334.58 | 187,360 | 32,380 | 2,525 | 869.87
241 | 1,578,440 | 52,614.67 | 0 | 32,129,023 | 50,102,271 | 6,549,504 | 88,276,991 | 13,265,778.42 | 4,265,565.26 | 172,610 | 29,340 | 2,308 | 928.26
271 | 1,503,060 | 50,102.00 | 0 | 30,457,855 | 45,678,591 | 6,565,888 | 73,662,463 | 13,237,281.21 | 4,002,588.03 | 173,180 | 30,550 | 2,291 | 870.36
301 | 1,541,260 | 51,375.33 | 0 | 30,932,991 | 44,498,943 | 6,569,984 | 67,010,559 | 13,223,793.77 | 4,002,853.30 | 172,220 | 29,830 | 2,302 | 1,032.96

>> 3rd

uptime | acked | acked/sec | failed | 99% | 99.9% | min | max | mean | stddev | user | sys | gc | mem
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
30 | 548,680 | 18,289.33 | 0 | 4,353,687,551 | 5,096,079,359 | 6,729,728 | 5,742,002,175 | 1,426,311,406.48 | 1,230,114,275.09 | 232,920 | 14,020 | 4,188 | 999.57
60 | 1,586,360 | 52,878.67 | 0 | 49,643,519 | 85,590,015 | 6,541,312 | 142,082,047 | 15,200,355.35 | 7,698,991.41 | 180,110 | 33,300 | 2,347 | 1,068.73
90 | 1,504,920 | 50,164.00 | 0 | 35,487,743 | 54,067,199 | 6,561,792 | 95,354,879 | 13,937,343.35 | 4,915,338.14 | 177,210 | 33,140 | 2,256 | 929.59
120 | 1,504,980 | 50,166.00 | 0 | 34,340,863 | 54,558,719 | 

[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread ptgoetz
Github user ptgoetz commented on a diff in the pull request:

https://github.com/apache/storm/pull/2203#discussion_r161139613
  
--- Diff: storm-core/src/jvm/org/apache/storm/metrics2/TaskMetrics.java ---
@@ -0,0 +1,81 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.metrics2;
+
+import com.codahale.metrics.Counter;
+import org.apache.storm.task.WorkerTopologyContext;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+
+public class TaskMetrics {
+    ConcurrentMap<String, Counter> ackedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> failedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> emittedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> transferredByStream = new ConcurrentHashMap<>();
+
+    private String topologyId;
+    private String componentId;
+    private Integer taskId;
+    private Integer workerPort;
+
+    public TaskMetrics(WorkerTopologyContext context, String componentId, Integer taskid) {
+        this.topologyId = context.getStormId();
+        this.componentId = componentId;
+        this.taskId = taskid;
+        this.workerPort = context.getThisWorkerPort();
+    }
+
+    public Counter getAcked(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);
+        if (c == null) {
+            c = StormMetricRegistry.counter("acked", this.topologyId, this.componentId, this.taskId, this.workerPort, streamId);
+            this.ackedByStream.put(streamId, c);
+        }
+        return c;
+    }
+
+    public Counter getFailed(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);
--- End diff --

Good catch. Thanks for the fix.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread JessicaLHartog
Github user JessicaLHartog commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 @revans2 Thanks for keeping the Storm on Mesos project in mind. 
We (@erikdw, @srishtyagrawal, and myself) appreciate it!

Right now, Storm on Mesos doesn't work for Storm versions 1.1.x and above 
because we rely on the slot-centric scheduling instead of the now 
supervisor-centric scheduling (for more details see 
[STORM-2126](https://issues.apache.org/jira/browse/STORM-2126)). We talked to 
@revans2 about this in our 
issue[#222](https://github.com/mesos/storm/issues/222#issuecomment-352514556)...
 so that's a bit of context that may be necessary to understand the problem a 
little better.

With respect to the ask to not further break Storm on Mesos we think it 
would be best for us to be able to specify (at runtime) to the supervisor which 
port it needs to listen on for these heartbeat messages. The way we see it, 
while the Supervisor being able to specify a range for the heartbeat port 
sounds like it may work, the Supervisor should _also_ be able to accept an 
assigned heartbeat port. Namely, once we adapt to the changes suggested 
[here](https://github.com/mesos/storm/issues/222#issuecomment-352530608), we 
can start each Supervisor using one of the ports Mesos offers to us.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR @revans2
Based on your comments, I think there are mainly 3 TODOs for this patch:
1. Supervisor containerization improvements, like supporting multiple 
supervisor instances on one machine.
2. Assignment security should be guaranteed.
3. Backward compatibility for old-version Storm workers.

For TODO 1, I can make supervisor thrift ports picked from a range on a 
machine node, and make nimbus aware of the port info from SupervisorHeartbeats. 
I will also pass the port as a startup argument so that workers will know their 
parent supervisor's port.

For TODO 2, I understand I should wait for @revans2's token authentication, 
right?

For TODO 3, I don't know how much work it would take; if @HeartSaVioR can 
help contribute it, I would really appreciate it.

Also, the supervisor's local-disk worker heartbeats could actually be removed 
as well, and it's easy to achieve in this patch. Should I also do this?


---


[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/storm/pull/2203#discussion_r161135637
  
--- Diff: storm-core/src/jvm/org/apache/storm/metrics2/TaskMetrics.java ---
@@ -0,0 +1,81 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.metrics2;
+
+import com.codahale.metrics.Counter;
+import org.apache.storm.task.WorkerTopologyContext;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+
+public class TaskMetrics {
+    ConcurrentMap<String, Counter> ackedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> failedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> emittedByStream = new ConcurrentHashMap<>();
+    ConcurrentMap<String, Counter> transferredByStream = new ConcurrentHashMap<>();
+
+    private String topologyId;
+    private String componentId;
+    private Integer taskId;
+    private Integer workerPort;
+
+    public TaskMetrics(WorkerTopologyContext context, String componentId, Integer taskid) {
+        this.topologyId = context.getStormId();
+        this.componentId = componentId;
+        this.taskId = taskid;
+        this.workerPort = context.getThisWorkerPort();
+    }
+
+    public Counter getAcked(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);
+        if (c == null) {
+            c = StormMetricRegistry.counter("acked", this.topologyId, this.componentId, this.taskId, this.workerPort, streamId);
+            this.ackedByStream.put(streamId, c);
+        }
+        return c;
+    }
+
+    public Counter getFailed(String streamId) {
+        Counter c = this.ackedByStream.get(streamId);
--- End diff --

missed a spot: it should be `failedByStream`
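As an aside, the quoted get-then-put pattern is also racy: two threads can both see null and register two counters for the same stream. A race-free alternative (a hypothetical sketch, not the PR's code; `LongAdder` stands in for the codahale `Counter` so the example is self-contained) is `computeIfAbsent`:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: computeIfAbsent is atomic per key, so concurrent
// callers never create two counters for the same stream, unlike the
// get-then-put pattern in the quoted diff.
public class StreamCounters {
    private final ConcurrentMap<String, LongAdder> byStream = new ConcurrentHashMap<>();

    public LongAdder forStream(String streamId) {
        // The mapping function runs at most once per key.
        return byStream.computeIfAbsent(streamId, id -> new LongAdder());
    }
}
```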


---


[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2203
  
Yes, my report mainly shows that the critical perf issue Bobby observed has 
been resolved, but we may still need to experiment more if we want to go with a 
detailed comparison.


---


[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2203
  
@ptgoetz, I suggest measuring the perf impact with ACKers=0 as well (with 
workers=1). Having ACKing enabled, or inter-worker communication going on, hides 
many perf issues.

In addition to TVL, it is important to measure perf with other topos like 
[ConstSpoutNullBoltTopo](https://github.com/apache/storm/blob/master/examples/storm-perf/src/main/java/org/apache/storm/perf/ConstSpoutNullBoltTopo.java)
 and 
[ConstSpoutIdBoltNullBoltTopo](https://github.com/apache/storm/blob/master/examples/storm-perf/src/main/java/org/apache/storm/perf/ConstSpoutIdBoltNullBoltTopo.java).

The ConstSpout* topos can expose perf issues that TVL might not... and vice 
versa.


---


[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2203
  
I'm trying to run the performance test on Ubuntu 17.10, kernel 
4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:14:41 UTC 2018.

I applied the latest security patches today, which I guess may include the 
patches for Meltdown and Spectre. The machine may be affected less than others 
since its CPU is a Ryzen, as far as I followed the discussion on the linux 
kernel list (only for Meltdown... I'm not subscribed to the mailing list, but I 
read the relevant thread since it was interesting).

Btw, both builds will run in the same environment, so there shouldn't be much 
difference between the two sets of results, I think.


---


[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread ptgoetz
Github user ptgoetz commented on the issue:

https://github.com/apache/storm/pull/2203
  
When reporting benchmark results, we should include the OS patch level. The 
recent wave of patches will likely mess with benchmarks.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR I just posted a high-level design for delegation tokens 
[here](https://issues.apache.org/jira/browse/STORM-2898?focusedCommentId=16323160=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16323160)

Please take a look.  If it looks good I will start throwing together a 
patch based on it.  Then we can get into some of the specifics associated with 
it.


---


[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread ptgoetz
Github user ptgoetz commented on a diff in the pull request:

https://github.com/apache/storm/pull/2203#discussion_r161087710
  
--- Diff: storm-core/src/clj/org/apache/storm/daemon/executor.clj ---
@@ -265,6 +265,7 @@
  :stats (mk-executor-stats <> (sampling-rate storm-conf))
  :interval->task->metric-registry (HashMap.)
  :task->component (:task->component worker)
+ :task-metrics (TaskMetrics/taskMetricsMap (first task-ids) (last task-ids) worker-context component-id)
--- End diff --

@HeartSaVioR Nice, thanks!


---


[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2203
  
It would need another performance test, but besides that I'm +1 on the change.


---


[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/storm/pull/2203#discussion_r161086016
  
--- Diff: storm-core/src/clj/org/apache/storm/daemon/executor.clj ---
@@ -265,6 +265,7 @@
  :stats (mk-executor-stats <> (sampling-rate storm-conf))
  :interval->task->metric-registry (HashMap.)
  :task->component (:task->component worker)
+ :task-metrics (TaskMetrics/taskMetricsMap (first task-ids) (last task-ids) worker-context component-id)
--- End diff --

Never mind. Just fixed and pushed new commit.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
Thanks for double-checking. I think you've also suggested removing the 
behavior of leveraging disk to communicate between worker and supervisor (so 
that the way of communication is consistent), but you don't think it's 
mandatory. Got it.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR 

Perhaps we should have a call on the Metrics V2 work and what is its status 
for storm 2.0 because I don't totally know myself.

What I think is mandatory for this patch:

1) Wait for the delegation token work to go in so we can authenticate 
connections between the workers and nimbus/supervisors.
2) Add some form of authorization to the newly added APIs:
   - getSupervisorAssignments should only be allowed from a supervisor.
   - sendSupervisorWorkerHeartbeats should only be allowed from a supervisor.
   - sendSupervisorWorkerHeartbeat should verify that it came from the owner 
of the topology the heartbeat is for.
   - sendSupervisorAssignments needs to verify that it came from nimbus.
   - getLocalAssignmentForStorm needs to verify that it came from the owner 
of the topology.
3) The supervisor needs to pick a port from a configured allowable range of 
ports and get that information to everyone who is going to need it.

If someone wants to drop the old heartbeat mechanisms for talking to nimbus 
and the supervisors that is fine with me.  However, if we do drop it I really 
would love to have a way to maintain backwards compatibility because otherwise 
I will have to add it back in myself.

If we don't drop it now, I would like to see a follow-on JIRA to remove it, 
but for that to work as part of a rolling upgrade we would need to support both 
mechanisms at the same time.


---


[GitHub] storm pull request #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/storm/pull/2203#discussion_r161080512
  
--- Diff: storm-core/src/clj/org/apache/storm/daemon/executor.clj ---
@@ -265,6 +265,7 @@
  :stats (mk-executor-stats <> (sampling-rate storm-conf))
  :interval->task->metric-registry (HashMap.)
  :task->component (:task->component worker)
+ :task-metrics (TaskMetrics/taskMetricsMap (first task-ids) (last task-ids) worker-context component-id)
--- End diff --

We can create the TaskMetrics instance in task-data and even get rid of the 
map lookup. I guess TaskMetrics doesn't have any executor-specific metrics.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
Thanks again for your detailed explanation, and for taking up the hard 
security work that was blocking this patch.

I see you are thinking about deprecating pacemaker after addressing the 
metrics load in ZK; in that case you are totally right that we may want to avoid 
making it the default. At least I think you are now OK to go ahead with this 
patch plus some follow-up work, and the metrics load in ZK would be removed once 
we finish all the necessary work and merge this in, which is OK with me.

TBH I'm still unsure about the migration plan from Metrics V1 to V2 
(including built-in metrics), since V2 also opens the possibility of 
transferring metrics a different way (maybe via a metrics reporter), but that 
might be off topic for this patch, and we could discuss it after merging Metrics 
V2 to both 1.x and master.

Please double-check my understanding. If it is right, you're suggesting the 
following changes:

1. (a little unsure whether this is also mandatory for non-container setups) 
Modify the supervisor to pick a free port (with a configurable range) for its 
thrift server instead of using a configured static value.
2. Change the supervisor to include the picked port in its heartbeat, and 
change nimbus to use that port information instead of a configured static value 
when communicating via thrift RPC.
3. Change the worker to report its heartbeat to the supervisor via RPC, not 
via local state (disk), which can be problematic for storm-mesos or other 
container solutions.
4. Add a heartbeat fall-back mechanism for old-version workers (reading from 
ZK).

@danny0405 
Could you check the comments from @revans2 and provide input? Please also let 
me know if you would rather not deal with supporting old-version workers; I'll 
see if I can address it on top of your patch.


---


[GitHub] storm issue #2203: STORM-2153: New Metrics Reporting API

2018-01-11 Thread ptgoetz
Github user ptgoetz commented on the issue:

https://github.com/apache/storm/pull/2203
  
@revans2 Could you take another look when you have a chance? 


---


[GitHub] storm issue #2492: [STORM-2883] Fix storm-kafka trident API bug

2018-01-11 Thread srdo
Github user srdo commented on the issue:

https://github.com/apache/storm/pull/2492
  
@dujiashu Happy to help.

If you agree that we shouldn't make this change, would you mind closing 
this PR? Thanks.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
I was also just thinking that it would be really great if, in the future, we 
could drop even more access from the workers to zookeeper.  If assignments are 
gone and heartbeats go away, all that is left is credentials, backpressure, and 
possibly a few other small things.  But backpressure is going to go away too, so 
that would just be some small odds and ends.  If we could add a return 
object to the heartbeats, we could in the future add support for fetching 
changed credentials, etc., and not have to worry about zookeeper except in 
client-side code.

So all I am asking is that, for the heartbeat thrift calls, we make the 
result an empty object and not void.  That way we have a clean way to expand it 
in the future.
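In Java terms the suggestion might look like this (a hypothetical sketch; the type and field names are made up, not actual thrift-generated code): returning an empty but extensible object instead of void means new optional fields can be added later without changing the method signature.

```java
// Hypothetical sketch of an extensible heartbeat response. Today it
// carries nothing; later, optional fields (e.g. refreshed credentials)
// can be added without breaking existing callers, which a plain void
// return would not allow.
public class SupervisorWorkerHeartbeatResponse {
    // Example future field, named here purely for illustration;
    // null means "nothing to report".
    private String refreshedCredentials;

    public String getRefreshedCredentials() {
        return refreshedCredentials;
    }

    public void setRefreshedCredentials(String creds) {
        this.refreshedCredentials = creds;
    }
}
```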


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
So I just spoke with management here and did some quick back-of-the-envelope 
planning, and I think to unblock this we should support delegation-token-like 
functionality in storm.  This could potentially make life a lot simpler for all 
kinds of things, here and with DRPC, etc.

I am willing to commit my team to make this happen, so I will file some 
JIRAs and try to put together a plan/architecture that hopefully others can 
review.

Once we have delegation tokens working the only real issue is going to be 
containerized supervisors.  I think we can support that by having the 
supervisor pick a free port in a configured range, and then include that port 
in its heartbeat to nimbus.  We would also need a way to tell the workers what 
port to use to communicate with the supervisor.
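A rough sketch of the port-picking part (a hypothetical helper, not part of the patch): try to bind each port in the configured range and advertise the first one that succeeds. A real implementation would keep the socket open rather than close and rebind, to avoid a race with other processes grabbing the port in between.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch: scan a configured port range and return the first
// port that can actually be bound. The supervisor would then include this
// port in its heartbeat to nimbus.
public class PortPicker {
    public static int pickFreePort(int lo, int hi) {
        for (int port = lo; port <= hi; port++) {
            try (ServerSocket s = new ServerSocket(port)) {
                return port; // bind succeeded, so the port was free
            } catch (IOException busy) {
                // port in use; try the next one
            }
        }
        throw new IllegalStateException("no free port in [" + lo + ", " + hi + "]");
    }
}
```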

For me I really would like to be able to maintain the ability to run 0.10.x 
and 1.x topologies under a 2.x cluster.  I think this would only require still 
checking for heartbeats from zookeeper before scheduling which I don't think 
has been removed yet, so I am hopeful that it will work with the current patch.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR

I didn't really want to make pacemaker the default because not everyone 
needs it, standing up yet another server is a pain for customers, and I was 
hoping it would be temporary.  Once we have the metrics out of the heartbeat, 
which yes I hope to do conditionally in 2.x and remove completely in 3.x, we 
can leave simple heartbeats in ZK, or move to pushing the heartbeats directly to 
nimbus; either is fine with me.  Then there would be no need for pacemaker 
and it could be deprecated.

Hopefully I answered most of your other questions in my response to 
@danny0405 
[above](https://github.com/apache/storm/pull/2433#issuecomment-356965437)


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 

1.  The stability issues we saw were that when a pacemaker node goes down, it 
caused exceptions in the clients that we were not handling properly, resulting 
in workers restarting.  We have not seen any issues with heartbeat packets being 
discarded.  If this does not cover the issues you were seeing, I really would 
love to have a JIRA so I can fix it.

We are not running with 2.x, or even 1.x, in production, so I cannot say if 
there is some oddness happening with what we have pushed back, or perhaps its 
interactions with HA.  We are on 0.10.2; we pulled back a lot from 1.x.  
This is why we really want to get to 2.x, so we can be aligned with the 
community again and hopefully not have these kinds of issues.  There may be 
bugs we don't realize right now.

2. With pacemaker HA, if you have 2+ pacemaker servers, each of the clients 
will randomly select one of the servers to send heartbeats to.  If the one they 
try to write to is down, at the beginning or in the middle, the heartbeats 
should then start going to a different, random, server.  This should hopefully 
keep the load even between the pacemaker servers.  Nimbus is supposed to work 
all of this out by reading from all the servers, and if it finds more than one 
heartbeat for a worker it will pick the one that has the newest timestamp in 
it.  This does not scale well on the nimbus side, and can take more than 2 
minutes to download all of the heartbeats, so we have plans to parallelize the 
download.
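The newest-timestamp resolution described here could be sketched like this (hypothetical types; the real nimbus/pacemaker code differs):

```java
import java.util.List;

// Hypothetical sketch: when nimbus finds a worker's heartbeat on more
// than one pacemaker server, it keeps the one with the newest timestamp.
public class HeartbeatResolver {
    public static class Heartbeat {
        public final long timestampMs;

        public Heartbeat(long timestampMs) {
            this.timestampMs = timestampMs;
        }
    }

    // Linear scan keeping the heartbeat with the largest timestamp.
    public static Heartbeat newest(List<Heartbeat> candidates) {
        Heartbeat best = null;
        for (Heartbeat hb : candidates) {
            if (best == null || hb.timestampMs > best.timestampMs) {
                best = hb;
            }
        }
        return best;
    }
}
```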

The metrics don't go to the supervisor, as it does not need/use them 
currently.  It only cares whether the worker is up and still alive, so it knows 
if it needs to restart it.

3. I totally believe you that this can support a large cluster.  Like I 
said this is a much better solution long term, and I would love to go this 
route.  We just need to fix the security issues and find a way to support 
containerized supervisors for me to give it a +1. Both should be doable.

4. There is no security between the workers and pacemaker.  There is 
security between nimbus and pacemaker.  This means that only nimbus can see the 
heartbeats.  The worst you can do with faking heartbeats is confuse someone 
with bad metrics (not ideal) or trick nimbus into thinking a worker is still 
alive when it is not, bad but not horrible.  It is the assignment portion that 
is scary to me, because it says what to run.  If we pull the assignment portion 
out I would be OK with that.  Although it would be best to truly fix it because 
we don't have a way to selectively turn off authorization in thrift so to make 
that work we would need a separate thrift server on nimbus, which I would 
rather not do.

I would love to see the ability to do delegation tokens in storm for 
authentication.  This is no small task.  It would take a lot of work, 
especially with HA, which is why I haven't done it.


---


Call for Presentations FOSS Backstage open

2018-01-11 Thread Isabel Drost-Fromm
Hi,

As announced on Berlin Buzzwords we (that is Isabel Drost-Fromm, Stefan
Rudnitzki as well as the eventing team over at newthinking communications GmbH)
are working on a new conference in summer in Berlin. The name of this new
conference will be "FOSS Backstage". Backstage comprises all things
FOSS governance, open collaboration and how to build and manage communities
within the open source space.


Submission URL: https://foss-backstage.de/call-papers 

The event will comprise presentations on all things FOSS governance,
decentralised decision making, open collaboration. We invite you to submit talks
on the topics: FOSS project governance, collaboration, community management.
Asynchronous/ decentralised decision making.  Vendor neutrality in FOSS,
sustainable FOSS, cross team collaboration.  Dealing with poisonous people.
Project growth and hand-over. Trademarks. Strategic licensing.  While it's
primarily targeted at contributions from FOSS people, we would love to also
learn more on how typical FOSS collaboration models work well within
enterprises. Closely related topics not explicitly listed above are welcome. 

Important Dates (all dates in GMT +2)

Submission deadline: February 18th, 2018.

Conference: June, 13th/14th, 2018


High quality talks are called for, ranging from principles to practice. We are
looking for real world case studies, background on the social architecture of
specific projects and a deep dive into cross community collaboration.
Acceptance notifications will be sent out soon after the submission deadline.
Please include your name, bio and email, the title of the talk, and a brief 
abstract in English.

We have drafted the submission form to allow for regular talks, each 45 min in
length. However you are free to submit your own ideas on how to support the
event: If you would like to take our attendees out to show them your favourite
bar in Berlin, please submit this offer through the CfP form.  If you are
interested in sponsoring the event (e.g. we would be happy to provide videos
after the event, free drinks for attendees as well as an after-show party),
please contact us.

Schedule and further updates on the event will be published soon on the event
web page.

Please re-distribute this CfP to people who might be interested.

 Contact us at:
 newthinking communications GmbH
 Schoenhauser Allee 6/7
 10119 Berlin, Germany
 i...@foss-backstage.de


Looking forward to meeting you all in person in summer :) I would love to see 
all those
tracks filled with lots of valuable talks on the Apache Way, on how we work,
on how the incubator works, on how being a 501(c3) influences how people get 
involved
and projects are being run, on how being a member run organisation is different,
on merit for life, on growing communities, on things gone great - and things
gone entirely wrong in the ASF's history, on how to interact with Apache
projects as a corporation and everything else you can think of.


Isabel


-- 
Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most 
likely involving some kind of mobile connection only.)