[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2018-01-04 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
This PR page has become unusable. Most of the time it doesn't load, and when it does load it is too slow to use, even for just typing a comment. Let's stop using this and switch to the [new PR](https://github.com/apache/storm/pull/2502) (thanks for the suggestion @HeartSaVioR). There are a few recent unaddressed comments on this PR that I will track and address in the new PR.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2018-01-04 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
Sorry it has taken me so long to get back to this.  I did a quick pass 
through the code, but I need to spend some time to go more in depth.  I expect 
to be doing this this afternoon and tomorrow.

My biggest concern right now is around up-merging and getting the unit 
tests to pass.  I don't see a reason to have JCQueueTest fail 100% of the time.

I am also concerned that a lot of other unit tests fail 100% of the time 
with this patch too.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-11-06 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 Thanks very much for the offer to help; I think it might be useful for getting past the issue that is blocking this.

Just updated the PR with these two key changes:

- Discovered that workers need to communicate their internal back pressure situation to other workers, so that they stop/resume sending messages to components experiencing BP.
- Bolts now also use non-blocking emits, so that in case of BP they can continue to process metrics ticks.
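
To illustrate the second point, here is a minimal, hypothetical sketch of a non-blocking emit with an overflow fallback (not the actual JCQueue/executor code in this PR):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only: an emit that never blocks the calling thread. When the downstream
// queue is full (back pressure), the tuple is parked in a local overflow buffer so
// the thread stays free to handle things like metrics ticks.
class NonBlockingEmitter<T> {
    private final BlockingQueue<T> downstream = new ArrayBlockingQueue<>(1024);
    private final Queue<T> overflow = new ArrayDeque<>();

    boolean tryEmit(T tuple) {
        // Drain previously parked tuples first to preserve ordering.
        while (!overflow.isEmpty()) {
            if (!downstream.offer(overflow.peek())) {
                overflow.add(tuple);      // still backed up; park the new tuple too
                return false;
            }
            overflow.poll();
        }
        if (downstream.offer(tuple)) {
            return true;                  // fast path: downstream queue had room
        }
        overflow.add(tuple);              // back pressure: park instead of blocking
        return false;
    }
}
```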


**Current Status:**
- Some new metrics needed for monitoring overflow have to be introduced. Also, some minor TODOs mentioned in the code need to be addressed.
- During scale testing, I noticed an inexplicable performance drop under certain multi-worker scenarios. The simplest way to reproduce the issue appears to be a 2-worker setup with spouts and bolts distributed so that messaging occurs in both directions. I don't have a good grip on the issue yet; unfortunately, for the last 3 weeks I had to switch my attention to some other pressing release-centric tasks at work.

It may be useful to sync up with you offline soon and see if looking into the issue together helps.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-31 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@HeartSaVioR: Agree with your observations about being able to process ACKs under BP. My latest update allows that in ACK mode; I have updated section 3.6 in the design doc with info on that. It has one limitation: in ACK=0 mode, it won't try to process metrics under BP. I will try to fix that in this PR if possible (or defer it to a follow-up PR if that's ok).
First I would like to fix the one remaining multi-worker issue and finish collecting the numbers for the test plan to share with the community, as that also helps me identify any important problems.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-21 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik I just created a new pull request to master for an updated 
version of ThroughputVsLatency and I added in the test tools that I created for 
being able to capture and simulate topologies.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-19 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 Was max.spout.pending=null for 2306 in the last run? If so, it is worth setting it to something like 500/1k/thereabouts for multi-worker mode; that should fix the latency and help throughput. Tuning the bolt wait strategy should help with CPU utilization.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-17 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2. Thanks, I have fixed that issue in the latest commit. Also added 
a perf tuning document 
([Performance.md](https://github.com/roshannaik/storm/blob/4704245e44b0868cfa5237f97a1fca0686df11a9/docs/Performance.md))
 


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-17 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik I found another bug in the code.  For batching you are deciding to turn flushing on or off based on the system conf, not the topology conf.  This means that I, as a topology owner, can turn on batching, but flushing will still be disabled. You need to use the topology conf to decide whether batching is enabled.
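
Roughly, the fix amounts to something like the following sketch (the class and method names here are hypothetical, not the actual ones in this PR):

```java
import java.util.Map;

class BatchingSettings {
    // Decide whether the flusher is needed from the per-topology conf (which
    // carries the topology owner's overrides), not from the daemon/system conf.
    static boolean flushingNeeded(Map<String, Object> topoConf) {
        Number batchSize = (Number) topoConf.getOrDefault("topology.producer.batch.size", 1);
        // Batching implies periodic flushing; a batch size of 1 means no flusher is needed.
        return batchSize.intValue() > 1;
    }
}
```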


---


Re: [GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-16 Thread Alexandre Vermeerbergen
Hello,

This tool seems like a great idea for tuning topologies.

I'll see if we can use it for ours; we use lots of Kafka (old spout), Redis, OpenTSDB and HBase (with our own bolts for the latter) around our topologies.

Best regards,
Alexandre Vermeerbergen



---

[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-16 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik I have some new performance numbers.  These are not final, as I 
have not done any tuning yet.  But let me explain the test.  I really wanted to 
get a handle on how this would impact me, yes I am very selfish that way.  I 
have been wanting for a long time to be able to write a tool that could 
simulate this, but it was never high enough up on my priority list to make it 
happen until now.  Thanks for the nudge to do this by the way.

So I wrote a 
[tool](https://github.com/revans2/incubator-storm/tree/STORM-2306-with-loadgen/examples/storm-starter/src/jvm/org/apache/storm/starter/loadgen)
 and will do a formal pull request after some more features, cleanup, and 
moving it into its own stand-alone package.

The tool will capture a snapshot of topologies running on a cluster.  It 
grabs the bolts, spouts, and all of the streams connecting them along with 
metrics for those streams.  The metrics include the output rate and the 
process/execute latency for each incoming stream.

I can then simulate the throughput for each stream and the latency for each 
bolt (I use a Gaussian distribution that matches the measured distribution).
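
(For illustration only, the per-bolt latency simulation could be as simple as sampling a Gaussian fitted to the measured mean and standard deviation; this sketch is not the actual loadgen code:)

```java
import java.util.Random;

// Sketch: emulate a bolt's execute latency by sampling a Gaussian with the
// measured mean/stddev captured from the real topology.
class LatencySimulator {
    private final Random rand = new Random();
    private final double meanMs;
    private final double stddevMs;

    LatencySimulator(double meanMs, double stddevMs) {
        this.meanMs = meanMs;
        this.stddevMs = stddevMs;
    }

    void simulateExecute() throws InterruptedException {
        long sleepMs = Math.max(0, Math.round(rand.nextGaussian() * stddevMs + meanMs));
        Thread.sleep(sleepMs);
    }
}
```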

For this particular test I captured 104 topologies from a single production cluster of 600+ nodes with 19k+ CPU cores and about 70 TB of memory.  The nodes are heterogeneous, and the cluster plus all of the topologies are tuned for the Resource Aware Scheduler. (I really want to release these captured metrics, but I need to fight with legal to get approval first.)  It is a mix of trident and regular storm topologies, some with very high throughput (850k/sec in the highest case) and some with very low throughput.  I then set up another, much smaller cluster of 9 nodes (1 for nimbus + ZK, 8 for supervisors).

```
Processors: 2 x Xeon E5-2430 2.20GHz, 7.2GT QPI (HT enabled, 12 cores, 24 
threads) - Sandy Bridge-EP C2, 64-bit, 6-core, 32nm, L3: 15MB
Memory: 46.7GB / 48GB 1600MHz DDR3 == 6 x 8GB - 8GB PC3-12800 
Samsung DDR3-1600 ECC Registered CL11 4Rx4
RHEL 6.7
java 1.8.0_131-b11
```

This was a non RAS cluster.  The only settings I changed for the cluster 
were the location of nimbus and zookeeper.

I then replayed each of the topologies one at a time with default settings 
+ `-c topology.worker.max.heap.size.mb=5500 -c 
nimbus.thrift.max_buffer_size=10485760`.  (Takes almost 9 hours because each 
test runs for about 5 mins to be sure that the cluster has stabilized)

The first changed setting is because I had a few issues with nimbus 
rejecting some topologies because they had some RAS settings for a very large 
cluster and the default settings in storm did not allow them to run.  The 
second one is because on a few very large topologies the client was rejecting 
results from nimbus because the thrift metrics were too large (1MB is the 
default cutoff).

Of these 104 topologies 95 of them ran on 2.x-SNAPSHOT without any real 
issues and were able to keep up with the throughput.  For STORM-2306 94 of them 
ran without any issues and were able to keep up with the throughput.  I have 
not debugged yet the one topology that would just stop processing on STORM-2306 
but was running fine on 2.x-SNAPSHOT.

From those I measured the latency and CPU/Memory used (just like with 
ThroughputVsLatency).  My goal was to see how these changed from one version of 
storm to another and get an idea of how much pain out of the box it would be 
for me and my customers to start moving to it, again my own selfishness here in 
action.  As I said I have not tuned any of the topologies yet.  I plan on doing 
that next to see if I can improve some of the measurements.  Here are my 
results.  As a note for COST I used 1 core equal to 2 GB of memory because that 
is the ratio we bought hardware at originally (but not so much any more).


  | CPU Summary | Memory Summary | Cost Summary | Avg Latency Diff ms (weighted by throughput)
-- | -- | -- | -- | --
Measured Diff | about 250 more needed | about 27 GB less needed | 235 more needed | 5.22 ms more per tuple
In Cluster Total | about 19k cores | about 70TB | about 55k |
Percent diff of cluster total | 1.28% | -0.04% | 0.43% |
In Cluster Assigned | about 13k cores | about 50TB | about 40k |
Percent of Assigned | 1.92% | -0.05% | 0.61% |
Amount Per Node | 47 | about 200GB |   |
Change (Nodes) | 6 | -0 |   |

As I said I have not tuned anything yet.  I need to because the assumption 
here is that users will need to tune their topologies for higher throughput 
jobs.  I don't like that because it adds to the pain for my users, but I will 
try to quantify that in some way.

As it stands now things don't look good for STORM-2306 (and I know running 
a test that 

[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-14 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 For STORM-2306, somehow the latest commits out there are not showing up in this PR. While I figure out the issue, please check out the 2306 code directly from https://github.com/roshannaik/storm/tree/STORM-2306m . I will be using commit# 2506c6e from there.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-14 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik yes we can use aaebc3b as the base for tests


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-11 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 
- I suspect we are talking about different things w.r.t. thread safety in the emit path, and consequently I am missing what you are trying to communicate. There is no intention to require any synchronization for emits from independent bolt instances of the same or different types. I may have to sync up with you on the problem separately. 

- Yes, I see the process latency metric stuck at -1ms issue. Let me look into this. 

- To avoid chasing a moving target and to be on the same page for the test runs: are you ok with using master version# aaebc3b? If you have a different version of master in mind, I am happy to use that instead. 

- Also, I plan to fix these settings in storm.yaml for both master & 2306. Other settings may vary depending on topology.
```
worker.heap.memory.mb : 4096
topology.disable.loadaware.messaging: true
```




---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
Also I wanted to add that I found another bug.  The process latency metric is always right around -1ms.  Not sure what would be causing this.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik The output collector is wrapped separately by each bolt so it 
will have no effect.


https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/topology/BasicBoltExecutor.java#L44
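
(A simplified, hypothetical sketch of why locking the wrapper does not help; these are not the real Storm classes:)

```java
// Each bolt gets its own wrapper around the shared underlying collector, so
// synchronizing on the wrapper object provides no mutual exclusion across bolts.
class SharedCollector {
    void emit(Object tuple) { /* shared, not thread safe */ }
}

class WrappedCollector {
    private final SharedCollector delegate;

    WrappedCollector(SharedCollector delegate) {
        this.delegate = delegate;
    }

    void emit(Object tuple) {
        synchronized (this) {            // locks the wrapper, not the shared delegate
            delegate.emit(tuple);
        }
    }
}
// Two bolts holding two different WrappedCollector instances can still reach
// delegate.emit() concurrently, which is why the wrapping "has no effect" here.
```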

I am not saying that we cannot make the change.  I just want us all to 
understand the consequences of these changes.  Removing the synchronization 
from the shuffle grouping  will make it so this is no longer a rolling upgrade. 
 To me the cost just does not offset the reward for doing so.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@harshach all of those tests look good to me.  I just want to be sure that 
we do them in both a multi-worker and multi-node setup.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-10 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 If two threads spun up by the same bolt instance put a `synchronized(collector)` around collector.emit(), there will be only one emit active at once. So wrapping emits that way should not be an issue.
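
For example, a hypothetical bolt along these lines (a sketch assuming the Storm 2.x bolt API, not code from this PR):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch: a bolt that spins up its own side thread and serializes all emits by
// locking on the single collector reference shared within this bolt instance.
public class SideThreadBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        new Thread(() -> {
            synchronized (this.collector) {          // only one emit active at once
                this.collector.emit(new Values("side-thread-tuple"));
            }
        }).start();
    }

    @Override
    public void execute(Tuple input) {
        synchronized (collector) {                   // executor-thread emits take the same lock
            collector.emit(new Values(input.getValue(0)));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
    }
}
```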

Thanks for that fix.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-09 Thread harshach
Github user harshach commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 Here are the initial test cases we are looking to run against master & STORM-2306. Let us know if you would like to add any further cases:

https://docs.google.com/document/d/1trXXK9IfQ1c_Ptq4DoglNhkTvnJK01uoiUutmhTO6CA/edit?usp=sharing


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-09 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik `synchronized(collector)` also does not work in all cases 
because we wrap the collector in many cases. 
https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/topology/BasicOutputCollector.java
 is an example of this.

I also put up a pull request to your repo that fixes a number of tests along with the __system bolt issue (https://github.com/roshannaik/storm/pull/3), and I will definitely share anything else I come up with.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-09 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
- I am currently working on fixing the UT failures.
- **Emit thread safety**: If two bolts are running on the same executor thread, then there won't be any concurrency. Bolts wishing to do safe concurrent emits should be able to do that easily within `synchronized(collector){...}`.
- I had previously identified testing with more topologies as a task. @harshach has also offered to help create a test plan and do some testing. He may be able to share that plan document so folks can chime in on whether it looks good. That would be in addition to whatever testing you are doing independently, @revans2. If you wish, you can add your own tests to that document as well.

While such testing is very useful to uncover issues and stabilize things, it is possible that some of the issues discovered are false alarms... so it would be nice to give me or someone else a chance to have a second look before getting alarmed.

If you make any fixes, please do share them with others here.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-08 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
I am really nervous about this patch in general and I am still a -1 on it.

For me to remove my -1, at a minimum all of the unit tests have to pass and all critical functionality has to be here (e.g., the system bolt metrics being reported properly). Beyond that I really want to see some benchmarks that are closer to the real world, so I can weigh the cost vs the benefits of this change in an analytical way.

Right now my gut is telling me that this is too much change all at once for 
a performance gain in a single micro benchmark that I believe will have a 
minimal impact at best on real topologies, and at a cost that I am not sure I 
or my customers are willing to pay.  

I would love to be proven wrong and see that everything just gets better 
with this patch, and the differences in how topologies behave will be small 
enough to help our customers adjust easily.  But I am very skeptical.

I plan on spending the next few weeks fixing this patch (at least enough 
that I can get system metrics out of it and shuffle does not fail all of the 
unit tests), and then try and run what I think is a somewhat representative 
sample of topologies on a cluster to see what the numbers actually are.  I will 
post my results here when I am done.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-08 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik removing thread safety from the emits is a bug and breaks 
backwards compatibility.

In the past the shuffle grouping was the only one that was not thread safe. 
 Because of this we documented that users should put their own synchronization 
around calls to emit, but this proved to be incorrect (or more accurately 
insufficient).  The reason is that we can have multiple instances of a bolt or 
spout sharing a single executor thread.  When this happens synchronizing within 
a single bolt

```
synchronized(this) {
   collector.emit(...);
}
```

is not enough to ensure safety, because `this` is different for each of the bolts.  It is non-intuitive, but the only way to really ensure it is safe would be to synchronize on a static singleton lock, which is also bad because it synchronizes all of the instances of that bolt in the worker, not just the ones that would be under contention.  By the way, the ShellBolt emits on a separate thread (just one thread), so it was written assuming that it would be safe without synchronization; under this proposal we would have to change that.
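
A minimal sketch of the static-lock workaround described above (illustrative only, not something that exists in Storm or in this PR):

```java
import java.util.List;
import org.apache.storm.task.OutputCollector;

// Because several instances of the same bolt can share one executor thread (and
// one worker), only a lock shared by every instance in the worker JVM, i.e. a
// static one, really guarantees safety, and it over-synchronizes by design.
public class StaticLockEmitHelper {
    private static final Object EMIT_LOCK = new Object();   // one lock per worker JVM

    public static void emitSafely(OutputCollector collector, List<Object> tuple) {
        synchronized (EMIT_LOCK) {       // serializes emits across ALL instances using this helper
            collector.emit(tuple);
        }
    }
}
```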

I made the shuffle grouping thread safe to fix some of these issues, and to 
make it intuitive for our users too.




---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-08 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
I second @roshannaik. We're exposing as many possibilities as possible so that users can compose their topologies flexibly, but as a trade-off we are giving up chances to improve performance. 

Yes, it is clearly a breaking change, so I would like to hear voices from the user side, but if we can provide a workaround (like a thread-safe output collector wrapper, regardless of its performance) I think we can decide and go on.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-08 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
**Thread Safety and Bolt/Spout Emits:** @HeartSaVioR & @satishd brought up this topic, which is worth discussing in the interest of communicating clearly to users. 

The basic issue is that some bolts/spouts might spin up threads internally and perform concurrent emits from those threads. For such cases, we need to be clear about whether we guarantee that the emit path is thread safe or not. 

**My thoughts:** We should not guarantee that the OutputCollector path is thread safe. If users choose to spin up threads in their spouts/bolts, they need to handle synchronization as well. The advantage of doing so is that we get a much faster path for the majority of spouts and bolts which don't do this kind of thing.

On the other hand, if we support unsynchronized concurrent emits, then much of that emit path needs to be slowed down with internal synchronization locks (ensuring thread safety all the way through executor.transferLocal, executor.transferRemote, the batching logic, all the groupings, etc.).  

As far as I can tell, neither the existing code base nor the STORM-2306 code base supports such unsynchronized concurrent emits. But it's good to be explicit about that to users.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-08 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
**FYI:** After rebasing to the latest master I see a drop in peak throughput from the original ~8.1 mil/sec to ~7.2 mil/sec. Lots of changes have come in since the version of master I had originally based this on, so it is unclear what is responsible for the drop. Will have to defer that analysis for later.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-07 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
FYI: Regarding improving LoadAwareShuffleGrouping, I spent some days and published my improvement as #2261. As of my latest update, compared to the current code, #2261 is about 4x faster in a single-thread benchmark and 2x faster in a two-thread benchmark.
I need to run the same benchmarks against ShuffleGrouping, but I guess #2261 should be close to the current ShuffleGrouping, based on the fact that chooseTasks() now does the same thing ShuffleGrouping does.
(The ShuffleGrouping in the current PR is optimized for single-threaded use and drops thread safety, so they can't be compared directly.)


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-04 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
Have updated the PR with these 2 major changes, in addition to addressing many of the smaller TODOs I had. 

1) **Introduced Wait Strategy** - For these two cases: a bolt having no incoming data, and BackPressure mode when a spout/bolt can't write to a downstream bolt's full queue. The wait strategy implementation for the case when the spout has no data to emit remains unchanged. The new "Progressive" wait strategy is a dynamic strategy that can go into deeper wait states progressively (see the sketch after this list). It can be tuned to be more or less conservative, if the user desires to make different trade-offs in latency, CPU usage and throughput.

2) **Changes to default settings** - So far my defaults were aggressively geared towards optimizing high-throughput cases. Based on the concerns I have received, I have tweaked them to favor tight latency (batchSz=1) and conserving CPU usage under low/medium throughput (sleep strategy settings). So high-throughput modes may see some impact depending on topology.
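
The sketch below shows the general idea behind a progressive wait strategy (the thresholds and wait levels here are purely illustrative, not the ones implemented in this PR):

```java
import java.util.concurrent.locks.LockSupport;

// Sketch: stay cheap (spin) at first, then back off into progressively deeper
// wait states the longer the caller remains idle, trading latency for CPU.
class ProgressiveWaitSketch {
    private int idleCount = 0;

    void idle() {
        idleCount++;
        if (idleCount < 100) {
            // level 1: busy spin (lowest latency, highest CPU)
        } else if (idleCount < 1_000) {
            Thread.yield();                      // level 2: briefly give up the core
        } else {
            LockSupport.parkNanos(1_000_000L);   // level 3: ~1 ms park (lowest CPU)
        }
    }

    void reset() {
        idleCount = 0;                           // call when work arrives again
    }
}
```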

@revans2 To summarize the problems you have brought up so far:

- **At low throughput, latency is poor.** This issue was due to the defaults not being suited for these modes (primarily the flush tuple period of 5 seconds). The new defaults should work better. For high-throughput modes, the batching & flush tuple rate will need to be customized.
- **At low throughput, high CPU usage.** This was due to the missing wait strategy, so it should also be addressed by the introduction above.



I want to make a few points to be considered for test runs:
- Ideally, settings need to be tweaked appropriately for low-throughput vs. high-throughput vs. latency-centric runs. Using the same defaults for all will yield suboptimal numbers at some end of the spectrum. My tips for this PR in higher-throughput modes: batchSz of ~1000, additional ackers, and finally consider less frequent flushing if it is not impacting latency (see the config sketch after this list).

- My rule of thumb for optimal executor count: if spout/bolt executors are likely to be very busy (high-throughput modes), then 1 executor per physical core is a good sweet spot for overall perf. ACKers (if enabled) and the WorkerTransferThd (in multi-worker mode) require their own cores, as they tend to stay very busy. 

- max.spout.pending will be needed in high-throughput multi-worker runs due to the netty client issue raised by Bobby.

- Bottlenecks: the ACKer is a known throughput bottleneck. The inter-worker path bottlenecks around ~2.8 mil/sec with STORM-2306.

- Worth trying some runs with ACK disabled, as well as Back Pressure mode.
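
As a config sketch, a high-throughput run along the lines of the tips above might start from something like this (the values are examples taken from this discussion, not defaults shipped with the PR):

```java
import org.apache.storm.Config;

public class HighThroughputConfSketch {
    public static Config build() {
        Config conf = new Config();
        conf.put("topology.producer.batch.size", 1000);     // larger batches for throughput
        conf.put("topology.flush.tuple.freq.millis", 100);  // flush interval: tune against the latency budget
        conf.setNumAckers(2);                                // the ACKer is a known bottleneck; add one more
        conf.setMaxSpoutPending(1000);                       // still needed for high-throughput multi-worker runs
        return conf;
    }
}
```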



---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-08-01 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 
- Perhaps it is best to wait till I introduce the sleep strategy before retrying those lower-throughput runs. This is my top priority right now.
- Great inputs on max.spout.pending. My thoughts on the same:

1. Yes. We should not remove it until the netty issue is fixed. Good to aim for fixing this in Storm 2.0, as max.spout.pending is only useful in ACK mode.
2. Not clear about this issue.
3. Yes, the LoadAware model makes sense from a logical standpoint. Its perf bothered me enough that I felt it might be hurting as well as helping the larger cause. I anticipated some "backpressure" on this decision. I am ok with re-enabling it and documenting it as something to consider for perf tuning. For some of this perf testing, IMO we need to be conscious of when to enable/disable it. It would be great if someone could look into improving its perf.
4. Will address the timers issue in a separate post.
5. Unclear about some aspects of this issue, but deprecation of cyclic topos seems reasonable to me.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-31 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik Sorry, I just reread your other post and it looks like I missed some of your questions.

`topology.producer.batch.size` and `topology.flush.tuple.freq.millis` are whatever the default values are.  I didn't modify them.  I am happy to explore the settings there too, but I just have not spent time on it; it may take me a few days as I have other work to do too.

I agree that when we throttle the spout it is not able to truly gauge the maximum throughput possible.  TVL was not designed for that, which is why the latency measurements are based on a simulation of when a message would have been inserted into something like Kafka, instead of when the message arrived in the spout.  If you do set the desired throughput higher than what the topology can handle you are able to get an idea of what the maximum throughput is, but other measurements are likely to be off, so it is of limited value for what I was trying to do.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-31 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
Issues that need to be addressed to remove `max.spout.pending` (sorry about 
the wall of text).

1. Worker to Worker backpressure.  The netty client and server have no 
backpressure in them.  If for some reason the network cannot keep up the netty 
client will continue to buffer messages internally in netty until you get 
GC/OOM issues like described in the design doc.  We ran into this when using 
the ZK based backpressure instead of `max.spout.pending` on a topology with a 
very high throughput and there was a network glitch.
2. Single choke point getting messages off of a worker.  Currently there is 
a single disruptor queue and a single thread that is responsible for routing 
all of the messages from within the worker to external workers.  If any of the 
clients sending messages to other workers did block (backpressure) it would 
stop all other messages from leaving the worker.  In practice this negates the 
"a choke up in a single bolt will not put the brakes on all the topology 
components" from your design.  And as the number of workers in a topology grows 
the impact when this does happen will grow too.
3. Load aware grouping needs to be back on by default and probably improved 
some.  Again one of the stated goals of your design for backpressure is "a 
choke up in a single bolt will not put the brakes on all the topology 
components".  Most of the existing groupings effectively negate this, as each 
upstream component will on average send a message to all down stream components 
relatively frequently.  Unless we can route around these backed up components 
one totally blocked bolt will stop all of a topology from functioning.  If you 
say the throughput drops by 20% when this is on, then we probably want to do 
some profiling and understand why this happens.  
4. Timer Tick Tuples.  There are a number of things that rely on timer 
ticks.  Both system critical things, like ackers and spouts timing out tuples 
and metrics (at least for now); and code within bolts and spouts that want to 
do something periodically without having a separate thread.  Right now there is 
a single thread that does these ticks for each worker.  In the past it was a 
real pain to try and debug failures when it would block trying to insert a 
message into a queue that was full.  Metrics stopped flowing.  Spouts stopped 
timing things out, and other really odd things happened that I cannot remember 
off the top of my head.  I don't want to have to go back to that.
5. Cycles (non DAG topologies).  I know we strongly discourage users from 
building these types of things, and quite honestly I am happy to deprecate 
support for them, but we currently do support them.  So if we are going to stop 
supporting it, let's get on with deprecating it in 1.x with clear warnings etc.  
Otherwise when this goes in there will be angry customers with deadlocked 
topologies.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-31 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik I am happy to retry it with max spout pending disabled, but in my testing I found that disabling it negatively impacted the performance. My initial tests (prior to modifying TVL to have lower parallelism) showed that it was having a lot of trouble with GC slowing it down.  It could not handle 150,000 sentences per second, and would max out at about 120,000 to 130,000

```
15 1 -c topology.workers=1 -c topology.acker.executors=2
```

But when I added in a maximum of 500

```
15 1 -c topology.workers=1 -c topology.acker.executors=2 -c 
topology.max.spout.pending=500
```

it was able to easily keep up.

Also later on I was trying to tune it to an optimal value, and I tried 
several different values for it.

```
30 1 -c topology.workers=1 -c topology.acker.executors=1 -c 
topology.max.spout.pending=1000 3 wc-test 1 1 1
```
which maxed out the throughput at about 230,000 sentences per second, but setting it to 2000

```
30 1 -c topology.workers=1 -c topology.acker.executors=1 -c 
topology.max.spout.pending=2000 3 wc-test 1 1 1
```

dropped that maximum to 100,000. At this time I didn't spend the time to really dig in and see what the bottleneck was, like I did before, so I cannot say whether it was GC or not.

I am also opposed to removing `max.spout.pending` entirely until several 
issues with its removal can be addressed, but I'll address that in a separate 
post as it is kind of long and complicated.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-29 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2, Many thanks for spending your cycles on this kind of deep testing. It's really valuable for building confidence and achieving the stability we desire.

I see one key issue with this latest test run: please disable max.spout.pending for 2306. It is not needed anymore and will be removed shortly in my updates to this PR (noted in the design doc). I think your values of 1000 & 1500 are constraining its throughput. You can set max.spout.pending to whatever suits the master best.

- What were the values used for **topology.producer.batch.size** & **topology.flush.tuple.freq.millis** in the 2306 runs? For low-throughput testing, a batchSize of 1 is the right setting for 2306.

Suggestion: since the ACKer is a known bottleneck, I think 2 ackers, 1 spout, 1 split bolt, 1 count bolt might be a more suitable version of 'flavor A' for a 4-core machine. 

Throughput vs CPU is worth re-measuring once I provide the fix for the busy polling.

I am in agreement with all that you said in that post about the user's thought process and TVL's use case. For 2306, I find TVL unsuitable for gauging throughput limits due to the spout throttling. But it appears that is not your goal at the moment.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-28 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
My benchmark results with Throughput Vs Latency.

A side note on testing methodology.  I strongly disagree with some of the 
statements made about testing methodology.  I am happy to have a discussion 
with anyone about why Throughput Vs Latency is the way it is.  If we want to 
file a separate JIRA, resolve these differences of opinion and come up with a 
preferred methodology I am happy to.  Just so that others understand my 
methodology I want to summarize it here.  An end user usually knows a few 
things about what they want to build.

1. An estimated throughput range that the topology will need to handle.
2. A budget
3. A target latency after which the data is not good any more (or, perhaps more accurately, a function that describes the drop in value of the data as it ages, aka an SLA/SLO).

As a user I want to be able to measure my topology and see what it is going to cost me to achieve both 1 and 3, and if not, what I need to adjust to make this work, i.e. raise the budget because the data is important, or live with longer latency because the value does not drop off too fast.  The Throughput Vs Latency test is not intended to measure the maximum throughput that a topology configuration can handle.  It is intended to be a building block where you run it multiple times, varying the throughput and measuring the cost (CPU/Memory) and latency at each throughput level.  I'll leave it at that for now.

For this test I ran these on the same laptop I listed 
[above](https://github.com/apache/storm/pull/2241#issuecomment-318102321).  I 
ran using 2 flavors of topology.  I modified Throughput Vs Latency to let me 
set the exact parallelism of all of the components.

 * Flavor A has 1 acker 1 spout 1 split bolt 1 count bolt and a max spout 
pending set to 1000.  This was optimized for maximum throughput under STORM-2306
 * Flavor B has 2 ackers 2 spouts 2 split bolts 3 counts bolts and a max 
spout pending set to 1500.  This was optimized for maximum throughput under 
master (450ed63)

I ran all of these at different throughput values and against both versions 
of storm to give a better apples to apples comparison.  If a topology could not 
keep up with the desired throughput I threw out the results, as the latency and 
CPU used reported are invalid for that throughput.  For these tests I am only 
using CPU as a measure of cost, because I didn't have a simple way to compare 
memory.  I can if we really want to, but I didn't want to have to parse the gc 
log files, and under STORM-2306 the system bolt's metrics are not being 
reported any more which would have collected them automatically for me. For CPU 
I logged the output from top and pulled out the CPU usage when the topology had 
been running for 1 min.

RESULTS:

![chart 
1](https://user-images.githubusercontent.com/3441321/28734894-97e793f6-73a8-11e7-9703-39daed4208d3.png)

As was already discussed the latency at low throughput for STORM-2306 needs 
to be addressed and can be seen here.  But if we zoom into 1 second maximum 
latency.

![chart 
2](https://user-images.githubusercontent.com/3441321/28734943-c4c97254-73a8-11e7-8c08-0f3db98a7bb2.png)

it is clearer to see what is happening at the low end.  I also graphed the 
throughput vs the cost (just CPU)

![chart 
3](https://user-images.githubusercontent.com/3441321/28735082-56aec0b6-73a9-11e7-8a97-dad37cfb3ff6.png)

It is interesting, but I think it is more informative to see it as the 
average cost to process 100 tuples per second.

![chart 
4](https://user-images.githubusercontent.com/3441321/28735110-724a53e4-73a9-11e7-886d-2f9dde991593.png)

Again because of the low throughput issues it is helpful to zoom in on the 
low end.

![chart 
5](https://user-images.githubusercontent.com/3441321/28735127-86a315e2-73a9-11e7-9213-5650d65c9b15.png)

Here is the data that I used.


  | Latency ms | cores CPU | cost/100 tuple/sec | Latency ms | cores CPU | cost/100 tuple/sec | Latency ms | cores CPU | cost/100 tuple/sec | Latency ms | cores CPU | cost/100 tuple/sec
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
THROUGHPUT | master flavor A | master flavor A | master flavor A | master flavor B | master flavor B | master flavor B | STORM-2306 flavor A | STORM-2306 flavor A | STORM-2306 flavor A | STORM-2306 flavor B | STORM-2306 flavor B | STORM-2306 flavor B
500 | 14.26 | 65.90 | 13.18 | 13.20 | 69.24 | 13.85 | 3,978.30 | 137.00 | 27.40 | 12,029.30 | 350.60 | 70.12
1,000 | 14.38 | 60.20 | 6.02 | 12.76 | 77.69 | 7.77 | 2,206.20 | 130.20 | 13.02 | 8,975.81 | 367.50 | 36.75
2,000 | 14.52 | 70.31 | 3.52 | 12.58 | 82.89 | 4.14 | 1,163.92 | 134.20 | 6.71 | 5,075.11 | 352.00 | 17.60
4,000 | 14.91 | 68.09 | 1.70 |

[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-28 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik I appreciate your last comment and trying to summarize the 
concerns that have been raised.

> 1. Better handling of low throughput Topos.

Yes, lower CPU usage and lower latency by default.  If all this takes is changing some default configs then let's do that.  I am very concerned with having something that requires a lot of manual tuning.  Most users will not know how to do it, will end up copying and pasting something off the internet, and will get it wrong. That is why I was running my tests with out-of-the-box performance.

I also want to be sure that we pay attention to a mixed use case topology 
like with DRPC queries.  You may have one part of your topology that has high 
throughput, aka the data path.  And yet there is another part of the topology 
(DRPC control/query path) that has very low throughput.  Waiting seconds for a 
DRPC query to fill a batch that will never fill is painful.

> 2. TVL topo: Able to run this ...

OK, but to me it was just an indication that something had changed 
drastically and not necessarily in a good way.  My big concern is not TVL.  I 
don't really care much about that (and we can discuss benchmark/testing 
methodologies on a separate JIRA).  It is that with STORM-2306 there are some 
seriously counter intuitive situations (low throughput which you already called 
out) and some really scary pitfalls (the CPU contention which was [mentioned 
above](https://github.com/apache/storm/pull/2241#issuecomment-318494665)).  But 
I want to be sure that it is addressed in some way.  As an end user I see that 
one of my bolts is backed up, so I increase the parallelism and the performance 
gets much worse with no indication at all in any logs or metrics why it got 
worse.  At a minimum we need a good way to know when this is happening, and 
ideally have the performance degrade gracefully instead.

> 3. Bug in Multi worker mode prevents inter-worker communication.

I was wrong; this works.  I was just seeing messages time out because of the 
original problem with the host being overloaded and interpreted it wrong.

> 5. Some "real-world topology" runs as in addition to benchmark style 
topos.

Yes, and preferably ones that are running on more than one machine.  Ideally 
some that have multiple different topologies running at the same time on the 
same cluster too so we can see what happens when there is CPU contention.  Also 
I would like to add that it would be good to observe someone who has a 
currently working topology try and run it under the new system.  It might help 
us see where we need better documentation or to adjust default settings.

...

> 6. Get some more runs of TVL.

I am happy to provide some of that.  I spent some time on it the past few days trying to understand better how this patch compares to what is on master, but I'll put that in a separate post as this is getting long already, and I may have to talk about benchmark methodology some.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-28 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
Based on the comments, I have the following summary of concerns raised:
 
1. Better handling of low-throughput topos. There appear to be two sub-issues:
 - CPU usage is high.
 - Higher latency was noticed.
 
2. TVL topo: being able to run this topo with similar or better performance compared to master, in particular the concerns over CPU usage and latencies. Here I feel the default settings are not suited for the modes it is being run in, but overall this is related to pt #1 IMO.
 
3. Bug in multi-worker mode prevents inter-worker communication.
 
4. Concern over TODOs noted in code. These seem to be giving the impression that there is a lot of unfinished work. Actually these are important things to do, but none of them are big tasks. The intention is to address them all soon in this PR. There are maybe two TODOs in the code for perf bottlenecks that I don't plan to address; they were hard to find, so I felt it's valuable to retain them alongside the offending line of code. If people prefer to remove these TODOs, I can do that.
 
5. Some "real-world topology" runs in addition to benchmark-style topos.
 
 
We may find more issues as this progresses, but here is my assessment of 
the known pending work:
 
1. Fix the high CPU usage for low throughput topos. I think this needs a 
sleep strategy in the bolt. The spout already has it. Right now it does busy 
polling.
2. Look into (and fix whatever is causing) the higher latencies reported 
for TVL in low throughput mode
3. Fix the multi-worker mode bug
4. Fix all the TODOs noted
5. Get some more TVL runs and share results
6. Some more runs with other topos and share results
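
For item 1, here is a minimal sketch of what a bolt-side wait strategy could look like. The class and method names are hypothetical (they are not taken from this PR); the idea is simply that the consumer backs off progressively (spin, then yield, then park) after consecutive empty polls instead of busy polling.

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch of a progressive backoff strategy for an idle bolt
// consumer loop; IdleWaitStrategy is an illustrative name, not a class in this PR.
class IdleWaitStrategy {
    private static final int SPIN_LIMIT = 100;              // stay hot for short gaps
    private static final int YIELD_LIMIT = 1_000;           // then start yielding the CPU
    private static final long MAX_PARK_NANOS = 1_000_000L;  // finally park up to 1 ms

    /** Called after an empty poll, with the count of consecutive empty polls so far. */
    int idle(int idleCount) {
        if (idleCount < SPIN_LIMIT) {
            // busy-spin: cheapest wakeup, keeps latency low for brief lulls
        } else if (idleCount < YIELD_LIMIT) {
            Thread.yield();
        } else {
            // park briefly so a truly idle executor stops burning a full core
            LockSupport.parkNanos(MAX_PARK_NANOS);
        }
        return idleCount + 1;
    }
}
```

The consume loop would reset the idle counter to zero whenever it actually dequeues a tuple, so throughput under load is unaffected and only idle executors back off.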


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread arunmahadevan
Github user arunmahadevan commented on the issue:

https://github.com/apache/storm/pull/2241
  
Agree with @HeartSaVioR. If possible, let's break this down into multiple 
patches, like (1) JCQueue replacing Disruptor, (2) changing the threading model, (3) 
micro-optimizations, and so on, which makes it easier to review and benchmark.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
Just leaving a note to make my requirements clear (they are quite simple): 

- the new system doesn't break anything that worked before
  - if breaking changes are unavoidable, they should be discussed with the Storm 
community; in the worst case we may decide not to allow the break 
- we should provide default values for the relevant variables which make most 
topologies happy
  - for this patch, that means it should show higher throughput and lower latency 
compared to the defaults on the master branch
- (optionally) we may want to provide specific values for them which make 
benchmark topologies (or user topologies which run at full speed all the time) 
happier

We may also want to document how the parameters work, how to tune them, and 
starting values for several use cases, so that users can tweak their own 
topologies and find good values for them. I guess we didn't do that before, but it 
would be great to have.

Please let me know when this patch meets my requirements. I'd rather 
treat this as WIP and just wait for that.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
Btw, I think talking about the current state is less meaningful. This patch has 
lots of TODOs and some critical identified issues, so those should be addressed, 
and after that the numbers are going to be really different. We may argue the same 
things again and again, so it may be better to hold off until this patch becomes 
really ready for review (not WIP).


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
I don't think utilizing the metrics consumer in TVL is the issue: it might 
matter if the results were close, so that the contribution of other system components 
mattered, but the latency at low rates is simply not acceptable. There is a huge gap 
between the two.

Let's say we get rid of the metrics consumer and that makes things stable; are we 
then going to push users not to use a metrics consumer? That doesn't make sense. 
I don't think that's the case, but if we believe the metrics consumer affects 
throughput and/or latency in a really odd way, that needs to be validated and fixed.

As you can see from my results, CPU was over 100% even at rate 500, and the total 
task count of the three key components was 12 (not 48, please keep in mind). All 
TVL results were captured that way. So this patch shows high CPU usage at 
baseline (say, minimal load) and fluctuates by about 80% even at rate 1, 
whereas the master branch stays around 20%.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 

About that "better than sliced bread": 
how could I not be offended... at least briefly ;-) But you could buy me 
lunch if this PR turns out better than you were initially afraid of. It was 
perhaps the lowest-latency -1 for any PR in the history of Storm. :-) 

Yes, there are rough edges and some bugs... but I do dream of being able 
to nail it all the way in one go. 

Regarding your observation about the **very high latency for low throughput** topos: 
that is clearly a problem with the batch not filling up and not getting flushed. 
The 5 sec latency corresponds to the 'topology.flush.tuple.freq.millis' setting 
(default 5 sec). So if, at each step between Spout->Bolt and Bolt->Bolt, the batch is 
waiting for 5 sec, then you are likely to see such ridiculous latency numbers.

Given that, I think the solution must be evident by now... but I will state 
it here for the benefit of other readers to whom it may not be:

Tweak one or both of these two settings:
- **topology.producer.batch.size**: for low throughput topos, setting this 
to 1 is a good idea. In the new system the throughput penalty for a small batch 
is not that much compared to a larger batch size.  
- **topology.flush.tuple.freq.millis**: You could reduce this for more 
frequent batch flushing. It causes the timer thread to take up more CPU... but 
if throughput fluctuates between very high and very low over the day, then this 
setting may be the better way to meet a latency SLA.
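
As a concrete illustration, here is a minimal sketch of how a low-throughput, latency-sensitive topology might apply the two settings above when building its config. The string keys are the ones discussed in this thread; whether they are also exposed as `Config` constants in the final patch is an assumption, and the values are just starting points.

```java
import org.apache.storm.Config;

public class LowThroughputTuning {
    public static Config conf() {
        Config conf = new Config();
        // Emit tuples without waiting for a producer batch to fill up.
        conf.put("topology.producer.batch.size", 1);
        // Flush partially filled batches every 500 ms instead of the 5 sec default.
        conf.put("topology.flush.tuple.freq.millis", 500);
        return conf;
    }
}
```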



---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread harshach
Github user harshach commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 I am trying to reproduce the worst-case in your last chart. 
Running the TVL topology with 4 spouts, 10 splitters, 4 counters, and 2 ackers.  Here is 
the code:
https://gist.github.com/harshach/73dae347c178ac5dd8651cb0e7902412
Running it via the following command against master and STORM-2306:
`/bin/storm jar /tmp/storm-starter-2.0.0-SNAPSHOT.jar 
org.apache.storm.starter.ThroughputVsLatency 500 1 -c topology.workers=1 -c 
topology.max.spout.pending=500 -c topology.acker.executors=2`

You can look at my results here 
https://docs.google.com/spreadsheets/d/1wPpC3YXp-vTIelRTUVoLxuxIYxUekiIHUpZ2ZEysC4Y/edit#gid=1239810430
 in **sheet 2**
What I see is not much difference between master and STORM-2306. Let me know 
if I am missing something in running this test.



---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
Some points covering prev comments by @HeartSaVioR and @revans2 

**Throughput limiting:** That only makes sense if you are measuring 
throughput vs. CPU/other resource usage.  Latency measurements do not need it, 
and it's a sin if you are doing it when trying to measure maximum throughput.

**TVL topology:** 
- Given its rate-limiting nature, it definitely does not have the right 
name. Its use of very high thread counts and rate-limited spouts 
appears to be tuned to work within the limitations of the current messaging system 
and target its old sweet spot; that deserves a question.  Harsha's measurements 
(which are more sensible in terms of executor counts) show that the current 
messaging system was brought to its knees very quickly once the rate limiting went 
away.  


@revans2 
The drop you are seeing with the increase in splitter counts is indicative 
of the increased CPU contention that goes on even when not enough data is flowing 
through an executor (the issue you initially brought up... high CPU usage 
for idle topos).  In the old system, the executor seems to spend more time 
sleeping when there is insufficient data flow, so there is less CPU contention, and 
adding redundant/idle executors does not affect it as much. So you see the 
throughput plateau. 

Lowering the CPU contention in idle mode is something I plan to address... 
and I think I have left some TODOs for myself in the code already to keep me 
honest.



---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
For the new messaging system... the scaling rule of thumb I have found so 
far is quite simple. 

For fast topos (and CPU intensive topos) ... 1 executor thread per 
*physical core*.  It applies to ACKer executors as well. Avoid trying to max 
out on logical cores / hyperthreads.

You are likely to be close to getting the most out of your hardware with 
that rule. You can start with that and try adding/removing one or more 
executors to see if you can squeeze more.  

The older system will typically need more executors per machine to get 
similar numbers (throughput usage)... but its throughput may not come close to 
the new system.

The rule for executor count vs. CPU cores in the existing messaging system 
seems less simple to me.

Trying to run 51 executors on a 4-core machine will surely be a step 
towards "de-scaling", if there is such a word. It is strange that such a high 
executor count was useful in the current system.
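
To make the rule of thumb concrete, here is a hedged sketch of sizing a topology to the physical core count. `MySpout` and `MyBolt` are placeholder components (not classes from this PR), and the division by 2 assumes a hyper-threaded machine, so treat the numbers as a starting point rather than a formula.

```java
import org.apache.storm.topology.TopologyBuilder;

public class CoreSizedTopology {
    public static TopologyBuilder build() {
        // Rough estimate of physical cores: logical cores / 2 on a
        // hyper-threaded box (an assumption; check your actual hardware).
        int physicalCores = Math.max(1, Runtime.getRuntime().availableProcessors() / 2);

        TopologyBuilder builder = new TopologyBuilder();
        // One executor per physical core as a starting point, per the rule above.
        // MySpout and MyBolt are placeholder user components.
        builder.setSpout("spout", new MySpout(), physicalCores);
        builder.setBolt("bolt", new MyBolt(), physicalCores).shuffleGrouping("spout");
        return builder;
    }
}
```

From there, add or remove one executor at a time and re-measure, as described above.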


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
I have another chart now showing a comparison between master and this 
branch, just varying the number of splitter bolts in the topology.  There are 2 
ackers, 4 spouts, and 4 count bolts all within a single worker and with a max 
spout pending set to 500.  All of the configs are the defaults, and it is on my 
laptop like before.


![chart](https://user-images.githubusercontent.com/3441321/28689893-de51856c-72dc-11e7-8f5b-8f2a77efdfd3.png)

The scary thing here is that with these changes there is a tiny window 
where you get "good" throughput (let's say above 150k sentences per second) for 
this branch.  The previous branch has a very, very wide window.  The thing that 
concerns me the most with the way it is now is that there will be a lot of 
people who didn't tune the parallelism as low as possible, because it just 
worked, and they will all have their topologies go from 180k/sec down to 
50k/sec.  And tuning them all perfectly to balance on the exact parallelism 
for the given heterogeneous hardware that we run on is going to be impossible.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
I have run some more tests looking at modifying the parallelism of 
different components.

First I kept the parallelism of everything else at 4 and modified the acker 
count.

![chart_ackers](https://user-images.githubusercontent.com/3441321/28684646-bc460734-72ca-11e7-9434-8bdf2c263cab.png)

I also kept the ackers at 2, the spout and count at 4, and modified the splitter 
count.

![chart_splitters](https://user-images.githubusercontent.com/3441321/28684647-bc462f5c-72ca-11e7-91f8-0a4e1c748682.png)

The acker drop off at 5 is really scary,  but adding too many splitters 
also shows a lot of problems.  I am going to try something similar without the 
patch for comparison.

Overall the numbers look really good in some situations, but it is really 
easy to slip into much worse territory.  @knusbaum said that he was able to get 
a multi-worker setup to work, so that is something else I want to explore. 



---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
Sorry, but we also need to think about low throughput use cases.  I have 
several that I care about, and I am seeing very long latency at low throughput. 

The min in some cases is 5 seconds, the max can be up to 20 seconds, the average is 
around 10 seconds, and the CPU utilization is 500%.  This too needs to be 
addressed.

```
500 1 -c topology.workers=1
uptime:   30 acked: 4,000 acked/sec: 133.33 failed:0 99%:   
9,923,723,263 99.9%:   9,999,220,735 min:  79,036,416 max:  10,015,997,951 
mean: 5,861,829,371.65 stddev: 2,744,502,279.38 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:   60 acked:15,000 acked/sec: 500.00 failed:0 99%:  
14,646,509,567 99.9%:  14,973,665,279 min:  53,084,160 max:  15,023,996,927 
mean: 7,410,713,531.31 stddev: 3,187,842,885.35 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:   90 acked:16,000 acked/sec: 533.33 failed:0 99%:  
14,747,172,863 99.9%:  14,990,442,495 min:  37,486,592 max:  15,032,385,535 
mean: 7,947,532,282.45 stddev: 3,104,232,967.22 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:  120 acked:14,000 acked/sec: 466.67 failed:0 99%:  
14,856,224,767 99.9%:  14,998,831,103 min:  65,208,320 max:  15,023,996,927 
mean: 9,071,752,875.48 stddev: 3,337,053,852.19 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:  150 acked:13,000 acked/sec: 433.33 failed:0 99%:  
14,914,945,023 99.9%:  14,998,831,103 min:   4,999,610,368 max:  15,074,328,575 
mean: 10,374,946,814.88 stddev: 2,794,778,136.42 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:  180 acked:16,000 acked/sec: 533.33 failed:0 99%:  
14,940,110,847 99.9%:  15,049,162,751 min:   5,007,998,976 max:  15,602,810,879 
mean: 10,539,964,609.74 stddev: 2,796,155,497.39 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:  210 acked:15,000 acked/sec: 500.00 failed:0 99%:  
14,881,390,591 99.9%:  14,998,831,103 min:   5,003,804,672 max:  15,015,608,319 
mean: 9,616,077,147.72 stddev: 2,781,415,317.06 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:  240 acked:10,000 acked/sec: 333.33 failed:0 99%:  
14,889,779,199 99.9%:  15,007,219,711 min:   5,003,804,672 max:  15,015,608,319 
mean: 9,840,073,724.86 stddev: 2,806,028,726.32 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:  270 acked:16,000 acked/sec: 533.33 failed:0 99%:  
17,951,621,119 99.9%:  19,780,337,663 min:   5,003,804,672 max:  20,015,218,687 
mean: 10,556,609,171.18 stddev: 3,010,780,308.43 user:  0 sys:  
0 gc:  0 mem:   0.00
uptime:  300 acked:15,000 acked/sec: 500.00 failed:0 99%:  
14,898,167,807 99.9%:  14,998,831,103 min:  51,445,760 max:  15,023,996,927 
mean: 9,694,508,448.06 stddev: 3,087,190,409.09 user:  0 sys:  
0 gc:  0 mem:   0.00
```

I am fine with the goals and the design work being done for this.  If you 
can do better than the stuff I did for Disruptor, by all means rip out my code 
and make things better.  The low throughput issue was one I had to fix with my 
initial patches to Disruptor.  People do care about this.  I am not trying to 
be a jerk, I am just trying to keep my customers happy, share some of my 
experience doing something similar in the past, and also hopefully make Storm 
much, much better in the end.

I apologize if I offended anyone.  It was not my intention, but I really 
was shocked to see a patch everyone was touting as better than sliced bread be 
decidedly worse in every way for a topology that worked really well before.  I 
was able to max out the default configuration at a parallelism of 4 with 100,000 
sentences per second fully acked.  The new patch could only handle 1/3rd of 
that, and not when there is more than 1 worker.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@harshach 

Reiterating what @HeartSaVioR said about benchmarking: most benchmarking 
is done by pushing a system to its limits to see what maximum throughput 
it can do.  This is far from what a real user wants.  It looks good for a 
vendor to brag "I can do X but that other vendor over there can only do Y," 
but it is close to worthless for what real users want to know.

Real users are trying to balance the cost of the system in $ (CPU time + 
memory used become this, how many EC whatever instances do I need), the amount 
of data that they can push through the system and how quickly they can get 
results back.  Each of these variables are reflected by this test.  In most 
cases a user has a set load that they know they get typically, and a reasonable 
guess at a maximum load that they expect to see.  Also most users have a 
deadline by which the data is no good any more, if not they should be using 
batch.  And a budget that they have to spend on this project, if not call me I 
want to work for you and my salary requirements are very reasonable.

You need to give users tools to explore all three, and because the three 
are intertwined you want to be able to hold one or two of the variables 
constant while you measure the others.  Storm currently has no way to set a 
target SLA (I hope to add one eventually), but you can control the rate at 
which messages arrive and the parallelism of the topology (which reflects the 
cost).  So the goal is to scan through various throughput values and various 
parallelisms to see what the latency is and what resources are actually used.  
In the real world we would adjust the heap size and parallelism accordingly.

Complaining about a benchmark creating 51 threads relates to the 
parallelism that we want to explore.  If that is what I did wrong in the 
benchmark I am happy to adjust and reevaluate.  I want to understand how the 
parallelism impacts this code.  The thing that concerns me now is that it 
appears that scaling a topology is very different now, and I want to understand 
exactly how that works.

I cannot easily roll out a change to my customers saying things might get a 
lot better or they might get a lot worse.  We need to make it easy for a user 
with a topology that may not have been ideal (but worked well), to continue to 
work well.






---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
I have updated all the results for TVL with the second parameter set to 1, and also 
added rate 5.
The CPU usage on current master doesn't fluctuate at any of the rates, even 
5, whereas with this patch the CPU usage sometimes fluctuates around 100%.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
Let me share a quick test result with passing `1 1` to TVL parameter:

> STORM-2306

```
uptime:   30 acked:   144,000 acked/sec:   4,800.00 failed:0 99%:   
3,070,230,527 99.9%:   3,221,225,471 min:  63,897,600 max:   3,380,609,023 
mean: 1,299,365,069.36 stddev:  685,287,508.16 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:   60 acked:   303,000 acked/sec:  10,100.00 failed:0 99%:   
3,011,510,271 99.9%:   3,200,253,951 min:  28,540,928 max:   3,303,014,399 
mean: 1,283,728,691.41 stddev:  671,791,145.42 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:   90 acked:   297,000 acked/sec:   9,900.00 failed:0 99%:   
3,047,161,855 99.9%:   3,307,208,703 min:  62,980,096 max:   3,737,124,863 
mean: 1,283,141,447.64 stddev:  675,126,086.16 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:  120 acked:   303,000 acked/sec:  10,100.00 failed:0 99%:   
3,047,161,855 99.9%:   3,206,545,407 min:  31,965,184 max:   3,347,054,591 
mean: 1,284,140,763.79 stddev:  690,625,730.54 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:  150 acked:   299,000 acked/sec:   9,966.67 failed:0 99%:   
3,072,327,679 99.9%:   3,231,711,231 min:  16,703,488 max:   3,414,163,455 
mean: 1,320,620,493.23 stddev:  693,327,734.87 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:  180 acked:   300,000 acked/sec:  10,000.00 failed:0 99%:   
3,042,967,551 99.9%:   3,248,488,447 min:  48,005,120 max:   3,846,176,767 
mean: 1,313,068,274.86 stddev:  671,810,427.83 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:  210 acked:   301,000 acked/sec:  10,033.33 failed:0 99%:   
3,061,841,919 99.9%:   3,363,831,807 min:  51,347,456 max:   3,802,136,575 
mean: 1,297,807,219.57 stddev:  678,980,965.35 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:  240 acked:   301,000 acked/sec:  10,033.33 failed:0 99%:   
3,019,898,879 99.9%:   3,208,642,559 min:  36,962,304 max:   3,363,831,807 
mean: 1,315,037,518.24 stddev:  676,620,121.79 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:  270 acked:   297,000 acked/sec:   9,900.00 failed:0 99%:   
3,026,190,335 99.9%:   3,200,253,951 min:  52,363,264 max:   3,349,151,743 
mean: 1,308,161,023.51 stddev:  680,121,348.29 user:  0 sys:  0 
gc:  0 mem:   0.00
uptime:  300 acked:   300,000 acked/sec:  10,000.00 failed:0 99%:   
3,021,996,031 99.9%:   3,200,253,951 min:  49,348,608 max:   3,317,694,463 
mean: 1,335,928,012.31 stddev:  667,642,145.32 user:  0 sys:  0 
gc:  0 mem:   0.00
```

CPU usage was around 150~250%, mostly around 160%, which seemed to be a 
bit more stable, but still fluctuating at this small rate.

> current

```
uptime:   30 acked:   140,440 acked/sec:   4,681.33 failed:0 99%:   
   14,016,511 99.9%:  26,558,463 min:   2,449,408 max:  52,035,583 
mean:7,800,556.68 stddev:1,790,982.79 user: 28,620 sys:  2,340 
gc:  0 mem:  49.94
uptime:   60 acked:   301,860 acked/sec:  10,062.00 failed:0 99%:   
   11,141,119 99.9%:  15,351,807 min:   3,233,792 max:  26,181,631 
mean:7,479,081.72 stddev:1,175,253.40 user: 30,270 sys:  6,800 
gc:190 mem:  54.88
uptime:   90 acked:   301,600 acked/sec:  10,053.33 failed:0 99%:   
   10,813,439 99.9%:  13,197,311 min:   3,246,080 max:  16,138,239 
mean:7,375,841.06 stddev:1,112,541.35 user: 31,660 sys:  7,160 
gc:194 mem:  54.68
uptime:  120 acked:   301,460 acked/sec:  10,048.67 failed:0 99%:   
   11,042,815 99.9%:  13,828,095 min:   3,266,560 max:  17,285,119 
mean:7,400,672.94 stddev:1,130,409.32 user: 29,650 sys:  7,330 
gc:200 mem:  47.80
uptime:  150 acked:   301,500 acked/sec:  10,050.00 failed:0 99%:   
   10,911,743 99.9%:  13,246,463 min:   3,248,128 max:  15,654,911 
mean:7,399,041.82 stddev:1,118,368.85 user: 29,920 sys:  7,370 
gc:199 mem:  41.95
uptime:  180 acked:   301,540 acked/sec:  10,051.33 failed:0 99%:   
   10,969,087 99.9%:  13,598,719 min:   3,233,792 max:  16,302,079 
mean:7,390,435.62 stddev:1,129,976.16 user: 29,840 sys:  7,190 
gc:201 mem:  41.16
uptime:  210 acked:   301,540 acked/sec:  10,051.33 failed:0 99%:   
   11,182,079 99.9%:  14,557,183 min:   3,246,080 max:  19,513,343 
mean:7,382,121.55 stddev:1,161,863.92 user: 29,620 sys:  7,460 
gc:   
```

[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
@harshach The second argument effectively represents the worker count: you 
can see that the topology sets the worker count as the parallelism. I agree that the 
name is really misleading; even I ran tests with topology.workers instead of passing 
the second argument. (Need to run the tests again...)


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread harshach
Github user harshach commented on the issue:

https://github.com/apache/storm/pull/2241
  
@HeartSaVioR It's not 12 executors per worker.  If you don't pass a 
command-line argument, it sets the parallelism variable to 4 here 
https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/ThroughputVsLatency.java#L277
and multiplies it by 4 again here 
https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/ThroughputVsLatency.java#L359
, so it is setting a parallelism of 16 per component. 
This has nothing to do with how many workers you have.
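
For readers following along, here is a rough sketch of the arithmetic being described (the breakdown of the remaining threads is my assumption; the point is just that the defaults multiply out to roughly 48 executors plus overhead on a single host):

```java
// Illustrative arithmetic only; not code from ThroughputVsLatency itself.
public class TvlDefaultParallelism {
    public static void main(String[] args) {
        int parallelism = 4;                // default when no CLI argument is passed
        int perComponent = 4 * parallelism; // multiplied by 4 again -> 16
        int components = 3;                 // spout, splitter, counter
        int executors = components * perComponent; // 48
        // Ackers, metrics/logging consumers, etc. presumably account for the
        // difference up to the ~51 total threads mentioned elsewhere in this thread.
        System.out.println(executors);      // prints 48
    }
}
```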


---


Re: [GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread Alexandre Vermeerbergen
Hello Roshan,

Thank you for your detailed answer. Details are important, because in my
organization, I am often asked to re-assess the reasons why we chose Storm
over its competitors.

Best regards,
Alexandre Vermeerbergen


2017-07-25 23:36 GMT+02:00 roshannaik :

> Github user roshannaik commented on the issue:
>
> https://github.com/apache/storm/pull/2241
>
> @avermeer Looks like SuperChief blog is relaying the same basic claims
> that Heron has marketed. Since you ask, i will share my opinions wrt
> Heron's claims.
>
> - Heron has never been a player in the high performance club. They
> have been smart about not comparing themselves with the real top performers
> of the day. I only included them here because they have built they have
> made much noise against Storm. They are smart about not mentioning which
> version of Storm they are comparing with (how does a paper with such
> critical info missing get accepted ?). That creates an illusion in people
> that their perf claims apply to all versions of Storm in general... even if
> Storm [publishes new perf numbers](hortonworks.com/blog/
> microbenchmarking-storm-1-0-performance/) comparing itself to a prior
> version.
> - Heron's threading model (1 thread per process.. based on what i
> gather from their articles), is really primitive for this application
> domain.  I don't recommend it, but by setting 'topology.workers' equal to
> the number of spout& bolt instances, Storm can be run in Heron mode.
> -  I find it much easier to debug a process with multiple components
> using a debugger rather start a separate debugger for every instance of
> spout bolt running. Also, I would imagine, having so many processes means
> you have an explosion of log files to deal with when triaging.
> - Unclear why the recovery model (when worker process crashes) is any
> better ... the same kind of replay from the spout would be required. The
> gains may be minor if any. Making minor optimizations to the failure path
> and penalizing the normal operation path... is backwards.
> - Cant get a stack from a Storm worker ? Thats clearly false. Try it
> yourself. I do it all the time. Heapdumps, on the other hand, can stall the
> worker and if the heap size is really large the supervisor might feel the
> worker is having a problem. There are timeouts that you can increase to for
> the supervisor to wait longer. I cant imagine that Heron doesn't monitor
> their workers and restart them if they are not responsive.
> -  Heron's Backpressure model is simply too overweight, but marketed
> as a novel idea.
> - A quick read of their latest perf blog, noted in the comparison, and
> it was evident that they missed recognizing their real perf problem.
>
>
>
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
> with INFRA.
> ---
>


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
@harshach 
For ThroughputVsLatency, throttling the spout is intended: we set a desired 
throughput and look at the histogram of latency and other metrics (CPU, GC, etc.). 
There are 3 components in the topology whose parallelism is set to 4 * worker 
count, so 12 executor threads per worker in total. I think we can parameterize the 
magic number 4 and adjust it while testing too.

I have also done some performance tests without modifying the TVL 
topology. The reason is that we should also care about 
non-performance-maximized topologies. For benchmarking a performance-maximized 
topology we also have ConstSpoutIdBoltNullBoltTopo, so let's not modify TVL, and 
verify this patch works in all the cases.

Since this patch doesn't seem to handle inter-worker communication 
properly, the set of tests we can run for now is very limited.

Here's my machine spec used for performance test:

```
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
```

```
OS: Ubuntu 17.04
CPU: AMD Ryzen 5 1600 3.2Ghz 6 core (with hyper-thread = 12 logical cores)
RAM: Samsung DDR4 32G 19200
SSD: Samsung 850 Evo
```

and here's my number (just pasted as raw number):

https://docs.google.com/spreadsheets/d/1J3S4R68CsazlINF60rQn4XCy2Hx5QNzqVoWSO9qo2tc/edit?usp=sharing

My observation is that this patch looks impressive with a performance-maximized 
topology, but it also looks really bad (not acceptable) with a 
relatively idle topology. I've observed all the things @revans2 observed 
with the TVL tests. But this patch looks stable with ConstSpoutIdBoltNullBoltTopo, 
and even the CPU usage seems lower than stock in that test.

While we often publicize micro-benchmark results, in practice users run 
much more idle topologies.
I'm OK if things can be stabilized by adjusting parameters (in that case I 
think those should become the default values), but if not, it should be addressed 
before accepting the patch. I would be -1 if the TVL results are not stable.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-27 Thread harshach
Github user harshach commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 @HeartSaVioR 
Here are my findings 
https://docs.google.com/spreadsheets/d/1wPpC3YXp-vTIelRTUVoLxuxIYxUekiIHUpZ2ZEysC4Y/edit#gid=1644511...


1. Looking at ThroughputVsLatency I found some issues:
 - By default it adds 51 total threads, which IMO is incorrect when 
benchmarking on a 4-core machine. 
 
 - It also adds two bolts for logging/measurements, which might be impacting 
the numbers

https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/...

https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/...
 
 - It also throttles the spout

https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/...

I made the following changes:
- Disabled the HTTP and logging bolts
- Disabled the throttling of the spout; we want the spout to run as fast as it can
- Reduced the executor counts

See lines 78 - 102. 

Apache master clearly couldn't handle the faster spout and starts timing 
out; perf degrades considerably and very quickly. Whereas STORM-2306 was not only 
able to handle the faster spout, it delivered stable processing that 
starts out being 10x faster and improves to 35x faster compared to master.


2. Also ran the storm-perf topologies ConstSpoutIdNullBoltIdTopo and 
ConstSpoutNullBoltTopo. These topologies are trying to see what the message 
throughput and latency are when there are only 2 components involved, without 
including any external dependencies. Essentially testing the messaging system.

From lines 3-45 you can see that with this patch we are getting under 10 ms 
(depending on the topology) compared to an avg of 250 ms+ (with batchSize=1).

3. Also ran the storm-examples ThroughputVsLatency with 2 workers. Here there 
is clearly a bug which prevents inter-worker communication, so I don't have 
comparative numbers.



---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread harshach
Github user harshach commented on the issue:

https://github.com/apache/storm/pull/2241
  
@HeartSaVioR I don't mind breaking this into multiple PRs if it helps 
reviewing and merging. It's up to @roshannaik.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
@harshach 
As I commented multiple times, this patch is a mix of replacing the queue, 
changing the messaging subsystem, and micro-optimizations. Three committers already 
had a look at the first pass (excluding @revans2, since he stopped reviewing), 
but we couldn't identify any critical problems until @revans2 shared his test 
results. If we go on with this patch as-is, we don't even know which parts of the 
change contribute to the performance gain and how much each contributes.

If we had a PR only replacing the queue with JCTools, it would be easy to review and 
identify the benefit, or the pros/cons of replacing Disruptor with JCTools. The 
patch would be fairly simple to review and verify, and most of the issues raised 
here shouldn't be raised there (because I guess most of them are not from the queue 
replacement).


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread harshach
Github user harshach commented on the issue:

https://github.com/apache/storm/pull/2241
  
@HeartSaVioR let's keep this discussion to reviews. This is not the forum to 
discuss what one should or shouldn't tweet; that's up to individuals. Nobody is trying 
to promote something that's not feasible, so let's not try to be a moral authority 
here about what one can or cannot do. 
Regarding breaking this into multiple PRs addressing different subsystems, 
that's a reasonable ask. But let's wait before we go down that path; we need to 
look into the issues raised here and reproduce the cases.
I am running a few tests myself and I'll report my findings.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
What I can suggest from now is... breaking down STORM-2306 into "redesign the 
messaging subsystem" and "switch to JCTools queues", and trying out the latter part 
first.
Let's just have a new pull request that only replaces Disruptor with JCQueue and 
see whether it helps. I'd rather lean towards putting in the minimal effort and making 
it better incrementally. This patch mixes up everything (micro-optimization, switching 
the queue, messaging architecture change) in one, so it is hard to point out where an 
issue starts.
And let's separate the micro-optimizations (converting map usage to array lists, 
avoiding iterators, etc.) into another issue and not address them in other 
issues. I think we should see better throughput even without them if the newer 
subsystem is better and/or JCTools queues are better than Disruptor in our case.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
First of all, I think I should be more careful about likes / retweets. Sorry 
about that; I have just reverted them all.

Regarding tweeting, I think it heavily depends on which account (whose) tweets 
it.

If users or contributors are tweeting about this PR, we can take it as 
gossip and there is no problem with it.

If committers or PMC members are doing it, that could be seen as a kind of 
publicizing, especially since this PR compares with other frameworks and claims Storm 
will be the No. 1 performer. If PMC members tweet about unsafe or not-yet-stable 
source, and it turns out differently, someone could feel we (the Storm 
community) are hyping, regardless of intention.
If we say this PR could increase throughput by N times from before, that 
would be less problematic. I don't say there is no problem, but much less.

If we are doing it with the official account (@ApacheStorm), that really matters, 
because the account can be seen as speaking on behalf of the Storm community, or at 
least the PMC members. I was the one wondering why the official account retweeted a 
non-Storm-related tweet like 
https://twitter.com/BoredElonMusk/status/889935279213223936 . 
If we haven't talked about which things the official account should (not) 
tweet, we may need to take some time to talk about it.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
@harshach I am running with defaults in all cases

I build `mvn clean install -DskipTests`
package `cd storm-dist/binary; mvn clean package`
untar the result `tar -xzvf 
./final-package/target/apache-storm-2.0.0-SNAPSHOT.tar.gz`
move the directory so I can save it if I want to switch back and forth `mv 
apache-storm-2.0.0-SNAPSHOT under-test`
then bring up the cluster
```
cd under-test;
./bin/storm dev-zookeeper &
./bin/storm nimbus &
./bin/storm supervisor &
./bin/storm ui &
./bin/storm logviewer &
```

Wait for everything to come up, and I can see the ui.

Then I run some tests (I pull in the storm starter from the build because 
the packaged one does not build on its own).
```
./bin/storm jar 
../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar 
org.apache.storm.starter.ThroughputVsLatency
...
./bin/storm jar 
../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar 
org.apache.storm.starter.ThroughputVsLatency -c topology.workers=1
...
./bin/storm jar 
../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar 
org.apache.storm.starter.ThroughputVsLatency 1 -c topology.workers=1
...
./bin/storm jar 
../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar 
org.apache.storm.starter.ThroughputVsLatency 1 -c topology.workers=1 -c 
topology.max.spout.pending=500
...
./bin/storm jar 
../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar 
org.apache.storm.starter.ThroughputVsLatency 1 -c topology.workers=1 -c 
topology.max.spout.pending=1000
```


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread harshach
Github user harshach commented on the issue:

https://github.com/apache/storm/pull/2241
  
@revans2 Do you mind posting your storm.yaml, or are you running with 
defaults? We will try to see if we can reproduce the same behavior on our 
side. If there are any bugs we will work to fix them, but it shows great 
potential for perf improvements. 
Regarding posting to Twitter: yes, we are very excited about the patch and 
definitely want to share the results with the community. Not sure why you are 
getting upset about it. It's important that we make these perf improvements and 
also let the community know that there are continuous improvements in Storm. If 
you found a bug that's great; that's why we have the PR and review process in place. 


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
I tried on Linux too and got very similar results.  The CPU and memory 
usage of the topology were lower, but the actual throughput and latency of the 
topology were very similar.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
I ran again with this exact version 
(5c0db923ecd8e4e1ce0e325ee2fd0f25bae7b0c2) and got the same results.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2241
  
-1

Perhaps I am running into some odd issues here, so if I can be corrected I 
would be happy to change my vote, but nothing I have run with this patch is 
better in any way.  Are all of the results from micro-benchmarks?  Did anyone 
run a real topology with this patch before posting all of these wonderful 
results to Twitter?  I am not one to swear, but WTF?

I built a stock 2.0.0-SNAPSHOT build 
(450ed637f92c3f387681a47b4b667f17eeecac1f) and compared it to the exact same 
release with this patch merged on top of it (which was a clean merge).  I am 
running

```
$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Sierra 10.12.6
MacBook Pro (Retina, 15-inch, Mid 2015)
2.8 GHz Intel Core i7
16 GB 1600 MHz DDR3
```

I ran the ThroughputVsLatency topology with several different options and 
no changes at all to the default storm.yaml.

With this patch I found that.

1. Running a topology with more than one worker appears to not be able to 
send any messages between workers (or it takes so long most of the messages 
time out). So I switched all of my tests to a single worker.
2. When processing a nearly idle topology (500 sentences/second) the CPU 
utilization was everything that my box could give it (8 cores fully utilized), 
compared to about one half of one core used by stock Storm.
3. The latency is absolutely horrible.  The minimum latency for a somewhat 
idle topology was 1 to 4 seconds to do a word count.  For a topology processing 
10,000 sentences per second it dropped to 800 ms.  The maximum latency was 15 
seconds for all of these cases.  Compare that to stock Storm, which has a min 
latency of around 3 to 4 ms for the normal case.
4. The system bolt metrics do not work, or at least I was not able to get 
any of them back.  I tried to compare memory and CPU usage through top, which 
worked out OK.
5. Memory usage is insane.  The resident memory was 2 GB for almost all of 
the workers no matter the throughput.  That is 1 GB more than stock Storm for 
the same 10,000 sentences per second.
6. The maximum throughput is about 1/4th what it is on stock Storm.  I was 
able to do 100,000 sentences per second on my laptop.  I could do it both with 
a single worker and with 4 workers (although the CPU usage was higher in the 
latter case). With this patch I was able to do 30,000 sentences per second in 
the best case, but on average it could only do about 25,000.

I am happy to make all of my numbers public.  I also plan to run them on a 
Linux box to see if it is any different. 


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
Btw, we should be careful when we drop something we previously provided. This patch 
seems focused on performance optimization (especially micro-optimization), and in some 
spots that might not always be better. One example is disabling load-aware messaging by 
default.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik 
I didn't mean to put additional load on you. I just would like to see a 
comparison against Storm itself with the other variables controlled, not a competition 
with papers, to see the impact of this patch clearly.
I'll try to run performance tests (maybe ThroughputVsLatency or something 
in the perf module) and share the results.

I'll go through the code first.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-25 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
@avermeer 
As @roshannaik stated, we addressed many of the performance bottlenecks in 
Storm 1.0.0. The paper about Heron and the blog post on SuperChief referred to an old 
version of Storm, so their claims don't even hold for 1.0.0.
(Yes, I understand they couldn't pick 1.0.0 to compare against, since 1.0.0 was only 
released in Apr. 2016.)

I think we learned a lot and adopted some parts of it, like back pressure 
from Heron's paper (though we implemented it in a different way), but personally I 
also wonder about their claimed sources of performance gain, especially 1 computational 
thread per worker. Messages must be serialized even between workers on the same 
node, and the cost of serialization/deserialization is not that cheap.

I think there's no general way to benchmark streaming systems, so claiming that 
A is N times faster than B can hide too much detail. Most frameworks have 
their own characteristics and hence excel at a specific use case or performance 
test. Moreover, a developer from A doesn't know B deeply, so most of the time they 
fail to set the variables optimized specifically for the test.

That's why I much prefer comparing against our own history instead of 
comparing with others.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-25 Thread ptgoetz
Github user ptgoetz commented on the issue:

https://github.com/apache/storm/pull/2241
  
Let's not be too hard on the Heron community. Yes, in the past they've not 
been exactly friendly in terms of marketing and technical claims,  but now that 
Heron is incubating as an Apache project (full disclosure: I'm a mentor), 
there's a chance that that might change. A collaborative relationship with the 
Heron community has potential benefit, an adversarial one not so much.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-25 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik 
First of all, great work! The numbers are impressive.

Before going into this deeply, I would like to see a comparison between the 
current master branch and this patch (say, before and after) so that we can see 
the impact of the patch clearly. IMO this kind of comparison is basically required 
for every performance patch.

@revans2 brought awesome analysis when introducing disruptor batching. 
https://github.com/apache/storm/pull/765#issuecomment-147124664
https://github.com/apache/storm/pull/765#issuecomment-149987537

It would be great if we could see a similar analysis for this patch too, if 
you don't mind. You can put your analysis, with details, there.


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-25 Thread roshannaik
Github user roshannaik commented on the issue:

https://github.com/apache/storm/pull/2241
  
@avermeer Looks like the SuperChief blog is relaying the same basic claims 
that Heron has marketed. Since you ask, I will share my opinions on Heron's 
claims. 

- Heron has never been a player in the high-performance club. They have been 
smart about not comparing themselves with the real top performers of the day. 
I only included them here because they have made so much noise against Storm. 
They are also careful not to mention which version of Storm they are comparing 
with (how does a paper with such critical info missing get accepted?). That 
creates the illusion that their perf claims apply to all versions of Storm in 
general... even when Storm [publishes new perf 
numbers](hortonworks.com/blog/microbenchmarking-storm-1-0-performance/) 
comparing itself to a prior version.
- Heron's threading model (1 thread per process, based on what I gather from 
their articles) is really primitive for this application domain. I don't 
recommend it, but by setting 'topology.workers' equal to the number of spout & 
bolt instances, Storm can be run in Heron mode (see the sketch after this 
list).
- I find it much easier to debug one process with multiple components using a 
single debugger rather than starting a separate debugger for every running 
spout or bolt instance. Also, I would imagine, having so many processes means 
you have an explosion of log files to deal with when triaging.
- It is unclear why the recovery model (when a worker process crashes) is any 
better... the same kind of replay from the spout would be required. The gains 
may be minor, if any. Making minor optimizations to the failure path while 
penalizing the normal operation path... is backwards.
- Can't get a stack trace from a Storm worker? That's clearly false. Try it 
yourself; I do it all the time. Heap dumps, on the other hand, can stall the 
worker, and if the heap size is really large the supervisor might decide the 
worker is having a problem. There are timeouts you can increase to make the 
supervisor wait longer. I can't imagine that Heron doesn't monitor its workers 
and restart them if they are not responsive.
- Heron's back pressure model is simply too heavyweight, but it is marketed as 
a novel idea.
- A quick read of their latest perf blog (noted in the comparison) made it 
evident that they missed recognizing their real perf problem.
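
To make the "Heron mode" point above concrete, here is a minimal sketch (not a 
recommendation) of forcing roughly one spout/bolt executor per worker process 
by setting the worker count equal to the total executor count. The MySpout / 
MyBolt classes and the parallelism numbers are hypothetical placeholders, and 
ackers/system executors are ignored for simplicity.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class HeronModeSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Hypothetical components: 4 spout executors + 8 bolt executors = 12 executors.
        builder.setSpout("spout", new MySpout(), 4);
        builder.setBolt("bolt", new MyBolt(), 8).shuffleGrouping("spout");

        Config conf = new Config();
        // "Heron mode": topology.workers equals the total number of spout + bolt
        // instances, so each worker process hosts roughly one executor.
        conf.setNumWorkers(4 + 8);

        StormSubmitter.submitTopology("heron-mode-sketch", conf, builder.createTopology());
    }
}
```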



---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-25 Thread satishd
Github user satishd commented on the issue:

https://github.com/apache/storm/pull/2241
  
@roshannaik Did you run any Trident topologies in distributed mode with these 
changes?


---


[GitHub] storm issue #2241: STORM-2306 : Messaging subsystem redesign.

2017-07-25 Thread avermeer
Github user avermeer commented on the issue:

https://github.com/apache/storm/pull/2241
  
Hello Roshan,

Thank you very much for your huge amount of work on improving Storm's 
performance!

Regarding "competitive perf evaluation", would you say that now the reasons
why SuperCheif team moved away from Storm to their homebrewed streaming
processing system in 2015 (see http://blog.librato.com/posts/superchief) no
longer hold?

Best regards,
Alexandre Vermeerbergen

2017-07-25 8:58 GMT+02:00 Satish Duggana :

> *@satishd* commented on this pull request.
>
> Nice work, Roshan!
> Had an initial look at the code and left a few comments (mostly minor).
> Overall LGTM.
> --
>
> In conf/defaults.yaml
> :
>
> > @@ -146,7 +149,7 @@ supervisor.run.worker.as.user: false
>  #how long supervisor will wait to ensure that a worker process is started
>  supervisor.worker.start.timeout.secs: 120
>  #how long between heartbeats until supervisor considers that worker dead 
and tries to restart it
> -supervisor.worker.timeout.secs: 30
> +supervisor.worker.timeout.secs: 3
>
> Is this really a deliberate change?
> --
>
> In conf/defaults.yaml
> :
>
> > @@ -253,11 +247,16 @@ topology.trident.batch.emit.interval.millis: 500
>  topology.testing.always.try.serialize: false
>  topology.classpath: null
>  topology.environment: null
> -topology.bolts.outgoing.overflow.buffer.enable: false
> -topology.disruptor.wait.timeout.millis: 1000
> -topology.disruptor.batch.size: 100
> -topology.disruptor.batch.timeout.millis: 1
> -topology.disable.loadaware.messaging: false
> +topology.bolts.outgoing.overflow.buffer.enable: false # TODO: Roshan : 
Whats this ?
> +topology.disruptor.wait.timeout.millis: 1000  # TODO: Roshan: not used, 
but we may/not want this behavior
> +topology.transfer.buffer.size: 5
> +topology.transfer.batch.size: 10
> +topology.executor.receive.buffer.size: 5
> +topology.producer.batch.size: 1000  # TODO: Roshan:  rename
> +topology.flush.tuple.freq.millis: 5000
>
> nit: Better to add a comment describing this property.
> --
>
> In conf/defaults.yaml
> :
>
> > @@ -304,6 +303,7 @@ storm.cgroup.resources:
>  storm.cgroup.hierarchy.name: "storm"
>  storm.supervisor.cgroup.rootdir: "storm"
>  storm.cgroup.cgexec.cmd: "/bin/cgexec"
> +storm.cgroup.cgexec.cmd: "/bin/cgexec"
>
> may be an accidental copy, needs to be removed.
> --
>
> In storm-client/src/jvm/org/apache/storm/task/IOutputCollector.java
> :
>
> > @@ -30,4 +31,5 @@
>  void ack(Tuple input);
>  void fail(Tuple input);
>  void resetTimeout(Tuple input);
> +void flush();
>
> May want to add some javadoc for this. It seems we are ready to break
> the APIs with the new set of changes in this redesign.
> --
>
> In storm-client/src/jvm/org/apache/storm/daemon/worker/WorkerState.java
> :
>
> >
> -for (Map.Entry entry : 
grouped.entrySet()) {
> -DisruptorQueue queue = 
shortExecutorReceiveQueueMap.get(entry.getKey());
> -if (null != queue) {
> -queue.publish(entry.getValue());
> -} else {
> -LOG.warn("Received invalid messages for unknown tasks. 
Dropping... ");
> +private void transferLocalBatch(List tupleBatch) {
> +try {
> +for (int i = 0; i < tupleBatch.size(); i++) {
>
> Does foreach have significant perf issue?
> --
>
> In storm-client/src/jvm/org/apache/storm/daemon/Acker.java
> :
>
> > @@ -66,6 +67,7 @@ public void prepare(Map topoConf, 
TopologyContext context, Outpu
>
>  @Override
>  public void execute(Tuple input) {
> +long start = System.currentTimeMillis();
>
> nit: start is never used.
> --
>
> In storm-client/src/jvm/org/apache/storm/daemon/GrouperFactory.java
> :
>
> > @@ -137,7 +137,7 @@ public void prepare(WorkerTopologyContext context, 
GlobalStreamId stream, List