[jira] [Created] (STORM-1892) class org.apache.storm.hdfs.spout.TextFileReader should be public
Roshan Naik created STORM-1892: -- Summary: class org.apache.storm.hdfs.spout.TextFileReader should be public Key: STORM-1892 URL: https://issues.apache.org/jira/browse/STORM-1892 Project: Apache Storm Issue Type: Bug Affects Versions: 1.0.1 Reporter: Roshan Naik Assignee: Roshan Naik -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1474) Address remaining minor review comments for STORM-1199
[ https://issues.apache.org/jira/browse/STORM-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1474: --- Issue Type: Sub-task (was: Bug) Parent: STORM-1199 > Address remaining minor review comments for STORM-1199 > -- > > Key: STORM-1474 > URL: https://issues.apache.org/jira/browse/STORM-1474 > Project: Apache Storm > Issue Type: Sub-task > Components: storm-hdfs > Reporter: Roshan Naik > Assignee: Roshan Naik >Priority: Minor > > Address the last few pending review comments from > https://github.com/apache/storm/pull/936 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-1474) Address remaining minor review comments for STORM-1199
Roshan Naik created STORM-1474: -- Summary: Address remaining minor review comments for STORM-1199 Key: STORM-1474 URL: https://issues.apache.org/jira/browse/STORM-1474 Project: Apache Storm Issue Type: Bug Components: storm-hdfs Reporter: Roshan Naik Assignee: Roshan Naik Priority: Minor Address the last few pending review comments from https://github.com/apache/storm/pull/936 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1199) Create HDFS Spout
[ https://issues.apache.org/jira/browse/STORM-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098948#comment-15098948 ] Roshan Naik commented on STORM-1199: Thanks all for your input/feedback/reviews; they were very useful. > Create HDFS Spout > - > > Key: STORM-1199 > URL: https://issues.apache.org/jira/browse/STORM-1199 > Project: Apache Storm > Issue Type: New Feature > Reporter: Roshan Naik > Assignee: Roshan Naik > Fix For: 1.0.0 > > Attachments: HDFSSpoutforStorm v2.pdf, HDFSSpoutforStorm.pdf, > hdfs-spout.1.patch > > > Create an HDFS spout so that Storm can ingest data from files in an HDFS > directory -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1526) Improve Storm core performance
[ https://issues.apache.org/jira/browse/STORM-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139711#comment-15139711 ] Roshan Naik commented on STORM-1526: Thanks [~kabhwan] for merging to 1.x also. > Improve Storm core performance > -- > > Key: STORM-1526 > URL: https://issues.apache.org/jira/browse/STORM-1526 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > Fix For: 1.0.0 > > > Profiling a Speed of Light topology running on Storm core without ackers shows: > - Call tree info: a big part of the nextTuple() invocation is > consumed in the SpoutOutputCollector.emit() call; 20% of it goes to > reflection in the Clojure code > - Method Stats view: a lot of time is spent blocking on the > disruptor queue > The performance issue is narrowed down to this Clojure code in executor.clj: > {code} > (defn mk-custom-grouper > [^CustomStreamGrouping grouping ^WorkerTopologyContext context ^String > component-id ^String stream-id target-tasks] > (.prepare grouping context (GlobalStreamId. component-id stream-id) > target-tasks) > (if (instance? LoadAwareCustomStreamGrouping grouping) > (fn [task-id ^List values load] > (.chooseTasks grouping task-id values load)) ; <-- problematic > invocation > (fn [task-id ^List values load] > (.chooseTasks grouping task-id values)))) > {code} > *grouping* is statically typed to the base type CustomStreamGrouping. In > this run, its actual type is the derived type > LoadAwareCustomStreamGrouping. > The base type does not have a chooseTasks() method with 3 args; only the > derived type has that method. Consequently Clojure falls back to > dynamically iterating over the methods of the *grouping* object to locate > the right method and then invoke it. This falls in the > critical path of SpoutOutputCollector.emit(), where it spends about 20% of > its time just locating the right method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (STORM-1526) Improve Storm core performance
[ https://issues.apache.org/jira/browse/STORM-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik reassigned STORM-1526: -- Assignee: Roshan Naik > Improve Storm core performance > -- > > Key: STORM-1526 > URL: https://issues.apache.org/jira/browse/STORM-1526 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1526) Improve Storm core performance
[ https://issues.apache.org/jira/browse/STORM-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134640#comment-15134640 ] Roshan Naik commented on STORM-1526: [~dossett] I updated the PR title. > Improve Storm core performance > -- > > Key: STORM-1526 > URL: https://issues.apache.org/jira/browse/STORM-1526 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-1526) Improve Storm core performance
Roshan Naik created STORM-1526: -- Summary: Improve Storm core performance Key: STORM-1526 URL: https://issues.apache.org/jira/browse/STORM-1526 Project: Apache Storm Issue Type: Bug Reporter: Roshan Naik Profiling a Speed of Light topology running on Storm core without ackers shows: - Call tree info: a big part of the nextTuple() invocation is consumed in the SpoutOutputCollector.emit() call; 20% of it goes to reflection in the Clojure code - Method Stats view: a lot of time is spent blocking on the disruptor queue The performance issue is narrowed down to this Clojure code in executor.clj: {code} (defn mk-custom-grouper [^CustomStreamGrouping grouping ^WorkerTopologyContext context ^String component-id ^String stream-id target-tasks] (.prepare grouping context (GlobalStreamId. component-id stream-id) target-tasks) (if (instance? LoadAwareCustomStreamGrouping grouping) (fn [task-id ^List values load] (.chooseTasks grouping task-id values load)) ; <-- problematic invocation (fn [task-id ^List values load] (.chooseTasks grouping task-id values)))) {code} *grouping* is statically typed to the base type CustomStreamGrouping. In this run, its actual type is the derived type LoadAwareCustomStreamGrouping. The base type does not have a chooseTasks() method with 3 args; only the derived type has that method. Consequently Clojure falls back to dynamically iterating over the methods of the *grouping* object to locate the right method and then invoke it. This falls in the critical path of SpoutOutputCollector.emit(), where it spends about 20% of its time just locating the right method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
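The dispatch problem described in STORM-1526 above can be sketched in Java. This is only an illustration of the idea, not Storm's actual fix or API: the interfaces below are simplified stand-ins for CustomStreamGrouping / LoadAwareCustomStreamGrouping, and the point is that resolving the runtime type once, when the grouper is built, turns every subsequent emit into a direct interface call instead of a per-call reflective method search.

```java
import java.util.Arrays;
import java.util.List;

public class GrouperDispatch {
    // Simplified stand-ins for Storm's grouping interfaces -- NOT the real
    // org.apache.storm API, just enough to show the dispatch issue.
    interface CustomStreamGrouping {
        List<Integer> chooseTasks(int taskId, List<Object> values);
    }
    interface LoadAwareCustomStreamGrouping extends CustomStreamGrouping {
        List<Integer> chooseTasks(int taskId, List<Object> values, double load);
    }
    // What mk-custom-grouper returns: a function invoked on every emit.
    interface Grouper {
        List<Integer> choose(int taskId, List<Object> values, double load);
    }

    // Resolve the runtime type ONCE when the grouper is built; each emit is
    // then a direct, statically bound call -- no per-call method search.
    static Grouper mkCustomGrouper(CustomStreamGrouping g) {
        if (g instanceof LoadAwareCustomStreamGrouping) {
            LoadAwareCustomStreamGrouping la = (LoadAwareCustomStreamGrouping) g;
            return (taskId, values, load) -> la.chooseTasks(taskId, values, load);
        }
        return (taskId, values, load) -> g.chooseTasks(taskId, values);
    }

    static List<Integer> demo() {
        // A load-aware grouping whose 3-arg overload tags the task id,
        // so we can see which overload was dispatched.
        LoadAwareCustomStreamGrouping loadAware = new LoadAwareCustomStreamGrouping() {
            public List<Integer> chooseTasks(int t, List<Object> v) { return Arrays.asList(t); }
            public List<Integer> chooseTasks(int t, List<Object> v, double load) { return Arrays.asList(t + 100); }
        };
        return mkCustomGrouper(loadAware).choose(1, Arrays.asList((Object) "x"), 0.5);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [101]: the 3-arg overload was chosen
    }
}
```

Clojure reaches the same behavior via reflection when the static type hint does not expose the 3-arg overload; doing the `instanceof` check once up front is what keeps it off the emit path.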
[jira] [Commented] (STORM-1539) Improve Storm ACK-ing performance
[ https://issues.apache.org/jira/browse/STORM-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15143663#comment-15143663 ] Roshan Naik commented on STORM-1539: The attached profiler info, covering the time taken by ~45k invocations each of Spout.nextTuple() and Bolt.execute(), suggests a perf boost of: *Spout.nextTuple():* 6953ms -> 5396ms = *~30%* improvement *Bolt.execute():* 5313ms -> 3687ms = *~44%* improvement > Improve Storm ACK-ing performance > - > > Key: STORM-1539 > URL: https://issues.apache.org/jira/browse/STORM-1539 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > Attachments: after.png, before.png > > > Profiling a simple speed-of-light topology shows that a good chunk of > the SpoutOutputCollector.emit() time is spent in the Clojure reduce() > function, which is part of the ACK-ing logic. > Re-implementing this reduce() logic in Java gives a big performance boost in > both Spout.nextTuple() and Bolt.execute() -- This message was sent by Atlassian JIRA (v6.3.4#6332)
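As a sanity check on the percentages quoted above: they are consistent with the improvement measured relative to the improved (after) time, i.e. (before - after) / after. A quick arithmetic sketch:

```java
public class SpeedupCheck {
    // Improvement relative to the after time: (before - after) / after.
    static double speedupPct(double beforeMs, double afterMs) {
        return (beforeMs - afterMs) / afterMs * 100.0;
    }

    public static void main(String[] args) {
        // Spout.nextTuple(): 6953ms -> 5396ms
        System.out.printf("nextTuple: %.1f%%%n", speedupPct(6953, 5396)); // ~28.9, i.e. ~30%
        // Bolt.execute(): 5313ms -> 3687ms
        System.out.printf("execute:   %.1f%%%n", speedupPct(5313, 3687)); // ~44.1, i.e. ~44%
    }
}
```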
[jira] [Updated] (STORM-1539) Improve Storm ACK-ing performance
[ https://issues.apache.org/jira/browse/STORM-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1539: --- Attachment: after.png before.png Attaching before/after profiler screenshots > Improve Storm ACK-ing performance > - > > Key: STORM-1539 > URL: https://issues.apache.org/jira/browse/STORM-1539 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > Attachments: after.png, before.png > > > Profiling a simple speed-of-light topology shows that a good chunk of > the SpoutOutputCollector.emit() time is spent in the Clojure reduce() > function, which is part of the ACK-ing logic. > Re-implementing this reduce() logic in Java gives a big performance boost in > both Spout.nextTuple() and Bolt.execute() -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1632) Disable event logging by default
[ https://issues.apache.org/jira/browse/STORM-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1632: --- Priority: Blocker (was: Major) > Disable event logging by default > > > Key: STORM-1632 > URL: https://issues.apache.org/jira/browse/STORM-1632 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: Roshan Naik > Assignee: Roshan Naik > Priority: Blocker > Fix For: 1.0.0 > > > Event logging has a performance penalty. For a simple speed-of-light topology > with a single instance of a spout and a bolt, disabling event logging > shows a 7% to 9% perf improvement (with acker count = 1). > Event logging can be enabled when there is a need to debug, but should be turned off > by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1580) Secure hdfs spout failed
[ https://issues.apache.org/jira/browse/STORM-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1580: --- Attachment: HdfsSpoutTopology.java Sorry for the delayed update... kept getting pulled into other urgent things, and it took some time to set up a kerberized cluster. *Update:* I modified HdfsSpoutTopology.java (from examples/storm-starter) for Kerberos and tried it on a secure cluster. It worked fine. I am attaching the modified Java file. Your error likely indicates an issue on the Kerberos setup side. Try these: - kinit with the same keytab and principal on that host and verify it's OK by running some hadoop fs -ls commands - Ensure hdfs-site.xml and core-site.xml from the kerberized cluster are packaged as resources in your topology. A quick way to do this is to copy them into storm/lib and restart the supervisor. > Secure hdfs spout failed > > > Key: STORM-1580 > URL: https://issues.apache.org/jira/browse/STORM-1580 > Project: Apache Storm > Issue Type: Bug > Components: storm-hdfs > Reporter: guoht > Labels: security > Attachments: HdfsSpoutTopology.java > > > Some error occurred when using the secure hdfs spout: > "Login successful for user t...@example.com using keytab file > /home/test/test.keytab > 2016-02-26 10:33:14 o.a.h.i.Client [WARN] Exception encountered while > connecting to the server : javax.security.sasl.SaslException: GSS initiate > failed [Caused by GSSException: No valid credentials provided (Mechanism > level: Failed to find any Kerberos tgt)] > 2016-02-26 10:33:14 o.a.h.i.Client [WARN] Exception encountered while > connecting to the server : javax.security.sasl.SaslException: GSS initiate > failed [Caused by GSSException: No valid credentials provided (Mechanism > level: Failed to find any Kerberos tgt)] > 2016-02-26 10:33:14 o.a.h.i.r.RetryInvocationHandler [INFO] Exception while > invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over > hnn025/192.168.137.2:8020 after 1 fail over > attempts. Trying to fail over > immediately. > java.io.IOException: Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)]; Host Details : local host is: "HDD021/192.168.137.6"; > destination host is: "hnn025":8020;" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1643) Performance Fix: Optimize clojure lookups related to throttling and stats
[ https://issues.apache.org/jira/browse/STORM-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1643: --- Summary: Performance Fix: Optimize clojure lookups related to throttling and stats (was: Performance Fix: Optimize clojure lookups related throttling and stats) > Performance Fix: Optimize clojure lookups related to throttling and stats > - > > Key: STORM-1643 > URL: https://issues.apache.org/jira/browse/STORM-1643 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-1643) Performance Fix: Optimize clojure lookups related throttling and stats
Roshan Naik created STORM-1643: -- Summary: Performance Fix: Optimize clojure lookups related throttling and stats Key: STORM-1643 URL: https://issues.apache.org/jira/browse/STORM-1643 Project: Apache Storm Issue Type: Bug Reporter: Roshan Naik Assignee: Roshan Naik -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1580) Secure hdfs spout failed
[ https://issues.apache.org/jira/browse/STORM-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209389#comment-15209389 ] Roshan Naik commented on STORM-1580: [~ght] FYI, I am beginning to take a look at this. > Secure hdfs spout failed > > > Key: STORM-1580 > URL: https://issues.apache.org/jira/browse/STORM-1580 > Project: Apache Storm > Issue Type: Bug > Components: storm-hdfs > Reporter: guoht > Labels: security > > Some error occurred when using the secure hdfs spout: > "Login successful for user t...@example.com using keytab file > /home/test/test.keytab > 2016-02-26 10:33:14 o.a.h.i.Client [WARN] Exception encountered while > connecting to the server : javax.security.sasl.SaslException: GSS initiate > failed [Caused by GSSException: No valid credentials provided (Mechanism > level: Failed to find any Kerberos tgt)] > 2016-02-26 10:33:14 o.a.h.i.Client [WARN] Exception encountered while > connecting to the server : javax.security.sasl.SaslException: GSS initiate > failed [Caused by GSSException: No valid credentials provided (Mechanism > level: Failed to find any Kerberos tgt)] > 2016-02-26 10:33:14 o.a.h.i.r.RetryInvocationHandler [INFO] Exception while > invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over > hnn025/192.168.137.2:8020 after 1 fail over attempts. Trying to fail over > immediately. > java.io.IOException: Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)]; Host Details : local host is: "HDD021/192.168.137.6"; > destination host is: "hnn025":8020;" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1643) Performance Fix: Optimize clojure lookups related to throttling and stats
[ https://issues.apache.org/jira/browse/STORM-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204568#comment-15204568 ] Roshan Naik commented on STORM-1643: :key lookups in Clojure are expensive, and some keys like :storm-conf are looked up multiple times. Looking them up once and reusing the result improves performance. Will attach profiler screenshots showing the before and after. > Performance Fix: Optimize clojure lookups related to throttling and stats > - > > Key: STORM-1643 > URL: https://issues.apache.org/jira/browse/STORM-1643 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
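The pattern behind the STORM-1643 fix, translated into Java purely for illustration (the map layout and the "topology.workers" key here are hypothetical stand-ins for the executor data and :storm-conf): hoist a repeated lookup out of the hot path and reuse the result.

```java
import java.util.HashMap;
import java.util.Map;

public class HoistLookup {
    // The pattern being optimized away: the same key fetched on every iteration.
    static long before(Map<String, Integer> conf, int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += conf.get("topology.workers"); // repeated lookup in the hot path
        }
        return acc;
    }

    // The fix: look the value up once, then reuse the local.
    static long after(Map<String, Integer> conf, int iterations) {
        int workers = conf.get("topology.workers"); // single lookup
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += workers;
        }
        return acc;
    }

    public static void main(String[] args) {
        Map<String, Integer> conf = new HashMap<>();
        conf.put("topology.workers", 4);
        // Same result either way; the hoisted version just does less work per call.
        System.out.println(before(conf, 1000) == after(conf, 1000)); // true
    }
}
```

In Clojure the cost per lookup is higher than a plain HashMap get (keyword lookup plus the surrounding dynamic call machinery), which is why this shows up in the profiler at all.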
[jira] [Commented] (STORM-1643) Performance Fix: Optimize clojure lookups related to throttling and stats
[ https://issues.apache.org/jira/browse/STORM-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204576#comment-15204576 ] Roshan Naik commented on STORM-1643: [~revans2] This is for 1.x. W.r.t. 2.x, the fixes are small enough that we could optionally choose not to wait for the Java rewrite. > Performance Fix: Optimize clojure lookups related to throttling and stats > - > > Key: STORM-1643 > URL: https://issues.apache.org/jira/browse/STORM-1643 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (STORM-1643) Performance Fix: Optimize clojure lookups related to throttling and stats
[ https://issues.apache.org/jira/browse/STORM-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1643: --- Comment: was deleted (was: :key lookups in Clojure are expensive, and some keys like :storm-conf are looked up multiple times. Looking them up once and reusing the result improves performance. Will attach profiler screenshots showing the before and after.) > Performance Fix: Optimize clojure lookups related to throttling and stats > - > > Key: STORM-1643 > URL: https://issues.apache.org/jira/browse/STORM-1643 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > Attachments: after.png, before.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1643) Performance Fix: Optimize clojure lookups related to throttling and stats
[ https://issues.apache.org/jira/browse/STORM-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1643: --- Attachment: after.png before.png > Performance Fix: Optimize clojure lookups related to throttling and stats > - > > Key: STORM-1643 > URL: https://issues.apache.org/jira/browse/STORM-1643 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > Attachments: after.png, before.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1643) Performance Fix: Optimize clojure lookups related to throttling and stats
[ https://issues.apache.org/jira/browse/STORM-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1643: --- Description: :key lookups in Clojure are expensive, and some keys like :storm-conf are looked up multiple times. Looking them up once and reusing the result improves performance. Will attach profiler screenshots showing the before and after. > Performance Fix: Optimize clojure lookups related to throttling and stats > - > > Key: STORM-1643 > URL: https://issues.apache.org/jira/browse/STORM-1643 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > Attachments: after.png, before.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1643) Performance Fix: Optimize clojure lookups related to throttling and stats
[ https://issues.apache.org/jira/browse/STORM-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1643: --- Attachment: before2.png after2.png Attaching another pair of before2/after2 profiler screenshots highlighting another node in the call tree where the perf difference is observed > Performance Fix: Optimize clojure lookups related to throttling and stats > - > > Key: STORM-1643 > URL: https://issues.apache.org/jira/browse/STORM-1643 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > Attachments: after.png, after2.png, before.png, before2.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1632) Disable event logging by default
[ https://issues.apache.org/jira/browse/STORM-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1632: --- Description: Event logging has a performance penalty. For a simple speed-of-light topology with a single instance of a spout and a bolt, disabling event logging delivers a 7% to 9% perf improvement (with acker count = 1). Event logging can be enabled when there is a need to debug, but should be turned off by default. **Update:** with acker=0 the observed impact was much higher... **30%** faster when event loggers = 0 was: Event logging has a performance penalty. For a simple speed-of-light topology with a single instance of a spout and a bolt, disabling event logging delivers a 7% to 9% perf improvement (with acker count = 1). Event logging can be enabled when there is a need to debug, but should be turned off by default. **Update:** with acker=0 the observed impact was much higher... **30%** faster with event loggers = 0 > Disable event logging by default > > > Key: STORM-1632 > URL: https://issues.apache.org/jira/browse/STORM-1632 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: Roshan Naik > Assignee: Roshan Naik > Priority: Blocker > Fix For: 1.0.0 > > > Event logging has a performance penalty. For a simple speed-of-light topology > with a single instance of a spout and a bolt, disabling event logging > delivers a 7% to 9% perf improvement (with acker count = 1). > Event logging can be enabled when there is a need to debug, but should be turned off > by default. > **Update:** with acker=0 the observed impact was much higher... **30%** > faster when event loggers = 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1632) Disable event logging by default
[ https://issues.apache.org/jira/browse/STORM-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1632: --- Description: Event logging has a performance penalty. For a simple speed-of-light topology with a single instance of a spout and a bolt, disabling event logging delivers a 7% to 9% perf improvement (with acker count = 1). Event logging can be enabled when there is a need to debug, but should be turned off by default. **Update:** with acker=0 the observed impact was much higher... **30%** faster with event loggers = 0 was: Event logging has a performance penalty. For a simple speed-of-light topology with a single instance of a spout and a bolt, disabling event logging delivers a 7% to 9% perf improvement (with acker count = 1). Event logging can be enabled when there is a need to debug, but should be turned off by default. > Disable event logging by default > > > Key: STORM-1632 > URL: https://issues.apache.org/jira/browse/STORM-1632 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: Roshan Naik > Assignee: Roshan Naik > Priority: Blocker > Fix For: 1.0.0 > > > Event logging has a performance penalty. For a simple speed-of-light topology > with a single instance of a spout and a bolt, disabling event logging > delivers a 7% to 9% perf improvement (with acker count = 1). > Event logging can be enabled when there is a need to debug, but should be turned off > by default. > **Update:** with acker=0 the observed impact was much higher... **30%** > faster with event loggers = 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1632) Disable event logging by default
[ https://issues.apache.org/jira/browse/STORM-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1632: --- Description: Event logging has a performance penalty. For a simple speed-of-light topology with a single instance of a spout and a bolt, disabling event logging delivers a 7% to 9% perf improvement (with acker count = 1). Event logging can be enabled when there is a need to debug, but should be turned off by default. **Update:** with acker=0 the observed impact was much higher... **25%** faster when event loggers = 0 was: Event logging has a performance penalty. For a simple speed-of-light topology with a single instance of a spout and a bolt, disabling event logging delivers a 7% to 9% perf improvement (with acker count = 1). Event logging can be enabled when there is a need to debug, but should be turned off by default. **Update:** with acker=0 the observed impact was much higher... **30%** faster when event loggers = 0 > Disable event logging by default > > > Key: STORM-1632 > URL: https://issues.apache.org/jira/browse/STORM-1632 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: Roshan Naik > Assignee: Roshan Naik > Priority: Blocker > Fix For: 1.0.0 > > > Event logging has a performance penalty. For a simple speed-of-light topology > with a single instance of a spout and a bolt, disabling event logging > delivers a 7% to 9% perf improvement (with acker count = 1). > Event logging can be enabled when there is a need to debug, but should be turned off > by default. > **Update:** with acker=0 the observed impact was much higher... **25%** > faster when event loggers = 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1632) Disable event logging by default
[ https://issues.apache.org/jira/browse/STORM-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1632: --- Attachment: BasicTopology.java Uploading topology code to validate the perf hit. To use it: 1) Copy the Java file into examples/storm-starter 2) Rebuild the storm-starter package using mvn. 3) Run the topology as follows: storm jar /Users/rnaik/Projects/idea/storm/examples/storm-starter/target/storm-starter-1.0.0-SNAPSHOT.jar -c topology.eventlogger.executors=0 -c topology.max.spout.pending=2000 -c topology.disruptor.batch.size=1 storm.starter.BasicTopology and then again with {{topology.eventlogger.executors=1}}. I set those two additional flags as they improved performance over the defaults for this topology. I normally let it run for about 11 min and then capture the 10-min window metrics from the UI page. > Disable event logging by default > > > Key: STORM-1632 > URL: https://issues.apache.org/jira/browse/STORM-1632 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Reporter: Roshan Naik > Assignee: Roshan Naik > Priority: Blocker > Fix For: 1.0.0 > > Attachments: BasicTopology.java > > > Event logging has a performance penalty. For a simple speed-of-light topology > with a single instance of a spout and a bolt, disabling event logging > delivers a 7% to 9% perf improvement (with acker count = 1). > Event logging can be enabled when there is a need to debug, but should be turned off > by default. > **Update:** with acker=0 the observed impact was much higher... **25%** > faster when event loggers = 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1772) Create topologies for measuring performance
[ https://issues.apache.org/jira/browse/STORM-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279190#comment-15279190 ] Roshan Naik commented on STORM-1772: Hi [~mauzhang], Yes thats it. I first observed that perf difference issue when working on STORM-1632, but was not able to get to the bottom of it. The storm native topology mentioned here : https://github.com/apache/storm/pull/1217#issuecomment-201074919 I can try to locate the benchmark-specific version of the topology but its a straightforward rewrite. The storm native showed a difference of ~12% when doing a A/B test (with and without the fix) The benchmark specific version of the topology .. it was 25% as noted in the description of STORM-1632. IMO.. briefly ignoring the perf diff issue, it would be good to go ahead and see what we can incorporate from that benchmark . In this jira my goal is to add a few topologies for perf testing... not to create a benchmarking tool/framework itself. In that sense its not conflicting with STORM-642. *side note:* If we are adding a benchmarking framework, it would be good if it can run standard Storm topologies directly and not require topologies to be written specifically for it. > Create topologies for measuring performance > --- > > Key: STORM-1772 > URL: https://issues.apache.org/jira/browse/STORM-1772 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > > Would be very useful to have some simple reference topologies included with > Storm that can be used to measure performance both by devs during development > (to start with) and perhaps also on a real storm cluster (subsequently). > To start with, the goal is to put the focus on the performance > characteristics of individual building blocks such as specifics bolts, > spouts, grouping options, queues, etc. So, initially biased towards > micro-benchmarking but subsequently we could add higher level ones too. 
> Although there is a storm benchmarking tool (originally written by Intel?) > that can be used, and I have personally used it, it's better for this to be > integrated into Storm proper and also maintained by devs as Storm evolves. > On a side note, in some instances I have noticed (to my surprise) that the > perf numbers change when the topologies written for the Intel benchmark are > rewritten without the required wrappers so that they run directly under > Storm. > Have a few topologies in mind for measuring each of these: > # *Queuing and Spout Emit Performance:* A topology with a Generator Spout but > no bolts. > # *Queuing & Grouping performance:* Generator Spout -> A grouping method -> > DevNull Bolt > # *Hdfs Bolt:* Generator Spout -> Hdfs Bolt > # *Hdfs Spout:* Hdfs Spout -> DevNull Bolt > # *Kafka Spout:* Kafka Spout -> DevNull Bolt > # *Simple Data Movement:* Kafka Spout -> Hdfs Bolt > Shall add these for Storm core first. Then we can have the same for Trident > also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
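The first two proposed topologies boil down to pushing generated tuples through a queue into a sink that discards them. A rough sketch of the shape of such a micro-benchmark, using plain-Java stand-ins (the producer/consumer classes here are toy analogues, not Storm's actual spout/bolt classes):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Toy analogue of "Generator Spout -> DevNull Bolt": a producer thread
// emits tuples into a bounded queue and the consumer drains and discards
// them, so only the emit/queue/drain path is exercised.
public class QueueThroughputDemo {

    // Pushes n generated tuples through the queue; returns the count drained.
    static int run(int n) throws InterruptedException {
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);
        Thread spout = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) queue.put(i); // "emit" a tuple
            } catch (InterruptedException ignored) { }
        });
        spout.start();
        int drained = 0;
        while (drained < n) { queue.take(); drained++; } // "DevNull": discard
        spout.join();
        return drained;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 1_000_000;
        long t0 = System.nanoTime();
        int drained = run(n);
        double secs = (System.nanoTime() - t0) / 1e9;
        System.out.printf("%d tuples, %.0f tuples/sec%n", drained, drained / secs);
    }
}
```

Storm's real disruptor-based queues, ackers and serialization would change the absolute numbers considerably; the point is only that topologies #1/#2 isolate exactly this producer/queue/consumer path.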
[jira] [Assigned] (STORM-1772) Create topologies for measuring performance
[ https://issues.apache.org/jira/browse/STORM-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik reassigned STORM-1772: -- Assignee: Roshan Naik > Create topologies for measuring performance > --- > > Key: STORM-1772 > URL: https://issues.apache.org/jira/browse/STORM-1772 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Assignee: Roshan Naik > > Would be very useful to have some simple reference topologies included with > Storm that can be used to measure performance both by devs during development > (to start with) and perhaps also on a real storm cluster (subsequently). > To start with, the goal is to put the focus on the performance > characteristics of individual building blocks such as specific bolts, > spouts, grouping options, queues, etc. So, initially biased towards > micro-benchmarking but subsequently we could add higher-level ones too. > Although there is a storm benchmarking tool (originally written by Intel?) > that can be used, and I have personally used it, it's better for this to be > integrated into Storm proper and also maintained by devs as Storm evolves. > On a side note, in some instances I have noticed (to my surprise) that the > perf numbers change when the topologies written for the Intel benchmark are > rewritten without the required wrappers so that they run directly under > Storm. > Have a few topologies in mind for measuring each of these: > # *Queuing and Spout Emit Performance:* A topology with a Generator Spout but > no bolts. > # *Queuing & Grouping performance:* Generator Spout -> A grouping method -> > DevNull Bolt > # *Hdfs Bolt:* Generator Spout -> Hdfs Bolt > # *Hdfs Spout:* Hdfs Spout -> DevNull Bolt > # *Kafka Spout:* Kafka Spout -> DevNull Bolt > # *Simple Data Movement:* Kafka Spout -> Hdfs Bolt > Shall add these for Storm core first. Then we can have the same for Trident > also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1772) Create topologies for measuring performance
[ https://issues.apache.org/jira/browse/STORM-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1772: --- Description: Would be very useful to have some simple reference topologies included with Storm that can be used to measure performance both by devs during development (to start with) and perhaps also on a real storm cluster (subsequently). To start with, the goal is to put the focus on the performance characteristics of individual building blocks such as specific bolts, spouts, grouping options, queues, etc. So, initially biased towards micro-benchmarking but subsequently we could add higher-level ones too. Although there is a storm benchmarking tool (originally written by Intel?) that can be used, and I have personally used it, it's better for this to be integrated into Storm proper and also maintained by devs as Storm evolves. On a side note, in some instances I have noticed (to my surprise) that the perf numbers change when the topologies written for the Intel benchmark are rewritten without the required wrappers so that they run directly under Storm. Have a few topologies in mind for measuring each of these: # *Queuing and Spout Emit Performance:* A topology with a Generator Spout but no bolts. # *Queuing & Grouping performance:* Generator Spout -> A grouping method -> DevNull Bolt # *Hdfs Bolt:* Generator Spout -> Hdfs Bolt # *Hdfs Spout:* Hdfs Spout -> DevNull Bolt # *Kafka Spout:* Kafka Spout -> DevNull Bolt # *Simple Data Movement:* Kafka Spout -> Hdfs Bolt Shall add these for Storm core first. Then we can have the same for Trident also. was: Would be very useful to have some simple reference topologies included with Storm that can be used to measure performance that can be used both by devs during development (to start with) and perhaps also on a real storm cluster (subsequently).
To start with, the goal is to put the focus on the performance characteristics of individual building blocks such as specific bolts, spouts, grouping options, queues, etc. So, initially biased towards micro-benchmarking but subsequently we could add higher-level ones too. Although there is a storm benchmarking tool (originally written by Intel?) that can be used, and I have personally used it, it's better for this to be integrated into Storm proper and also maintained by devs as Storm evolves. On a side note, in some instances I have noticed (to my surprise) that the perf numbers change when the topologies written for the Intel benchmark are rewritten without the required wrappers so that they run directly under Storm. Have a few topologies in mind for measuring each of these: # *Queuing and Spout Emit Performance:* A topology with a Generator Spout but no bolts. # *Queuing & Grouping performance:* Generator Spout -> A grouping method -> DevNull Bolt # *Hdfs Bolt:* Generator Spout -> Hdfs Bolt # *Hdfs Spout:* Hdfs Spout -> DevNull Bolt # *Kafka Spout:* Kafka Spout -> DevNull Bolt # *Simple Data Movement:* Kafka Spout -> Hdfs Bolt Shall add these for Storm core first. Then we can have the same for Trident also. > Create topologies for measuring performance > --- > > Key: STORM-1772 > URL: https://issues.apache.org/jira/browse/STORM-1772 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > > Would be very useful to have some simple reference topologies included with > Storm that can be used to measure performance both by devs during development > (to start with) and perhaps also on a real storm cluster (subsequently). > To start with, the goal is to put the focus on the performance > characteristics of individual building blocks such as specific bolts, > spouts, grouping options, queues, etc. So, initially biased towards > micro-benchmarking but subsequently we could add higher-level ones too.
> Although there is a storm benchmarking tool (originally written by Intel?) > that can be used, and I have personally used it, it's better for this to be > integrated into Storm proper and also maintained by devs as Storm evolves. > On a side note, in some instances I have noticed (to my surprise) that the > perf numbers change when the topologies written for the Intel benchmark are > rewritten without the required wrappers so that they run directly under > Storm. > Have a few topologies in mind for measuring each of these: > # *Queuing and Spout Emit Performance:* A topology with a Generator Spout but > no bolts. > # *Queuing & Grouping performance:* Generator Spout -> A grouping method -> > DevNull Bolt > # *Hdfs Bolt:* Generator Spout -> Hdfs Bolt > # *Hdfs Spout:* Hdfs Spout -> DevNu
[jira] [Created] (STORM-1772) Create topologies for measuring performance
Roshan Naik created STORM-1772: -- Summary: Create topologies for measuring performance Key: STORM-1772 URL: https://issues.apache.org/jira/browse/STORM-1772 Project: Apache Storm Issue Type: Bug Reporter: Roshan Naik Would be very useful to have some simple reference topologies included with Storm that can be used to measure performance, both by devs during development (to start with) and perhaps also on a real storm cluster (subsequently). To start with, the goal is to put the focus on the performance characteristics of individual building blocks such as specific bolts, spouts, grouping options, queues, etc. So, initially biased towards micro-benchmarking but subsequently we could add higher-level ones too. Although there is a storm benchmarking tool (originally written by Intel?) that can be used, and I have personally used it, it's better for this to be integrated into Storm proper and also maintained by devs as Storm evolves. On a side note, in some instances I have noticed (to my surprise) that the perf numbers change when the topologies written for the Intel benchmark are rewritten without the required wrappers so that they run directly under Storm. Have a few topologies in mind for measuring each of these: # *Queuing and Spout Emit Performance:* A topology with a Generator Spout but no bolts. # *Queuing & Grouping performance:* Generator Spout -> A grouping method -> DevNull Bolt # *Hdfs Bolt:* Generator Spout -> Hdfs Bolt # *Hdfs Spout:* Hdfs Spout -> DevNull Bolt # *Kafka Spout:* Kafka Spout -> DevNull Bolt # *Simple Data Movement:* Kafka Spout -> Hdfs Bolt Shall add these for Storm core first. Then we can have the same for Trident also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1910) One topology can't use hdfs spout to read from two locations
[ https://issues.apache.org/jira/browse/STORM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378435#comment-15378435 ] Roshan Naik commented on STORM-1910: [~ptgoetz] should this be marked for 1.0.2 as well? > One topology can't use hdfs spout to read from two locations > > > Key: STORM-1910 > URL: https://issues.apache.org/jira/browse/STORM-1910 > Project: Apache Storm > Issue Type: Bug > Components: storm-hdfs > Affects Versions: 1.0.1 > Reporter: Raghav Kumar Gautam > Assignee: Roshan Naik > Fix For: 2.0.0, 1.1.0 > > > The hdfs uri is passed using config: > {code} > conf.put(Configs.HDFS_URI, hdfsUri); > {code} > I see two problems with this approach: > 1. If someone wants to use two hdfsUri in the same or different spouts - then > that does not seem feasible. > https://github.com/apache/storm/blob/d17b3b9c3cbc89d854bfb436d213d11cfd4545ec/examples/storm-starter/src/jvm/storm/starter/HdfsSpoutTopology.java#L117-L117 > https://github.com/apache/storm/blob/d17b3b9c3cbc89d854bfb436d213d11cfd4545ec/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/spout/HdfsSpout.java#L331-L331 > {code} > if ( !conf.containsKey(Configs.SOURCE_DIR) ) { > LOG.error(Configs.SOURCE_DIR + " setting is required"); > throw new RuntimeException(Configs.SOURCE_DIR + " setting is required"); > } > this.sourceDirPath = new Path( conf.get(Configs.SOURCE_DIR).toString() ); > {code} > 2. It does not fail fast, i.e. at the time of topology submission. We can fail > fast if the hdfs path is invalid or credentials/permissions are not ok. Such > errors at this time can only be detected at runtime by looking at the worker > logs. > https://github.com/apache/storm/blob/d17b3b9c3cbc89d854bfb436d213d11cfd4545ec/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/spout/HdfsSpout.java#L297-L297 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
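The root of problem #1 above is that HDFS_URI lives in the topology-level Config map, which every spout instance shares. One conventional direction for a fix, sketched here with purely illustrative names (this is NOT the actual storm-hdfs API), is to move such settings onto the spout instance itself so each spout carries its own URI and source directory:

```java
// Hypothetical sketch only: per-instance builder-style setters instead of a
// shared topology-level config map, so two spout instances in one topology
// can point at two different HDFS locations.
public class HdfsSpoutSketch {
    private String hdfsUri;
    private String sourceDir;

    public HdfsSpoutSketch withUri(String uri) { this.hdfsUri = uri; return this; }
    public HdfsSpoutSketch withSourceDir(String dir) { this.sourceDir = dir; return this; }

    public String getUri() { return hdfsUri; }
    public String getSourceDir() { return sourceDir; }

    public static void main(String[] args) {
        // Two spouts for one topology, each with its own URI and directory;
        // impossible when both read the same Configs.HDFS_URI key.
        HdfsSpoutSketch spoutA = new HdfsSpoutSketch()
                .withUri("hdfs://nn1:8020").withSourceDir("/data/in1");
        HdfsSpoutSketch spoutB = new HdfsSpoutSketch()
                .withUri("hdfs://nn2:8020").withSourceDir("/data/in2");
        System.out.println(spoutA.getUri() + " / " + spoutB.getUri());
    }
}
```

Per-instance setters also allow validating the settings at construction time on the client, which partially addresses the fail-fast concern in problem #2 (though, as noted in a later comment, the HDFS path itself may not be checkable from the submitting host).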
[jira] [Created] (STORM-1949) Storm backpressure can cause spout to stop emitting and stall topology
Roshan Naik created STORM-1949: -- Summary: Storm backpressure can cause spout to stop emitting and stall topology Key: STORM-1949 URL: https://issues.apache.org/jira/browse/STORM-1949 Project: Apache Storm Issue Type: Bug Reporter: Roshan Naik Problem can be reproduced by this [Word count topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] within an IDE. I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt instances. The problem is more easily reproduced with the WC topology as it causes an explosion of tuples due to splitting a sentence tuple into word tuples. As the bolts have to process more tuples than the spout is producing, the spout needs to operate slower. The amount of time it takes for the topology to stall can vary, but is typically under 10 mins. *My theory:* I suspect there is a race condition in the way ZK is being utilized to enable/disable back pressure. When congested (i.e. pressure exceeds the high water mark), the bolt's worker records this congested situation in ZK by creating a node. Once the congestion is reduced below the low water mark, it deletes this node. The spout's worker has set up a watch on the parent node, expecting a callback whenever there is a change in the child nodes. On receiving the callback, the spout's worker lists the parent node to check if there are 0 or more child nodes; it is essentially trying to figure out the nature of the state change in ZK to determine whether to throttle or not. Subsequently it sets up another watch in ZK to keep an eye on future changes. When there are multiple bolts, there can be rapid creation/deletion of these ZK nodes. Between the time the worker receives a callback and sets up the next watch, many changes may have occurred in ZK, which will go unnoticed by the spout. The condition that the bolts are no longer congested may not get noticed as a result.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
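The theory above hinges on ZK watches being one-shot: a watch fires once and must be re-registered, so node changes that land between the callback and the re-registration are never delivered. A minimal single-threaded model of that gap (no ZooKeeper involved, just the notification semantics):

```java
// Models ZooKeeper's one-shot watch semantics: each change notifies the
// observer only if a watch is currently armed, and firing consumes the watch.
// Changes made before the observer re-arms are silently lost.
public class OneShotWatchDemo {
    private boolean watchArmed = true;
    private int observedChanges = 0;

    // A bolt worker creating/deleting its throttle node.
    void change() {
        if (watchArmed) {
            watchArmed = false;   // one-shot: the watch is consumed
            observedChanges++;    // the spout worker's callback runs
        }
        // no armed watch -> the change goes unnoticed
    }

    // The spout worker setting up its next watch after handling a callback.
    void rearm() { watchArmed = true; }

    // Four changes with one late re-arm: only two are ever observed.
    static int simulate() {
        OneShotWatchDemo zk = new OneShotWatchDemo();
        zk.change();  // observed (watch armed)
        zk.change();  // lost: happens before the spout re-arms
        zk.change();  // lost
        zk.rearm();
        zk.change();  // observed
        return zk.observedChanges;
    }

    public static void main(String[] args) {
        System.out.println(simulate() + " of 4 changes observed");
    }
}
```

If one of the lost changes is the deletion of the last remaining throttle node, the spout never learns that congestion has cleared, which matches the permanent stall described in the report.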
[jira] [Updated] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1949: --- Summary: Backpressure can cause spout to stop emitting and stall topology (was: Storm backpressure can cause spout to stop emitting and stall topology) > Backpressure can cause spout to stop emitting and stall topology > > > Key: STORM-1949 > URL: https://issues.apache.org/jira/browse/STORM-1949 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > > Problem can be reproduced by this [Word count > topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] > within an IDE. > I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt > instances. > The problem is more easily reproduced with the WC topology as it causes an > explosion of tuples due to splitting a sentence tuple into word tuples. As > the bolts have to process more tuples than the spout is producing, the spout > needs to operate slower. > The amount of time it takes for the topology to stall can vary, but is > typically under 10 mins. > *My theory:* I suspect there is a race condition in the way ZK is being > utilized to enable/disable back pressure. When congested (i.e. pressure > exceeds the high water mark), the bolt's worker records this congested situation > in ZK by creating a node. Once the congestion is reduced below the low water > mark, it deletes this node. > The spout's worker has set up a watch on the parent node, expecting a callback > whenever there is a change in the child nodes. On receiving the callback, the > spout's worker lists the parent node to check if there are 0 or more child > nodes; it is essentially trying to figure out the nature of the state change > in ZK to determine whether to throttle or not. Subsequently it sets up > another watch in ZK to keep an eye on future changes.
> When there are multiple bolts, there can be rapid creation/deletion of these > ZK nodes. Between the time the worker receives a callback and sets up the > next watch, many changes may have occurred in ZK, which will go unnoticed by > the spout. > The condition that the bolts are no longer congested may not get noticed as a > result. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-1956) Disable Backpressure by default
Roshan Naik created STORM-1956: -- Summary: Disable Backpressure by default Key: STORM-1956 URL: https://issues.apache.org/jira/browse/STORM-1956 Project: Apache Storm Issue Type: Bug Components: storm-core Affects Versions: 1.0.0, 1.0.1 Reporter: Roshan Naik Assignee: Roshan Naik Some of the context on this is captured in STORM-1949. In short: wait for the BP mechanism to mature some more and be production-ready before we enable it by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368451#comment-15368451 ] Roshan Naik edited comment on STORM-1949 at 7/8/16 8:53 PM: [~revans2] Not sure what you mean by "back on the write" .. are you saying we should have a background thread that simply polls ZK every so often? That might fix this issue. However, there is one basic issue with this BP mechanism in general: it can put too much load on ZK. For each enable/disable throttle signal raised by any worker we have all this interaction going on with ZK: - Some worker adds/deletes a ZK node - ZK issues callbacks to all workers with watches set up - All those workers list the parent node in ZK to count the number of children (expensive?) - All those workers set up another watch in ZK Given that PaceMaker was introduced to take load off of ZK, this approach feels like a regression in terms of ability to scale. There are some other issues as well, but that's for later. After reviewing BP, I don't feel it is sufficiently mature to be considered stable and ready for production. IMO, until we have a more solid BP mechanism, we should disable it by default as soon as possible. I can open another jira for that. was (Author: roshan_naik): [~revans2] Not sure what you mean by "back on the write" .. are you saying we should have a background thread that simply polls ZK every so often? That might fix this issue. However, there is one basic issue with this BP mechanism in general: it can put too much load on ZK. For each enable/disable throttle signal raised by any worker we have all this interaction going on with ZK: - Some worker adds/deletes a ZK node - ZK issues callbacks to all workers with watches set up - All those workers list the parent node in ZK to count the number of children (expensive?) - All those workers set up another watch in ZK Given that PaceMaker was introduced to take load off of ZK, this approach feels like a regression.
There are some other issues as well, but that's for later. After reviewing BP, I feel it is not mature enough to be considered stable and ready for production. IMO, until we have a more solid BP mechanism, we should disable it by default as soon as possible. I can open another jira for that. > Backpressure can cause spout to stop emitting and stall topology > > > Key: STORM-1949 > URL: https://issues.apache.org/jira/browse/STORM-1949 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > > Problem can be reproduced by this [Word count > topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] > within an IDE. > I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt > instances. > The problem is more easily reproduced with the WC topology as it causes an > explosion of tuples due to splitting a sentence tuple into word tuples. As > the bolts have to process more tuples than the spout is producing, the spout > needs to operate slower. > The amount of time it takes for the topology to stall can vary, but is > typically under 10 mins. > *My theory:* I suspect there is a race condition in the way ZK is being > utilized to enable/disable back pressure. When congested (i.e. pressure > exceeds the high water mark), the bolt's worker records this congested situation > in ZK by creating a node. Once the congestion is reduced below the low water > mark, it deletes this node. > The spout's worker has set up a watch on the parent node, expecting a callback > whenever there is a change in the child nodes. On receiving the callback, the > spout's worker lists the parent node to check if there are 0 or more child > nodes; it is essentially trying to figure out the nature of the state change > in ZK to determine whether to throttle or not. Subsequently it sets up > another watch in ZK to keep an eye on future changes.
> When there are multiple bolts, there can be rapid creation/deletion of these > ZK nodes. Between the time the worker receives a callback and sets up the > next watch, many changes may have occurred in ZK, which will go unnoticed by > the spout. > The condition that the bolts are no longer congested may not get noticed as a > result. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368451#comment-15368451 ] Roshan Naik edited comment on STORM-1949 at 7/8/16 8:50 PM: [~revans2] Not sure what you mean by "back on the write" .. are you saying we should have a background thread that simply polls ZK every so often? That might fix this issue. However, there is one basic issue with this BP mechanism in general: it can put too much load on ZK. For each enable/disable throttle signal raised by any worker we have all this interaction going on with ZK: - Some worker adds/deletes a ZK node - ZK issues callbacks to all workers with watches set up - All those workers list the parent node in ZK to count the number of children (expensive?) - All those workers set up another watch in ZK Given that PaceMaker was introduced to take load off of ZK, this approach feels like a regression. There are some other issues as well, but that's for later. After reviewing BP, I feel it is not mature enough to be considered stable and ready for production. IMO, until we have a more solid BP mechanism, we should disable it by default as soon as possible. I can open another jira for that. was (Author: roshan_naik): [~revans2] Not sure what you mean by "back on the write" .. are you saying we should have a background thread that simply polls ZK every so often? That might fix this issue. However, there is one basic issue with this BP mechanism in general: it can put too much load on ZK. For each enable/disable throttle signal raised by any worker we have all this interaction going on with ZK: - Some worker adds/deletes a ZK node - ZK issues callbacks to all workers with watches set up - All those workers list the parent node in ZK to count the number of children (expensive?) - All those workers set up another watch in ZK Given that PaceMaker was introduced to take load off of ZK, this approach feels like a regression.
There are some other issues as well, but that's for a different JIRA. After reviewing BP, I feel it is not mature enough to be considered stable and ready for production. IMO, until we have a more solid BP mechanism, we should disable it by default as soon as possible. I can open another jira for that. > Backpressure can cause spout to stop emitting and stall topology > > > Key: STORM-1949 > URL: https://issues.apache.org/jira/browse/STORM-1949 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > > Problem can be reproduced by this [Word count > topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] > within an IDE. > I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt > instances. > The problem is more easily reproduced with the WC topology as it causes an > explosion of tuples due to splitting a sentence tuple into word tuples. As > the bolts have to process more tuples than the spout is producing, the spout > needs to operate slower. > The amount of time it takes for the topology to stall can vary, but is > typically under 10 mins. > *My theory:* I suspect there is a race condition in the way ZK is being > utilized to enable/disable back pressure. When congested (i.e. pressure > exceeds the high water mark), the bolt's worker records this congested situation > in ZK by creating a node. Once the congestion is reduced below the low water > mark, it deletes this node. > The spout's worker has set up a watch on the parent node, expecting a callback > whenever there is a change in the child nodes. On receiving the callback, the > spout's worker lists the parent node to check if there are 0 or more child > nodes; it is essentially trying to figure out the nature of the state change > in ZK to determine whether to throttle or not. Subsequently it sets up > another watch in ZK to keep an eye on future changes.
> When there are multiple bolts, there can be rapid creation/deletion of these > ZK nodes. Between the time the worker receives a callback and sets up the > next watch, many changes may have occurred in ZK, which will go unnoticed by > the spout. > The condition that the bolts are no longer congested may not get noticed as a > result. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368451#comment-15368451 ] Roshan Naik commented on STORM-1949: [~revans2] Not sure what you mean by "back on the write" .. are you saying we should have a background thread that simply polls ZK every so often? That might fix this issue. However, there is one basic issue with this BP mechanism in general: it can put too much load on ZK. For each enable/disable throttle signal raised by any worker we have all this interaction going on with ZK: - Some worker adds/deletes a ZK node - ZK issues callbacks to all workers with watches set up - All those workers list the parent node in ZK to count the number of children (expensive?) - All those workers set up another watch in ZK Given that PaceMaker was introduced to take load off of ZK, this approach feels like a regression. There are some other issues as well, but that's for a different JIRA. After reviewing BP, I feel it is not mature enough to be considered stable and ready for production. IMO, until we have a more solid BP mechanism, we should disable it by default as soon as possible. I can open another jira for that. > Backpressure can cause spout to stop emitting and stall topology > > > Key: STORM-1949 > URL: https://issues.apache.org/jira/browse/STORM-1949 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > > Problem can be reproduced by this [Word count > topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] > within an IDE. > I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt > instances. > The problem is more easily reproduced with the WC topology as it causes an > explosion of tuples due to splitting a sentence tuple into word tuples. As > the bolts have to process more tuples than the spout is producing, the spout > needs to operate slower.
> The amount of time it takes for the topology to stall can vary, but is > typically under 10 mins. > *My theory:* I suspect there is a race condition in the way ZK is being > utilized to enable/disable back pressure. When congested (i.e. pressure > exceeds the high water mark), the bolt's worker records this congested situation > in ZK by creating a node. Once the congestion is reduced below the low water > mark, it deletes this node. > The spout's worker has set up a watch on the parent node, expecting a callback > whenever there is a change in the child nodes. On receiving the callback, the > spout's worker lists the parent node to check if there are 0 or more child > nodes; it is essentially trying to figure out the nature of the state change > in ZK to determine whether to throttle or not. Subsequently it sets up > another watch in ZK to keep an eye on future changes. > When there are multiple bolts, there can be rapid creation/deletion of these > ZK nodes. Between the time the worker receives a callback and sets up the > next watch, many changes may have occurred in ZK, which will go unnoticed by > the spout. > The condition that the bolts are no longer congested may not get noticed as a > result. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1956) Disable Backpressure by default
[ https://issues.apache.org/jira/browse/STORM-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1956: --- Priority: Blocker (was: Major) > Disable Backpressure by default > --- > > Key: STORM-1956 > URL: https://issues.apache.org/jira/browse/STORM-1956 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Affects Versions: 1.0.0, 1.0.1 > Reporter: Roshan Naik > Assignee: Roshan Naik > Priority: Blocker > Fix For: 1.0.2 > > > Some of the context on this is captured in STORM-1949. > In short: wait for the BP mechanism to mature some more and be production-ready > before we enable it by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-1956) Disable Backpressure by default
[ https://issues.apache.org/jira/browse/STORM-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1956: --- Fix Version/s: 1.0.2 > Disable Backpressure by default > --- > > Key: STORM-1956 > URL: https://issues.apache.org/jira/browse/STORM-1956 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Affects Versions: 1.0.0, 1.0.1 > Reporter: Roshan Naik > Assignee: Roshan Naik > Priority: Blocker > Fix For: 1.0.2 > > > Some of the context on this is captured in STORM-1949. > In short: wait for the BP mechanism to mature some more and be production-ready > before we enable it by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368640#comment-15368640 ] Roshan Naik commented on STORM-1949:
Have not worked out a concrete solution for avoiding ZK as yet. But [~sriharsha]'s line of thinking is interesting ... basically, see if we can use the internal messaging system instead of messaging over ZK. Opened STORM-1956 for disabling BP by default.

> Backpressure can cause spout to stop emitting and stall topology
> ----------------------------------------------------------------
>
>          Key: STORM-1949
>          URL: https://issues.apache.org/jira/browse/STORM-1949
>      Project: Apache Storm
>   Issue Type: Bug
>     Reporter: Roshan Naik
>
>
> The problem can be reproduced with this [Word count topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] within an IDE.
> I ran it with 1 spout instance, 2 splitter bolt instances, and 2 counter bolt instances.
> The problem is more easily reproduced with the WC topology because splitting each sentence tuple into word tuples causes an explosion of tuples. Since the bolts have to process more tuples than the spout is producing, the spout needs to operate slower.
> The amount of time it takes for the topology to stall can vary, but it is typically under 10 mins.
> *My theory:* I suspect there is a race condition in the way ZK is utilized to enable/disable back pressure. When congested (i.e. pressure exceeds the high water mark), the bolt's worker records this congested situation in ZK by creating a node. Once the congestion drops below the low water mark, it deletes this node.
> The spout's worker has set up a watch on the parent node, expecting a callback whenever the child nodes change. On receiving the callback, the spout's worker lists the parent node to check whether there are 0 or more child nodes; it is essentially trying to figure out the nature of the state change in ZK to determine whether to throttle or not. It then sets up another watch in ZK to keep an eye on future changes.
> When there are multiple bolts, these ZK nodes can be created and deleted rapidly. Between the time the worker receives a callback and sets up the next watch, many changes may have occurred in ZK that go unnoticed by the spout.
> As a result, the condition that the bolts are no longer congested may never get noticed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
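The race described in the theory above hinges on ZK watches being one-shot. The following is a toy in-memory model (not real ZooKeeper, and not Storm's actual backpressure code) that demonstrates how a "congestion cleared" transition can fall into the gap between a watch firing and the next watch being registered, leaving the spout throttled forever:

```java
import java.util.ArrayList;
import java.util.List;

public class WatchGapSketch {
    interface Watcher { void onChange(); }

    // Stand-in for a ZK parent node whose children mark congested bolts.
    static class FakeZk {
        private final List<String> children = new ArrayList<>();
        private Watcher watch; // one-shot, like a real ZK watch

        void setWatch(Watcher w) { watch = w; }
        void create(String node) { children.add(node); fire(); }
        void delete(String node) { children.remove(node); fire(); }
        int childCount() { return children.size(); }

        private void fire() {
            if (watch != null) {
                Watcher w = watch;
                watch = null;      // a ZK watch is consumed once triggered
                w.onChange();
            }
        }
    }

    /** Returns the spout's final throttled flag; true means it is stuck. */
    static boolean runScenario() {
        FakeZk zk = new FakeZk();
        final boolean[] throttled = {false};

        // Spout side: the first watch notices "something changed" and throttles.
        zk.setWatch(() -> throttled[0] = true);

        zk.create("/backpressure/bolt-1"); // fires the watch -> spout throttles
        // Gap: the watch was consumed and the next one is not yet set.
        // These transitions, including "all congestion cleared", are missed:
        zk.create("/backpressure/bolt-2");
        zk.delete("/backpressure/bolt-1");
        zk.delete("/backpressure/bolt-2");

        // The spout re-registers its watch only now; with no further ZK
        // events it never re-reads the (empty) child list, so the stale
        // throttled flag is never cleared.
        zk.setWatch(() -> throttled[0] = zk.childCount() > 0);
        return throttled[0];
    }

    public static void main(String[] args) {
        System.out.println("spout stuck throttled: " + runScenario());
    }
}
```

With zero children remaining, `runScenario()` still returns `true`, which is the stall: no bolt is congested, yet the spout never resumes emitting.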
[jira] [Commented] (STORM-1910) One topology can't use hdfs spout to read from two locations
[ https://issues.apache.org/jira/browse/STORM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367096#comment-15367096 ] Roshan Naik commented on STORM-1910:
WRT Pt# 2 in the description, we cannot check for a valid HDFS path on the client side, as it cannot be assumed that HDFS is configured and available on the host from where the topology is being submitted.

> One topology can't use hdfs spout to read from two locations
> ------------------------------------------------------------
>
>              Key: STORM-1910
>              URL: https://issues.apache.org/jira/browse/STORM-1910
>          Project: Apache Storm
>       Issue Type: Bug
>       Components: storm-hdfs
> Affects Versions: 1.0.1
>         Reporter: Raghav Kumar Gautam
>         Assignee: Roshan Naik
>          Fix For: 1.1.0
>
>
> The hdfs uri is passed using config:
> {code}
> conf.put(Configs.HDFS_URI, hdfsUri);
> {code}
> I see two problems with this approach:
> 1. If someone wants to use two hdfsUris in the same or different spouts, that does not seem feasible.
> https://github.com/apache/storm/blob/d17b3b9c3cbc89d854bfb436d213d11cfd4545ec/examples/storm-starter/src/jvm/storm/starter/HdfsSpoutTopology.java#L117-L117
> https://github.com/apache/storm/blob/d17b3b9c3cbc89d854bfb436d213d11cfd4545ec/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/spout/HdfsSpout.java#L331-L331
> {code}
> if ( !conf.containsKey(Configs.SOURCE_DIR) ) {
>   LOG.error(Configs.SOURCE_DIR + " setting is required");
>   throw new RuntimeException(Configs.SOURCE_DIR + " setting is required");
> }
> this.sourceDirPath = new Path( conf.get(Configs.SOURCE_DIR).toString() );
> {code}
> 2. It does not fail fast, i.e. at the time of topology submission. We can fail fast if the hdfs path is invalid or credentials/permissions are not ok. Such errors can currently only be detected at runtime by looking at the worker logs.
> https://github.com/apache/storm/blob/d17b3b9c3cbc89d854bfb436d213d11cfd4545ec/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/spout/HdfsSpout.java#L297-L297

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
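The clash in point 1 comes from every `HdfsSpout` resolving `Configs.HDFS_URI` out of the single shared topology conf, so the last `put` wins. A standalone sketch of that collision, plus a hypothetical per-instance setter as one possible shape of a fix (the `withUri(...)` method and `SpoutStub` class here are illustrative, not the storm-hdfs API of this release):

```java
import java.util.HashMap;
import java.util.Map;

public class PerSpoutUriSketch {
    static final String HDFS_URI = "hdfs.uri"; // stands in for Configs.HDFS_URI

    // Current style: both spouts read one shared topology conf -> last put wins.
    static String sharedConfUri(String uriForSpoutA, String uriForSpoutB) {
        Map<String, Object> topoConf = new HashMap<>();
        topoConf.put(HDFS_URI, uriForSpoutA);
        topoConf.put(HDFS_URI, uriForSpoutB); // clobbers spout A's setting
        return (String) topoConf.get(HDFS_URI);
    }

    // Hypothetical per-spout style: each spout instance carries its own URI,
    // so two spouts in one topology can point at different clusters.
    static class SpoutStub {
        private String uri;
        SpoutStub withUri(String uri) { this.uri = uri; return this; }
        String uri() { return uri; }
    }

    public static void main(String[] args) {
        System.out.println("shared conf resolves to: "
                + sharedConfUri("hdfs://clusterA:8020", "hdfs://clusterB:8020"));
        SpoutStub a = new SpoutStub().withUri("hdfs://clusterA:8020");
        SpoutStub b = new SpoutStub().withUri("hdfs://clusterB:8020");
        System.out.println("per-spout: " + a.uri() + " and " + b.uri());
    }
}
```

With the shared conf, only `hdfs://clusterB:8020` survives; with per-instance state, both URIs coexist, which is the behavior the reporter is asking for.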
[jira] [Commented] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423916#comment-15423916 ] Roshan Naik commented on STORM-1949:
With BP disabled the topo ran fine. Don't think I saw any NPE during my runs.

> Backpressure can cause spout to stop emitting and stall topology
> ----------------------------------------------------------------
>
>          Key: STORM-1949
>          URL: https://issues.apache.org/jira/browse/STORM-1949
>      Project: Apache Storm
>   Issue Type: Bug
>     Reporter: Roshan Naik
>  Attachments: 1.x-branch-works-perfect.png
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roshan Naik updated STORM-1949:
---
    Attachment: wordcounttopo.zip

Attaching the wordcount topo that I used.

> Backpressure can cause spout to stop emitting and stall topology
> ----------------------------------------------------------------
>
>          Key: STORM-1949
>          URL: https://issues.apache.org/jira/browse/STORM-1949
>      Project: Apache Storm
>   Issue Type: Bug
>     Reporter: Roshan Naik
>     Assignee: Alessandro Bellina
>  Attachments: 1.x-branch-works-perfect.png, wordcounttopo.zip
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433795#comment-15433795 ] Roshan Naik commented on STORM-1949:
The amount of *additional* pressure this BP mechanism adds to ZK in its current state really should be sufficient reason to leave it disabled by default. If we fix the problem I noted in the description, as per Bobby's suggestion, that would put even more pressure on ZK. Putting such pressure on ZK (or Nimbus) from any subsystem in Storm is essentially a regression in terms of scaling ability, which then begets future fixes (PaceMaker, for instance).

> Backpressure can cause spout to stop emitting and stall topology
> ----------------------------------------------------------------
>
>          Key: STORM-1949
>          URL: https://issues.apache.org/jira/browse/STORM-1949
>      Project: Apache Storm
>   Issue Type: Bug
>     Reporter: Roshan Naik
>     Assignee: Alessandro Bellina
>  Attachments: 1.x-branch-works-perfect.png, wordcounttopo.zip
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)