Re: What's the best way to guarantee external delivery of messages with Storm
Will the HTTP event sink respond with some acknowledgement that it received whatever was sent? If so, could this be as simple as telling your bolt not to ack the tuple until this response is received from the HTTP service? -- Derek

On 9/26/14 10:10, Peter Neumark wrote: Thanks for the quick response! Unfortunately, we're forced to use HTTP. Any ideas?

On Fri, Sep 26, 2014 at 5:07 PM, Supun Kamburugamuva supu...@gmail.com wrote:

On Fri, Sep 26, 2014 at 10:49 AM, Peter Neumark peter.neum...@prezi.com wrote: Hi all, We want to replace a legacy custom app with Storm, but (being Storm newbies) we're not sure what's the best way to solve the following problem: An HTTP endpoint returns the list of events which occurred between two timestamps. The task is to continuously poll this event source for new events, optionally perform some transformation and aggregation operations on them, and finally make an HTTP request to an endpoint with some events. We thought of a simple topology:
1. A clock-spout determines which time interval to process.
2. A bolt takes the time interval as input, fetches the event list for that interval from the event source, and emits the events as individual tuples.
3. After some processing of the tuples, we aggregate them into fixed-size groups, which we send in HTTP requests to an event sink.
The big question is how to make sure that all events are successfully delivered to the event sink. I know Storm guarantees the delivery of tuples within the topology, but how could I guarantee that the HTTP requests to the event sink are also successful (and retried if necessary)?

I think this is not a question about Storm but rather a question about how to deliver a message reliably to some sink. From my experience it is a bit hard to achieve something like this with HTTP. This functionality is built into message brokers like RabbitMQ, ActiveMQ, Kafka, etc., and if you use a broker to send your events to the sink you can get a delivery guarantee. Thanks, Supun.
All help, suggestions and pointers welcome! Peter -- *Peter Neumark* DevOps guy @Prezi http://prezi.com -- Supun Kamburugamuva Member, Apache Software Foundation; http://www.apache.org E-mail: supu...@gmail.com; Mobile: +1 812 369 6762 Blog: http://supunk.blogspot.com
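If the sink does return an acknowledgement as Derek suggests, the bolt-side logic can be sketched in plain Java. This is a hypothetical helper, not Storm's API; the Supplier stands in for an HTTP POST that returns true on a 2xx response:

```java
import java.util.function.Supplier;

public class ReliableHttpSink {
    // Attempt delivery up to maxAttempts times; return true only once the
    // sink has acknowledged. In a Storm bolt you would call
    // collector.ack(tuple) only when this returns true, and
    // collector.fail(tuple) otherwise, so the spout replays the interval.
    static boolean deliverWithRetry(Supplier<Boolean> sendOnce, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sendOnce.get()) {   // e.g. HTTP POST returned 2xx
                return true;        // acknowledged: safe to ack the tuple
            }
        }
        return false;               // give up: fail the tuple so Storm replays it
    }
}
```

With this shape, the end-to-end retry comes from Storm's own replay: a failed tuple propagates back to the clock-spout, which re-emits the time interval.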
Re: netty reconnects
This could be https://issues.apache.org/jira/browse/STORM-510 The send thread is blocked on a connection attempt, and so no messages get sent out until the connection is re-established or it times out. -- Derek

On 9/26/14 13:47, Varun Vijayaraghavan wrote: I first tried increasing the max_retries to a much higher number (300) but that did not make a difference.

On Fri, Sep 26, 2014 at 2:46 PM, Varun Vijayaraghavan varun@gmail.com wrote: Hey, I've been facing the same issues in my topologies. It seems like a crash in a single worker would trigger a reconnect from other workers for x amount of time (30 x 10s = ~300 seconds in your case) before crashing themselves, thus leading to a catastrophic failure in the topology. There is a patch in 0.9.3 related to exponential backoff for netty connections, which may address the issue, but until then I did two things: a) increased the max_wait_ms to 15000 and b) decreased supervisor.worker.start.timeout.secs to 30, so that workers restart earlier.

On Fri, Sep 26, 2014 at 2:06 PM, Tyson Norris tnor...@adobe.com wrote: Hi - We are seeing workers dying and restarting quite a bit, apparently from netty connection issues. For example, the log below shows: * Reconnect for worker at 121:6700 * connection established to 121:6700 * closing connection to 121:6700 * Reconnect started to 121:6700 all within 1 second. We have the netty config updated to: storm.messaging.netty.max_retries: 30 storm.messaging.netty.max_wait_ms: 1 storm.messaging.netty.min_wait_ms: 1000 And the workers die pretty quickly because often 30 retries does not end up with a connection. Any suggestions for how to prevent netty from closing a connection immediately? I could not see any obvious reason in the code that this would happen. Thanks Tyson

2014-09-26 09:32:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700...
[5] 2014-09-26 09:32:04 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6701...
[6] 2014-09-26 09:32:11 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.10.180:6701...
[6] 2014-09-26 09:32:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.10.180:6702...
[6] 2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700...
[6] 2014-09-26 09:32:14 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6701...
[7] 2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700...
[7] 2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-/10.27.13.121:6700, [id: 0xb8b33bef, /10.27.10.180:33880 => /10.27.13.121:6700]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-/10.27.13.121:6700
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-/10.27.13.121:6700..., timeout: 60ms, pendings: 0
2014-09-26 09:32:19 b.s.m.n.Client [INFO] New Netty Client, connect to 10.27.13.121, 6700, config: , buffer_size: 5242880
2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700... [0]
2014-09-26 09:32:19 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-/10.27.13.121:6700, [id: 0x9dc224e6, /10.27.10.180:33881 => /10.27.13.121:6700]
-- - varun :)
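The exponential backoff that landed in 0.9.3 replaces the fixed retry wait with one that doubles per failed attempt up to a cap. A rough sketch of the idea (this is an illustration, not Storm's actual code; the min/max values below are examples):

```java
public class NettyBackoff {
    // Capped exponential backoff: the wait doubles on each failed connection
    // attempt but never exceeds maxWaitMs, so early retries are fast while
    // a long outage does not burn through max_retries in a few seconds.
    static long waitMs(int attempt, long minWaitMs, long maxWaitMs) {
        double wait = minWaitMs * Math.pow(2, attempt);
        return (long) Math.min(wait, (double) maxWaitMs);
    }
}
```

With a 1000 ms minimum and a 15000 ms cap, successive attempts wait roughly 1s, 2s, 4s, 8s, then 15s thereafter, instead of retrying at a fixed short interval against a worker that has not come back yet.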
Re: secure storm UI
This is available in the security branch. See https://github.com/apache/storm/blob/security/SECURITY.md You do not need to enable all of the security features to get UI auth. For authentication, look at ui.filter and ui.filter.params. For authorization, see nimbus.admins, ui.users, logs.users, and topology.users. -- Derek

On 9/26/14 14:34, Kushan Maskey wrote: Is there a way to secure the Storm UI page, like enabling a login so that only authorized people can access it? -- Kushan Maskey 817.403.7500 M. Miller Associates http://mmillerassociates.com/ kushan.mas...@mmillerassociates.com
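For reference, the settings Derek lists end up in storm.yaml along these lines. This is a hedged sketch following the Kerberos example in SECURITY.md; the filter class, principals, paths, and user names are all placeholders to adapt to your own servlet filter and accounts:

```yaml
# Authentication: a servlet filter in front of the UI (example values)
ui.filter: "org.apache.hadoop.security.authentication.server.AuthenticationFilter"
ui.filter.params:
  "type": "kerberos"
  "kerberos.principal": "HTTP/storm-ui.example.com"
  "kerberos.keytab": "/etc/security/keytabs/spnego.keytab"

# Authorization: who may see and do what once authenticated
nimbus.admins:
  - "storm_admin"
ui.users:
  - "ui_user"
logs.users:
  - "log_user"
```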
Re: Please fix the code samples in the documentation
I think this has been pointed out before. It is being tracked: https://issues.apache.org/jira/browse/STORM-385 -- Derek On 9/2/14, 15:34, Andras Hatvani wrote: Hi, To the Storm-developers: Please fix the code samples in the documentation, because currently every single one is unformatted, without syntax highlighting and in one row. Thanks in advance, Andras
Re: [DISCUSS] Apache Storm Release 0.9.3/0.10.0
I am supportive. I think it makes sense to move to 0.10.0 because of the significance of the changes. -- Derek

On 8/28/14, 15:34, P.Taylor Goetz wrote: I'd like to gather community feedback for the next two releases of Apache Storm.

0.9.3-incubating will be our next release. Please indicate (by JIRA ticket ID) which bug fixes and/or new features you would like to be considered for inclusion in the next release. If there is not an existing ticket for a particular issue or feature, please consider adding one. For the next and subsequent releases, we will be using a slightly different approach than what we did in the past. Instead of voting right away on a build, we will make one or more "unofficial" release candidate builds available prior to voting on an official release. This will give the Apache Storm community more time to discuss, evaluate, identify and fix potential issues before the official release. This should enable us to ensure the final release is as bug-free as possible.

Apache Storm 0.10.0 (STORM-216): As some of you are aware, the engineering team at Yahoo! has done a lot of work to bring security and multi-tenancy to Storm, and has contributed that work back to the community. Over the past few months we have been in the process of enhancing and syncing that work with the master branch in a separate branch labeled "security." That work is now nearing completion, and I would like us to consider merging it into master after the 0.9.3 release. Since the security work includes a large number of changes and enhancements, I propose we bump the version number to 0.10.0 for the first release to include those features. More information about the security branch can be found in this pull request [1], as well as the SECURITY.md file in the security branch [2]. I also discussed it in a blog post [3] on the Hortonworks website. Please feel free to direct any comments or questions about the security branch to the mailing list.
Similar to the process we’ll follow for 0.9.3, we plan to make several unofficial “development” builds available for those who would like to help with testing the new security features. -Taylor [1] https://github.com/apache/incubator-storm/pull/121 [2] https://github.com/apache/incubator-storm/blob/security/SECURITY.md [3] http://hortonworks.com/blog/the-future-of-apache-storm/
Re: Create multiple supervisors on same node
I also tried another scenario: instead of copying the entire storm home directory, I only use one storm home, but different storm-local dirs and ports, both specified in storm.yaml, and I can still create multiple supervisors. (Of course, every time before I start a new supervisor, I have to update storm.yaml with a different storm-local dir and ports.)

You will have two supervisors writing to the same log. I recommend creating two distinct storm home directories unless you have a good reason to have them shared. I think the code assumes it is the only supervisor writing in storm home. -- Derek

On 8/22/14, 14:08, Yu, Tao wrote: Thanks Harsha! Just cleaned the zookeeper data (stopped and restarted zookeeper) and tried again; now I can create multiple supervisors successfully! I also tried another scenario: instead of copying the entire storm home directory, I only use one storm home, but different storm-local dirs and ports, both specified in storm.yaml, and I can still create multiple supervisors. (Of course, every time before I start a new supervisor, I have to update storm.yaml with a different storm-local dir and ports.) So my new questions are:
1) What's the best approach to create multiple supervisors on the same node: a) each supervisor has its own storm home directory, or b) all supervisors share a common storm home directory? In both approaches, each supervisor has its own storm-local dir and ports.
2) When starting a supervisor, can we tell storm to use a custom configuration (.yaml)? For example: $ bin/storm supervisor --config conf/myConfig.yaml It seems like storm always uses conf/storm.yaml, and I do not see any documentation about specifying a custom config file.
Thanks, -Tao

-Original Message- From: Harsha [mailto:st...@harsha.io] Sent: Friday, August 22, 2014 12:57 PM To: user@storm.incubator.apache.org Subject: Re: Create multiple supervisors on same node

Tao, I tried the above steps and I am able to run two supervisors on the same node.
Did you check the logs for the supervisor under storm2? If it didn't create a local_dir/storm dir, then your supervisor daemon might not be running; check the logs for any errors. -Harsha

On Fri, Aug 22, 2014, at 09:20 AM, Yu, Tao wrote: Thanks Harsha! I tried your way, and here is what I have (major parts) in my storm.yaml: storm.local.dir: /opt/grid/tao/storm/storm-0.8.2/local_data/storm supervisor.slots.ports: - 6700 - 6701
1) I created the 1st supervisor, and I can see the specified sub-folder local_data/storm/supervisor was created under /opt/grid/tao/storm/storm-0.8.2. That's OK!
2) Then I copied the entire storm-0.8.2 folder to a new storm2 (/opt/grid/tao/storm/storm2).
3) Deleted the sub-folder local_data under storm2.
4) Updated the storm.yaml under storm2 with the change below: storm.local.dir: /opt/grid/tao/storm/storm2/local_data/storm supervisor.slots.ports: - 8700 - 8701
5) Under storm2, created a new supervisor. The new supervisor still has the 1st supervisor's ID, and under storm2 the sub-folder local_data/storm was not created. Does storm still use the local_data folder of the 1st storm home directory (storm/storm-0.8.2)?
Thanks, -Tao

-Original Message- From: Harsha [mailto:st...@harsha.io] Sent: Friday, August 22, 2014 11:28 AM To: user@storm.incubator.apache.org Subject: Re: Create multiple supervisors on same node

Tao, you need to delete the storm-local dir under your copied-over storm dir (storm2). Otherwise it will still pick up the same supervisor-id. -Harsha

On Fri, Aug 22, 2014, at 08:16 AM, Yu, Tao wrote: Thanks Derek! I tried your suggestion: copied the entire storm home directory (which, in my case, is storm-0.8.2) to a new directory storm2, then in the storm2 directory changed conf/storm.yaml with different ports, and tried to create a new supervisor. Still, I got the same supervisor ID as the 1st one (which I created from the storm-0.8.2 directory). Did I do anything incorrectly?
-Tao

-Original Message- From: Derek Dagit [mailto:der...@yahoo-inc.com] Sent: Friday, August 22, 2014 11:01 AM To: user@storm.incubator.apache.org Subject: Re: Create multiple supervisors on same node

The two supervisors are sharing the same state, and that is how they get the same randomly-generated ID. If I recall correctly, the default state directory is created in the current working directory of the process, so that is whatever directory you happen to be in when you start the supervisor. I think probably a good thing to do is copy the entire storm home directory, change the storm.yaml in the copy to be configured with different ports as you tried, and make sure to cd into the appropriate directory when you launch the supervisor. -- Derek

On 8/22/14, 9:49, Yu, Tao wrote: Hi all, Anyone know what's the requirement
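Putting the thread's conclusion together, a working two-supervisor setup might look like the sketch below, with each copy of the storm home carrying its own storm.yaml. Paths and ports are the ones from the thread; the essential points are distinct storm.local.dir values, non-overlapping slot ports, and deleting any copied storm-local state before the first start:

```yaml
# /opt/grid/tao/storm/storm-0.8.2/conf/storm.yaml  (supervisor 1)
storm.local.dir: "/opt/grid/tao/storm/storm-0.8.2/local_data/storm"
supervisor.slots.ports:
  - 6700
  - 6701

# /opt/grid/tao/storm/storm2/conf/storm.yaml  (supervisor 2)
# Delete any local_data copied from supervisor 1 before starting,
# or this supervisor will reuse the first supervisor's ID.
storm.local.dir: "/opt/grid/tao/storm/storm2/local_data/storm"
supervisor.slots.ports:
  - 8700
  - 8701
```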
Re: worker always timeout to heartbeat and was restarted by supervisor
1) does it related to GC problem?

This is usually the cause. As the worker consumes more and more of the heap, GC takes longer and longer each time. Eventually it takes so long that the heartbeats to the supervisor do not happen. There could be a spike or skew in your data such that one or more workers cannot handle it with their heap settings. -- Derek

On 8/20/14, 5:26, DashengJu wrote: hi, all In our production environment, we have a topology named logparser_mobile_nginx; it has 50 workers, the spout has 48 executors, bolt_parser has 1000 executors, and bolt_saver has 50 executors. The topology runs normally most of the time, but 1-5 workers restart every 1-2 hours. When we look at the logs of the supervisor and worker, we found: 1) the worker has no error or exception; 2) the supervisor says the worker did not heartbeat and a timeout happened. Because the worker has no log, I do not know why the worker did not heartbeat. Does anyone have any ideas how to investigate? 0) Is it caused by the worker exiting? 1) Is it related to a GC problem? 2) Is it related to a memory problem? If so, I think the JVM would report a memory exception in the worker log. By the way, some small topologies work well in the same environment. Below is the supervisor log:
--
2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started
2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id logparser_mobile_nginx-259-1408518662, :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id logparser_nginx-265-1408521077, :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id logparser_mobile_nginx-259-1408518662, :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676.
State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id app_upload_urls-218-1408503096, :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
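One way to test the GC theory is to give the workers more heap and turn on GC logging so long pauses become visible in the worker logs. A storm.yaml sketch; the heap size and log path are assumptions to tune, the GC flags are standard HotSpot options, and %ID% is the placeholder Storm substitutes with the worker's port in childopts (verify your version supports it):

```yaml
# Larger heap plus GC logging: long pauses that block heartbeats will show
# up as long collection times in the GC log.
worker.childopts: "-Xmx2g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/storm/gc-worker-%ID%.log"
```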
Re: NoSuchMethodError
I skimmed grepcode, and found that Yaml(BaseConstructor) was available from snakeyaml version 1.7 onward. I would check if a version of snakeyaml <= 1.6 is in your classpath somehow. -- Derek

On 8/4/14, 14:34, Ratay, Steve wrote: I am trying to run a local cluster using Storm 0.9.2, and getting a NoSuchMethodError. I am using Eclipse and have pulled all the Storm dependencies into my project. Most notably, I have the snakeyaml-1.11.jar file. Anyone else seeing this error or know where I've gone wrong?
java.lang.NoSuchMethodError: org.yaml.snakeyaml.Yaml.<init>(Lorg/yaml/snakeyaml/constructor/BaseConstructor;)V
at backtype.storm.utils.Utils.findAndReadConfigFile(Utils.java:144) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.utils.Utils.readDefaultConfig(Utils.java:167) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.utils.Utils.readStormConfig(Utils.java:191) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.config$read_storm_config.invoke(config.clj:121) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.testing$mk_local_storm_cluster.doInvoke(testing.clj:123) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at clojure.lang.RestFn.invoke(RestFn.java:421) ~[clojure-1.5.1.jar:na]
at backtype.storm.LocalCluster$_init.invoke(LocalCluster.clj:28) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.LocalCluster.<init>(Unknown Source) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at analytics.AnalyticsTopology.main(AnalyticsTopology.java:38) ~[classes/:na]
Thanks, Steve
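A quick way to find out which jar a class is actually loaded from is to ask the class itself. A small stdlib-only helper (hypothetical, for debugging from within the same JVM and classpath as the topology):

```java
import java.security.CodeSource;

public class WhichJar {
    // Report where a class was loaded from, to spot a stale jar (e.g. an
    // old snakeyaml) shadowing the expected one. JDK bootstrap classes
    // have no code source and return null.
    static String locate(String className) throws ClassNotFoundException {
        CodeSource src = Class.forName(className)
                .getProtectionDomain().getCodeSource();
        return src == null ? null : src.getLocation().toString();
    }
}
```

Calling WhichJar.locate("org.yaml.snakeyaml.Yaml") from the Eclipse project should reveal whether something other than snakeyaml-1.11.jar is first on the classpath.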
Re: intra-topology SSL transport
In the security branch of storm, worker-worker communication is encrypted (Blowfish) with a shared secret. STORM-348 will add authentication to worker-worker. For Thrift (nimbus, DRPC), the security branch has SASL/Kerberos authentication, and you should be able to configure encryption via SASL as well. We have not tried enabling encryption with SASL. -- Derek

On 7/23/14, 14:05, Isaac Councill wrote: Hi, I've been working with storm on mesos, but I need to make sure all workers are messaging over SSL, since streams may contain sensitive information for almost all of my use cases. stunnel seems like a viable option, but I dislike having complex port forwarding arrangements and would prefer code to config in this case. As an exercise to see how much work it would be, I forked storm and modified the storm-netty package to use SSL with the existing nio. Not so bad, and lein tests pass. Still wrapping my head around the storm codebase. Would using my modified storm-netty Context as storm.messaging.transport be enough to ensure streams are encrypted, or would I need to also attack the thrift transport plugin? Also, is anyone else interested in locking storm down with SSL?
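For reference, the security branch enables the worker-worker encryption Derek mentions by swapping in a Blowfish tuple serializer via storm.yaml. A sketch based on the security branch docs; the key value here is a placeholder, generate your own secret:

```yaml
# Encrypt tuples on the wire between workers with a shared secret.
topology.tuple.serializer: "backtype.storm.security.serialization.BlowfishTupleSerializer"
topology.tuple.serializer.blowfish.key: "0123456789abcdef"   # placeholder key
```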
Re: Measuring a topology performance
What's the recommended way to measure the avg. time of the tuple spending in the topology until its full processing?

You can do this with acking enabled. In the UI, go to a spout and look for Complete Latency. -- Derek

On 7/13/14, 7:03, 唐 思成 wrote: The UI has a metric called latency, which means how long a bolt takes to process a tuple.

On Jul 13, 2014, at 5:49 PM, Vladi Feigin vladi...@gmail.com wrote: Hi All, What's the recommended way to measure the avg. time a tuple spends in the topology until its full processing? We use Storm version 0.8.2 and have topologies with acks and without. Thank you, Vladi
Re: Storm UI: not displayed Executors
This should be fixed with either STORM-370 (merged) or STORM-379 (pull request open). The script that sorts the table did not check that the list was of non-zero size before it attempted to sort, and that resulted in an exception that halted subsequent rendering on the page. You can:
- check out the latest storm and use that, or
- cherry-pick commit 31c786c into the version you are using.
-- Derek

On 7/11/14, 5:54, 川原駿 wrote: Hello. Recently I upgraded storm to 0.9.2. In the Component summary page of the Storm UI, Executors is not displayed whenever the emitted count of its spout/bolt is 0. Please tell me the solution. Thanks.
Re: Spout process latency
It should be a windowed average of the time between when the component receives a tuple and when it acks the tuple. This can be slower if there is batching, aggregating, or joining happening (the component must wait for a number of other tuples to arrive before it can ack). On the UI, there are tooltips that explain the measurements. They appear after hovering over the label. -- Derek

On 7/9/14, 15:22, Raphael Hsieh wrote: Can somebody explain to me what might cause the spout to have a large process latency? Currently my spout0 and $spoutcoord-spout0 have latencies higher than I would like. I'm consuming data from a Kafka stream. How is this process latency measured? Is it measuring the amount of time it takes to fill a batch with data and send it to the first bolt in the topology? Thanks
Re: what does each field of storm UI mean?
Adrian, If you hover over the title of the field, a pop-up should appear that explains what it means. -- Derek

On 6/17/14, 21:05, 이승진 wrote: Dear storm users, I want to see the performance of each bolt and decide the degree of parallelism. In the Storm UI there are several fields which are confusing, so I would be glad if you could explain them.
Capacity (last 10m) - is this the average capacity per second over the last 10 minutes for a single executor? For example, if Capacity is 1.2, does that mean a single executor processed 1.2 messages per second on average?
Execute latency and Process latency - Are these average values or the values for the last processed message? What is the difference between them, and what is the difference between them and Capacity?
Sincerely, Adrian SJ Lee
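On the Capacity question specifically: it is not messages per second. Roughly, the UI derives it from the executed count and execute latency over the measurement window, i.e. the fraction of the window the executor spent executing tuples. A sketch of that formula (an illustration of the idea, not the UI's exact code):

```java
public class UiMetrics {
    // Capacity ~= fraction of the window the executor was busy executing.
    // Values near (or above) 1.0 mean the executor is saturated and more
    // parallelism is probably needed; it is not a messages-per-second rate.
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return (executed * executeLatencyMs) / windowMs;
    }
}
```

So a Capacity of 1.2 suggests the executor was asked for about 20% more work than fits in the window, which is a signal to raise that bolt's parallelism.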
Re: Supervisor kills *all* workers for topology due to heartbeat :timed-out state
:timed-out means that the worker did not heartbeat to the supervisor in time. (This heartbeat happens via the local disk.) Check that your workers have enough JVM heap space. If not, garbage collection in the JVM will cause progressively slower heartbeats until the supervisor thinks the workers are dead and kills them. topology.worker.childopts=-Xmx{{VALUE}}, e.g. 2048m or 2g -- Derek

On 6/14/14, 22:39, Justin Workman wrote: From what I have seen, if nimbus kills and reassigns the worker process, the supervisor logs will report that the worker is in a disallowed state. I have seen the supervisor report the worker in a timed-out state and restart the worker processes, generally when the system is under heavy CPU load. We recently ran into this issue while running a topology on virtual machines. Increasing the number of virtual cores assigned to the VMs resolved the restart issues. Thanks Justin Sent from my iPhone

On Jun 14, 2014, at 11:32 AM, Andrew Montalenti and...@parsely.com wrote: I am trying to understand why, for a topology I am trying to run on 0.9.1-incubating, the supervisor on the machine is killing *all* of the topology's Storm workers periodically. Whether I use topology.workers=1, 2, 4, or 8, I always get logs like this: https://gist.github.com/amontalenti/cd7f380f716f1fd17e1b Which basically indicates that the supervisor thinks all the workers timed out at exactly the same time, and then it kills them all. I've tried tweaking the worker timeout seconds, bumping it up to e.g. 120 secs, but this hasn't helped at all. No matter what, periodically the workers just get whacked by the supervisor and the whole topology has to restart. I notice that this does happen less frequently if the machine is under less load; e.g. if I drop topology.max.spout.pending *way* down, to e.g. 100 or 200, then it runs for a while without crashing. But I've even seen it crash in this state.
I saw on some other threads that people indicated that the supervisor will kill all workers if the nimbus fails to see a heartbeat from zookeeper. Could someone walk me through how I could figure out if this is the case? Nothing in the logs seems to point me in this direction. Thanks! Andrew
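The knobs involved in this thread all live in storm.yaml. A sketch of the relevant ones; the values shown are the approximate 0.9.x defaults except the heap, which is an example, so verify against the defaults.yaml shipped with your version:

```yaml
supervisor.worker.timeout.secs: 30        # supervisor kills a worker after this
supervisor.worker.start.timeout.secs: 120 # grace period for a freshly launched worker
nimbus.task.timeout.secs: 30              # nimbus-side liveness check (via ZK)
worker.heartbeat.frequency.secs: 1        # how often workers heartbeat locally
topology.worker.childopts: "-Xmx2g"       # example heap; tune to your load
```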
Re: [VOTE] Storm Logo Contest - Final Round
#6 - 2pt. #9 - 2pts. #10 - 1pt. -- Derek On 6/9/14, 13:38, P. Taylor Goetz wrote: This is a call to vote on selecting the winning Storm logo from the 3 finalists. The three candidates are: * [No. 6 - Alec Bartos](http://storm.incubator.apache.org/2014/04/23/logo-abartos.html) * [No. 9 - Jennifer Lee](http://storm.incubator.apache.org/2014/04/29/logo-jlee1.html) * [No. 10 - Jennifer Lee](http://storm.incubator.apache.org/2014/04/29/logo-jlee2.html) VOTING Each person can cast a single vote. A vote consists of 5 points that can be divided among multiple entries. To vote, list the entry number, followed by the number of points assigned. For example: #1 - 2 pts. #2 - 1 pt. #3 - 2 pts. Votes cast by PPMC members are considered binding, but voting is open to anyone. In the event of a tie vote from the PPMC, votes from the community will be used to break the tie. This vote will be open until Monday, June 16 11:59 PM UTC. - Taylor
Re: Workers constantly restarted due to session timeout
1) Is it appropriate to run Zookeeper in parallel on the same node with the storm services?

I recommend keeping them separate, and even then putting ZK storage on a path on its own disk device if possible. ZK is a bottleneck for storm, and when it is too slow lots of bad things can happen. Some folks run ZK on shared hosts (with or without VMs). In those situations, VMs or processes owned by other users doing unrelated things can put load on the disk, and that will dramatically slow down ZK.

2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its client. Should we downgrade our installation?

I am not sure about that, since we've been running with ZK 3.4.5 in storm (and on the server). It might work very well, but I have not tried it. I do not remember if anyone on this list has identified any issues with this 3.3.3 + 3.4.5 combo.

One setting we changed to dramatically improve performance with ZK was the system property '-Dzookeeper.forceSync=no' on the server. Normally, ZK will sync to disk on every write, and that causes two seeks: one for the data and one for the data log. It gets really expensive with all of the workers heartbeating in through ZK. Be warned that with only one ZK server, an outage could leave you in an inconsistent state. You might check to see if the ZK server is keeping up. There are tools like iotop that can give information about disk load. -- Derek

On 6/3/14, 13:14, Michael Dev wrote: Thank you Derek for the explanation of the difference between :disallowed and :timed-out. That was extremely helpful in understanding what decisions Storm is making. I increased the timeouts for both messages to 5 minutes and returned the zookeeper session timeouts to their default values. This made it plain to see periods in time where the Uptime column for the busiest component's worker would not update (1-2 minutes, potentially never, resulting in a worker restart).
ZK logs report constant disconnects and reconnects while the Uptime is not updating:
16:28:30,440 - INFO NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:54004 which has sessionid 0x1464f1fddc1018f
16:31:18,364 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from /10.49.21.151:34419
16:31:18,365 - WARN ZooKeeperServer@793 - Connection request from old client /10.49.21.151:34419; will be dropped if server is in r-o mode
16:31:18,365 - INFO ZooKeeperServer@832 - Client attempting to renew session 0x264f1fddc4021e at /10.49.21.151:34419
16:31:18,365 - INFO Learner@107 - Revalidating client: 0x264f1fddc4021e
16:31:18,366 - INFO ZooKeeperServer@588 - Invalid session 0x264f1fddc4021e for client /10.49.21.151:34419, probably expired
16:31:18,366 - NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:34419 which had sessionid 0x264f1fddc4021e
16:31:18,378 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from /10.49.21.151:34420
16:31:18,391 - WARN ZooKeeperServer@793 - Connection request from old client /10.49.21.151:34420; will be dropped if server is in r-o mode
16:31:18,392 - INFO ZooKeeperServer@839 - Client attempting to establish new session at /10.49.21.151:34420
16:31:18,394 - INFO ZooKeeperServer@595 - Established session 0x1464fafddc10218 with negotiated timeout 2 for client /10.49.21.151:34420
16:31:44,002 - INFO NIOServerCnxn@1001 - Closed socket connection for /10.49.21.151:34420 which had sessionid 0x1464fafddc10218
16:32:48,055 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from /10.49.21.151:34432
16:32:48,056 - WARN ZooKeeperServer@793 - Connection request from old client /10.49.21.151:34432; will be dropped if server is in r-o mode
16:32:48,056 - INFO ZooKeeperServer@832 - Client attempting to renew session 0x2464fafddc4021f at /10.49.21.151:34432
16:32:48,056 - INFO Learner@107 - Revalidating client: 0x2464fafddc4021f
16:32:48,057 - INFO ZooKeeperServer@588 - Invalid session
0x2464fafddc4021f for client /10.49.21.151:34432, probably expired
16:32:48,057 - NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:34432 which had sessionid 0x2464fafddc4021f
...etc until Storm has had enough and restarts the worker, resulting in this:
16:47:20,706 - NIOServerCnxn@349 - Caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x3464f20777e01cf, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)

1) Is it appropriate to run Zookeeper in parallel on the same node with the storm services? 2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its client. Should we downgrade our installation?

Date: Sat, 31 May 2014 13:50:57 -0500 From: der...@yahoo-inc.com To: user@storm.incubator.apache.org Subject: Re: Workers
Re: Workers constantly restarted due to session timeout
Are you certain that nimbus.task.timeout.secs is the correct config? That config controls the length of time before nimbus thinks a worker has timed out. https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L369-L372 Its default is 30 seconds. https://github.com/apache/incubator-storm/blob/master/conf/defaults.yaml#L45

storm.zookeeper.connection.timeout: 30
storm.zookeeper.session.timeout: 30

So these will make the situation worse while workers are losing connections to ZK, since they will cause the workers to wait longer before reconnecting. They could wait until nimbus thinks the worker is dead before trying to reconnect.

supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1400876249, :storm-id test-46-1400863199, :executors #{[-1 -1]}, :port 6700}

Here if the State is :disallowed, then that means it is Nimbus that de-scheduled the worker on that node--very probably in this case because it thought it was dead. When the supervisor sees this, it will kill the worker. (A state of :timed-out means instead that the worker did not heartbeat to its supervisor in time.)

If the CPU load on the worker was high enough to prevent heartbeats, then I would expect to see the :timed-out state above instead of :disallowed. The reason is that the worker has only 5 seconds to do those heartbeats, while it has 30 seconds to heartbeat to nimbus (via ZK). (More often what happens to cause this is that memory has run out and garbage collection stops everything just long enough.)

The real question is why connections from the worker to ZK are timing out in the first place. What about the ZK servers? Sometimes ZooKeeper servers cannot keep up, and that causes pretty severe problems with timeouts. 
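One caveat worth adding here (my note, not part of the original thread): storm's ZooKeeper timeouts are specified in milliseconds, so a literal value of 30 would be read as 30 ms rather than 30 seconds, which alone could explain constant session expiry. A hedged sketch of millisecond-denominated values, using what I believe are the defaults:

```yaml
# storm.yaml -- both timeouts are in milliseconds
storm.zookeeper.session.timeout: 20000     # believed default: 20000 ms (20 s)
storm.zookeeper.connection.timeout: 15000  # believed default: 15000 ms (15 s)
```

Verify the units against the defaults.yaml of the Storm version in use before copying these values.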
-- Derek On 5/30/14, 17:51, Michael Dev wrote:

Michael R, We don't have GC logging enabled yet. I lean towards agreeing with Derek that I don't think it's the issue, but I will take a look at logging on Monday just to verify.

Derek D, Are you certain that nimbus.task.timeout.secs is the correct config? Tracing through the github code, it would seem to me that this is used as the timeout value when making a Thrift connection to the Nimbus node. I thought the logs indicated the timeout was occurring in the session connection to zookeeper, as evidenced by ClientCxn being a Zookeeper class.

I discovered that we were running with the default maxSessionTimeout zookeeper config of 40 seconds. This would explain why our storm config of 5 minutes was not being picked up (but obviously not the root problem, nor why timeout messages report 14-20 second timeout values). Typically we saw losses in connection occur when our cluster becomes super busy with a burst of data pushing workers to near 100% CPU. I'm testing the following configs over the weekend to see if they at least allow us to prevent chronic worker restarting during the brief high CPU periods.

Our current setup is as follows:
- Storm 0.9.0.1
- 3 Storm node cluster
- 1 Supervisor running per Storm node
- 1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
- 3 Workers per topology
- Variable number of executors per component depending on how slow that component is. For example, file i/o has many executors (say 12) while in-memory validation has only 3 executors. Always maintaining a multiple of the number of workers for even distribution.
- Kryo serialization with Java Serialization failover disabled to ensure we're using 100% kryo between bolts. 
zoo.cfg:

tickTime=2000
dataDir=/srv/zookeeper/data
clientPort=2182
initLimit=5
syncLimit=2
skipACL=true
maxClientCnxns=1000
maxSessionTimeout=30
server.1=node1
server.2=node2
server.3=node3

storm.yaml:

storm.zookeeper.port: 2182
storm.local.dir: /srv/storm/data
nimbus.host: node1
storm.zookeeper.servers:
 - node1
 - node2
 - node3
supervisor.slot.ports:
 - 6700
 - 6701
 - 6702
 - 6703
 - 6704
java.library.path: /usr/lib:/srv/storm/lib
# Storm 0.9 netty support
storm.messaging.transport: backtype.storm.messaging.netty.Context
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
# Timeout band-aids in testing
topology.receiver.buffer.size: 2
storm.zookeeper.connection.timeout: 30
storm.zookeeper.session.timeout: 30

Date: Thu, 29 May 2014 12:56:19 -0500 From: der...@yahoo-inc.com To: user@storm.incubator.apache.org Subject: Re: Workers constantly restarted due to session timeout

OK, so GC is probably not the issue. Specifically, this is a connection timeout to ZK from the worker, and it is resulting in nimbus
Re: Workers constantly restarted due to session timeout
2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?

Check your JVM max heap size (-Xmx). If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.

-- Derek On 5/23/14, 15:38, Michael Dev wrote:

Hi all, We are seeing our workers constantly being killed by Storm with the following logs:

worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect

supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1400876249, :storm-id test-46-1400863199, :executors #{[-1 -1]}, :port 6700}

Eventually Storm decides to just kill the worker and restart it, as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread being choked out due to very high CPU load on the machine (near 100%). I have increased the connection timeouts in the storm.yaml config file, yet Storm seems to continue to use some unknown value for the above client session timeout messages:

storm.zookeeper.connection.timeout: 30
storm.zookeeper.session.timeout: 30

1) What timeout config is appropriate for the above timeout message?
2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?

Thanks, Michael
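The heap check Derek suggests can be combined with GC logging (which Michael mentions wanting to enable earlier in the thread). A sketch, assuming a HotSpot JVM; the heap size, flags, and log path are illustrative, not recommendations:

```yaml
# storm.yaml -- raise the worker heap and log GC pauses to confirm or rule out GC stalls
worker.childopts: "-Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc-worker.log"
```

If a long full-GC pause shows up in the GC log right before a session-timeout message, that points at heap pressure rather than the network or ZK.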
Re: Test timed out (5000ms)
https://git.corp.yahoo.com/storm/storm/blob/master-security/storm-core/src/clj/backtype/storm/testing.clj#L167 Try changing this. The time-out was added to prevent the case when a test would hang indefinitely. Five seconds was thought to be more than enough time to let tests pass. If it needs to be longer we could increase it. If you continue to see the time-out, it could be that the test really is hanging somehow.

-- Derek On 5/19/14, 4:57, Sergey Pichkurov wrote:

Hello, Storm community. I'm trying to write a unit test with storm.version: 0.9.1-incubating and storm-kafka-0.8-plus: 0.4.0. My topology has one Kafka spout and one storing bolt which has Spring inside (context initialized in the prepare() method). When I run the test with Testing.completeTopology(), I am getting this error:

java.lang.AssertionError: Test timed out (5000ms)
	at backtype.storm.testing$complete_topology.doInvoke(testing.clj:475)
	at clojure.lang.RestFn.invoke(RestFn.java:826)
	at backtype.storm.testing4j$_completeTopology.invoke(testing4j.clj:61)
	at backtype.storm.Testing.completeTopology(Unknown Source)

This error does not always arise; sometimes the test passes successfully. Where can I change this timeout parameter? Or how can I disable this timeout?
Re: Test timed out (5000ms)
https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/testing.clj#L187 Corrected link.

-- Derek On 5/19/14, 10:10, Sergey Pichkurov wrote:

I think that 5 sec is not always enough for Spring init. Your link is not resolving. *Pichkurov Sergey, Java Developer*

On Mon, May 19, 2014 at 5:18 PM, Derek Dagit der...@yahoo-inc.com wrote: Try changing this. The time-out was added to prevent the case when a test would hang indefinitely. Five seconds was thought to be more than enough time to let tests pass. If it needs to be longer we could increase it. If you continue to see the time-out, it could be that the test really is hanging somehow. -- Derek
Re: Weirdness running topology on multiple nodes
That is odd. I have seen things like this happen when there are DNS configuration issues, but you have even updated /etc/hosts.

* What does /etc/nsswitch.conf have for the hosts entry? This is what mine has:

hosts: files dns

I think that the java resolver code honors this setting, and this will cause it to look at /etc/hosts first for resolution.
* Firewall settings could also cause this. (Pings would work while worker-worker communications might not.)
* Failing that, maybe watch network packets to discover with what the workers are really trying to communicate?

-- Derek On 5/7/14, 10:11, Justin Workman wrote:

We have spent the better part of 2 weeks now trying to get a pretty basic topology running across multiple nodes. I am sure I am missing something simple, but for the life of me I cannot figure it out. Here is the situation: I have 1 nimbus server and 5 supervisor servers, with Zookeeper running on the nimbus server and 2 supervisor nodes. These hosts are all virtual machines (4 CPUs, 8GB RAM) running in an OpenStack deployment. If all of the guests are running on the same physical hypervisor, then the topology starts up just fine and runs without any issues. However, if we take the guests and spread them out over multiple hypervisors (in the same OpenStack cluster), the topology never really completely starts up. Things start to run, some messages are pulled off the spout, but nothing ever makes it all the way through the topology and nothing is ever ack'd. In the worker logs we get messages about reconnecting and eventually a Remote host unreachable error, and Async Loop Died. This used to result in a NumberFormat exception; reducing the netty retries from 30 to 10 resolved the NumberFormat error, and now we get the following:

2014-05-07 09:00:51 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... 
[9]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:53 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
	at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:107) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
	at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:78) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
	at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:77) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
	at backtype.storm.disruptor$consume_loop_STAR_$fn__1577.invoke(disruptor.clj:89) ~[na:na]
	at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433) ~[na:na]
	at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
	at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
	at backtype.storm.messaging.netty.Client.send(Client.java:125) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
	at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398$fn__4399.invoke(worker.clj:319) ~[na:na]
	at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398.invoke(worker.clj:308) ~[na:na]
	at 
backtype.storm.disruptor$clojure_handler$reify__1560.onEvent(disruptor.clj:58) ~[na:na]

And in the supervisor logs we see errors about the workers timing out and not starting up all the way; we also see executor timeouts in the nimbus logs. But we do not see any errors in the Zookeeper logs, and the Zookeeper stats look fine. There do not appear to be any real network issues: I can run a continuous flood ping between the hosts, with varying packet sizes, with minimal latency and no dropped packets. I have also attempted to add all hosts to the local hosts files on each machine without any difference. I have also played with adjusting the different heartbeat timeouts and intervals without any luck, and I have also deployed this same setup to a 5 node cluster on physical hardware (24 cores, 64GB RAM, and a lot of local disks), and we had the same issue. The topology would start, but no data ever made it through the topology. The only way I have ever been able to get the topology to work
Re: Topologies are disappearing??? How to debug?
Make sure you do not have a second nimbus daemon running by accident. I saw this one time after someone had launched nimbus on a different host, yet the file system on which nimbus was storing its state was an NFS mount. It took a comically long time to figure out that the second remote nimbus daemon was clearing state as soon as the first local daemon was writing it.

-- Derek On 5/1/14, 11:59, Software Dev wrote: Over the last several days some/all of our topologies are disappearing from Nimbus. What are some possible explanations for this? Where should I look to debug this problem? Thanks
Re: Is it a must to have /etc/hosts mapping or a DNS in a multinode setup?
I have not tried it, but there is a config for this purpose: https://github.com/apache/incubator-storm/blob/dc4de425eef5701ccafe0805f08eeb011baae0fb/storm-core/src/jvm/backtype/storm/Config.java#L122-L131 -- Derek On 4/29/14, 0:41, Sajith wrote: Hi all, Is it a must to have a /etc/hosts mapping or a DNS in a multinode storm cluster? Can't supervisors talk to each other through ZooKeeper or nimbus using IP addresses directly ?
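I have not confirmed which option those linked lines define, but the setting generally used for this purpose is storm.local.hostname, which lets each node advertise an address explicitly instead of relying on DNS or /etc/hosts. A sketch (the address is illustrative; verify the option name against the Config.java lines linked above):

```yaml
# storm.yaml on each node -- advertise this node's address directly
storm.local.hostname: "10.0.0.5"
```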
Re: storm starter ExclamationTopology
In your storm cluster , you need to verify first, nimbus is running properly or not. Check nimbus.log in $STORM_HOME/logs directory for error logs. Also check nimbus.host parameter in ~/.storm/storm.yaml. Yeah, that's what I was writing in my reply. I'll go ahead and add below: Start nimbus again, and make sure it is up. If your nimbus host is the same host, try (assuming from here nimbus port is 6627): ``` $ telnet localhost 6627 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. ``` If you see that, then nimbus is up-and-running (accepting connections at least). Check: - storm.yaml files have correct nimbus.host and nimbus.thrift.port - firewall settings - routing (What interface did nimbus open a port on? `netstat -lnt | grep 6627`) If not, check: - Make sure likewise ZooKeeper is running. - logs/nimbus.log (Is there some other issue?) -- Derek On 4/24/14, 11:48, Nishu wrote: In your storm cluster , you need to verify first, nimbus is running properly or not. Check nimbus.log in $STORM_HOME/logs directory for error logs. Also check nimbus.host parameter in ~/.storm/storm.yaml. On Thu, Apr 24, 2014 at 9:56 PM, Bilal Al Fartakh alfartaj.bi...@gmail.comwrote: and the question is , what should I fix dear experts ? 
:) 2014-04-24 16:23 GMT+00:00 Bilal Al Fartakh alfartaj.bi...@gmail.com: ~/src/storm-0.8.1/bin/storm jar /root/src/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.ExclamationTopology demo *I tried to run this and it said that the problem is with the nimbus connection , but my storm client (and supervisor in the same time ) is connected with my nimbus (shown in Strom UI )* Running: java -client -Dstorm.options= -Dstorm.home=/root/src/storm-0.8.1 -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -cp /root/src/storm-0.8.1/storm-0.8.1.jar:/root/src/storm-0.8.1/lib/asm-4.0.jar:/root/src/storm-0.8.1/lib/commons-codec-1.4.jar:/root/src/storm-0.8.1/lib/carbonite-1.5.0.jar:/root/src/storm-0.8.1/lib/kryo-2.17.jar:/root/src/storm-0.8.1/lib/clout-0.4.1.jar:/root/src/storm-0.8.1/lib/clojure-1.4.0.jar:/root/src/storm-0.8.1/lib/ring-servlet-0.3.11.jar:/root/src/storm-0.8.1/lib/hiccup-0.3.6.jar:/root/src/storm-0.8.1/lib/disruptor-2.10.1.jar:/root/src/storm-0.8.1/lib/tools.cli-0.2.2.jar:/root/src/storm-0.8.1/lib/snakeyaml-1.9.jar:/root/src/storm-0.8.1/lib/joda-time-2.0.jar:/root/src/storm-0.8.1/lib/jetty-util-6.1.26.jar:/root/src/storm-0.8.1/lib/commons-exec-1.1.jar:/root/src/storm-0.8.1/lib/jetty-6.1.26.jar:/root/src/storm-0.8.1/lib/servlet-api-2.5.jar:/root/src/storm-0.8.1/lib/jzmq-2.1.0.jar:/root/src/storm-0.8.1/lib/curator-framework-1.0.1.jar:/root/src/storm-0.8.1/lib/httpclient-4.1.1.jar:/root/src/storm-0.8.1/lib/slf4j-log4j12-1.5.8.jar:/root/src/storm-0.8.1/lib/clj-time-0.4.1.jar:/roo 
t/src/storm-0.8.1/lib/commons-lang-2.5.jar:/root/src/storm-0.8.1/lib/libthrift7-0.7.0.jar:/root/src/storm-0.8.1/lib/log4j-1.2.16.jar:/root/src/storm-0.8.1/lib/servlet-api-2.5-20081211.jar:/root/src/storm-0.8.1/lib/tools.logging-0.2.3.jar:/root/src/storm-0.8.1/lib/ring-core-0.3.10.jar:/root/src/storm-0.8.1/lib/minlog-1.2.jar:/root/src/storm-0.8.1/lib/objenesis-1.2.jar:/root/src/storm-0.8.1/lib/jline-0.9.94.jar:/root/src/storm-0.8.1/lib/commons-io-1.4.jar:/root/src/storm-0.8.1/lib/ring-jetty-adapter-0.3.11.jar:/root/src/storm-0.8.1/lib/jgrapht-0.8.3.jar:/root/src/storm-0.8.1/lib/json-simple-1.1.jar:/root/src/storm-0.8.1/lib/tools.macro-0.1.0.jar:/root/src/storm-0.8.1/lib/commons-fileupload-1.2.1.jar:/root/src/storm-0.8.1/lib/compojure-0.6.4.jar:/root/src/storm-0.8.1/lib/httpcore-4.1.jar:/root/src/storm-0.8.1/lib/commons-logging-1.1.1.jar:/root/src/storm-0.8.1/lib/guava-13.0.jar:/root/src/storm-0.8.1/lib/curator-client-1.0.1.jar:/root/src/storm-0.8.1/lib/math.numeric-tower-0.0.1.jar:/roo t/src/storm-0.8.1/lib/junit-3.8.1.jar:/root/src/storm-0.8.1/lib/slf4j-api-1.5.8.jar:/root/src/storm-0.8.1/lib/reflectasm-1.07-shaded.jar:/root/src/storm-0.8.1/lib/core.incubator-0.1.0.jar:/root/src/storm-0.8.1/lib/zookeeper-3.3.3.jar:/root/src/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar:/root/.storm:/root/src/storm-0.8.1/bin -Dstorm.jar=/root/src/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.ExclamationTopology demo Exception in thread main java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused at backtype.storm.utils.NimbusClient.(NimbusClient.java:36) at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:17) at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:53) at storm.starter.ExclamationTopology.main(ExclamationTopology.java:59) Caused by: org.apache.thrift7.transport.TTransportException: 
java.net.ConnectException: Connection refused at org.apache.thrift7.transport.TSocket.open(TSocket.java:183) at
Re: Storm UI
it displays different stats depending on whether or not I show/hide system stats. This is expected. There should be a tool tip on that button that says something like toggle inclusion of system components. How much the stats change seems to differ between Chrome and IE. That button is setting a cookie's value to true/false. The math is done on the server side and not in the browser, so the difference in the browser used should not matter beyond setting the cookie. If the toggle is not taking effect at all for some browsers, then we should create a new Issue to take a look. -- Derek On 4/2/14, 8:04, David Crossland wrote: I have a curious issue with the UI, it displays different stats depending on whether or not I show/hide system stats. How much the stats change seems to differ between Chrome and IE. Is this a known issue? Thanks David
Re: Error when trying to use multilang in a project built from scratch (not storm-starter)
Would you check your supervisor.log for a message like: Could not extract dir from jarpath

-- Derek On 3/10/14, 17:26, Chris James wrote:

Derek: No, I cannot cd into that directory, but I can cd into the directory one up from it (dummy-topology-1-1394418571). That directory contains stormcode.ser and stormconf.ser files. The topology is running locally for testing, so I'm not launching any separate supervisor daemon. It just seems like it never even attempted to create the resources directory (but it successfully created all the ancestor directories), as the folder isn't really locked down at all.

P. Taylor: I get what you're implying, but Eclipse is being run as an administrator already, and I am debugging the topology locally straight out of Eclipse. It seems bizarre that there would be permissions issues on a folder that the project itself created.

On Mon, Mar 10, 2014 at 6:16 PM, Derek Dagit der...@yahoo-inc.com wrote: Two quick thoughts: - Can you cd to 'C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources' from the shell as yourself? - What are the permissions on that directory? Is the supervisor daemon running as another user? -- Derek

On 3/10/14, 17:05, P. Taylor Goetz wrote: I don't have access to a windows machine at the moment, but does this help? http://support.microsoft.com/kb/832434

On Mar 10, 2014, at 4:51 PM, Chris James chris.james.cont...@gmail.com wrote: Reposting since I posted this before at a poor time and got no response. I'm trying out a storm project built from scratch in Java, but with a Python bolt. I have everything running with all Java spouts/bolts just fine, but when I try to incorporate a python bolt I am running into issues. 
I have my project separated into /storm/ for topologies, /storm/bolts/ for bolts, /storm/spouts/ for spouts, and /storm/multilang/ for the multilang wrappers. Right now the only thing in /storm/multilang/ is storm.py, copied and pasted from the storm-starter project. In my bolts folder, I have a dummy bolt set up that just prints the tuple. I've virtually mimicked the storm-starter WordCountTopology example for using a python bolt, so I think the code is OK and the configuration is the issue.

So my question is simple: what configuration steps do I have to set up so that my topology knows where to look to find storm.py when I run super("python", "dummypythonbolt.py")? I noticed an error in the stack trace claiming that it could not run python (python is definitely on my path and I use it every day), and that it is looking in a resources folder that does not exist. Here is the line in question:

Caused by: java.io.IOException: Cannot run program "python" (in directory "C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources"): CreateProcess error=267, The directory name is invalid

A more extensive stack trace is here: http://pastebin.com/6yx97m0M So once again: what is the configuration step that I am missing to allow my topology to see storm.py and be able to run multilang spouts/bolts in my topology? Thanks!
Re: java.lang.OutOfMemoryError: Java heap space in Nimbus
Yes, set 'nimbus.childopts: -Xmx?' in your storm.yaml, and restart nimbus. If unset, I believe the default is -Xmx1024m, for a max of 1024 MB heap. You can set it to -Xmx2048m, for example, to have a max heap size of 2048 MB. Set this on the node that runs nimbus, not in your topology conf.

-- Derek On 3/10/14, 14:19, shahab wrote:

Hi, I am facing an OutOfMemoryError: Java heap space exception in Nimbus while running in cluster mode. I just wonder what are the possible JVM or Storm options that I can set to overcome this problem? I am running a storm topology in cluster mode where all servers (zookeeper, nimbus, supervisor and worker) are on one machine. The topology that I use is as follows:

conf.setMaxSpoutPending(2000); // maximum number of pending messages at spout
conf.setNumWorkers(4);
conf.put(Config.STORM_ZOOKEEPER_CONNECTION_TIMEOUT, 12);
conf.setMaxTaskParallelism(2);

but I get the following Exception in the Nimbus log file:

java.lang.OutOfMemoryError: Java heap space
	at org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:271)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:219)
	at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1136)
	at backtype.storm.daemon.nimbus$read_storm_topology.invoke(nimbus.clj:305)
	at backtype.storm.daemon.nimbus$compute_executors.invoke(nimbus.clj:407)
	at backtype.storm.daemon.nimbus$compute_executor__GT_component.invoke(nimbus.clj:420)
	at backtype.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:315)
	at backtype.storm.daemon.nimbus$mk_assignments$iter__3416__3420$fn__3421.invoke(nimbus.clj:636)
	at clojure.lang.LazySeq.sval(LazySeq.java:42)
	at clojure.lang.LazySeq.seq(LazySeq.java:60)
	at clojure.lang.RT.seq(RT.java:473)
	at clojure.core$seq.invoke(core.clj:133)
	at clojure.core.protocols$seq_reduce.invoke(protocols.clj:30)
	at clojure.core.protocols$fn__5875.invoke(protocols.clj:54)
	at clojure.core.protocols$fn__5828$G__5823__5841.invoke(protocols.clj:13)
	at clojure.core$reduce.invoke(core.clj:6030)
	at clojure.core$into.invoke(core.clj:6077)
	at backtype.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:635)
	at clojure.lang.RestFn.invoke(RestFn.java:410)
	at backtype.storm.daemon.nimbus$fn__3592$exec_fn__1228__auto3593$fn__3598$fn__3599.invoke(nimbus.clj:872)
	at backtype.storm.daemon.nimbus$fn__3592$exec_fn__1228__auto3593$fn__3598.invoke(nimbus.clj:871)
	at backtype.storm.timer$schedule_recurring$this__1776.invoke(timer.clj:69)
	at backtype.storm.timer$mk_timer$fn__1759$fn__1760.invoke(timer.clj:33)
	at backtype.storm.timer$mk_timer$fn__1759.invoke(timer.clj:26)
	at clojure.lang.AFn.run(AFn.java:24)
	at java.lang.Thread.run(Thread.java:744)

2014-03-10 20:10:02 NIOServerCnxn [ERROR] Thread Thread[pool-4-thread-40,5,main] died
java.lang.OutOfMemoryError: Java heap space
	at org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:271)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:219)
	at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1136)
	at backtype.storm.daemon.nimbus$read_storm_topology.invoke(nimbus.clj:305)
	at backtype.storm.daemon.nimbus$fn__3592$exec_fn__1228__auto__$reify__3605.getTopologyInfo(nimbus.clj:1066)
	at backtype.storm.generated.Nimbus$Processor$getTopologyInfo.getResult(Nimbus.java:1481)
	at backtype.storm.generated.Nimbus$Processor$getTopologyInfo.getResult(Nimbus.java:1469)
	at org.apache.thrift7.ProcessFunction.process(ProcessFunction.java:32)
	at org.apache.thrift7.TBaseProcessor.process(TBaseProcessor.java:34)
	at org.apache.thrift7.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:632)
	at org.apache.thrift7.server.THsHaServer$Invocation.run(THsHaServer.java:201)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

2014-03-10 20:10:02 util [INFO] Halting process: (Error when processing an event)

best, /Shahab
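Derek's suggestion at the top of this message amounts to a one-line storm.yaml entry on the nimbus node; a sketch with an illustrative heap size:

```yaml
# storm.yaml on the node that runs nimbus -- raise nimbus's max heap
# (2048 MB is only an example value; size it to your topology metadata)
nimbus.childopts: "-Xmx2048m"
```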
Re: Error when trying to use multilang in a project built from scratch (not storm-starter)
Two quick thoughts:

- Can you cd to 'C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources' from the shell as yourself?
- What are the permissions on that directory? Is the supervisor daemon running as another user?

-- Derek On 3/10/14, 17:05, P. Taylor Goetz wrote: I don't have access to a windows machine at the moment, but does this help? http://support.microsoft.com/kb/832434

On Mar 10, 2014, at 4:51 PM, Chris James chris.james.cont...@gmail.com wrote:

Reposting since I posted this before at a poor time and got no response. I'm trying out a storm project built from scratch in Java, but with a Python bolt. I have everything running with all Java spouts/bolts just fine, but when I try to incorporate a python bolt I am running into issues. I have my project separated into /storm/ for topologies, /storm/bolts/ for bolts, /storm/spouts/ for spouts, and /storm/multilang/ for the multilang wrappers. Right now the only thing in /storm/multilang/ is storm.py, copied and pasted from the storm-starter project. In my bolts folder, I have a dummy bolt set up that just prints the tuple. I've virtually mimicked the storm-starter WordCountTopology example for using a python bolt, so I think the code is OK and the configuration is the issue.

So my question is simple: what configuration steps do I have to set up so that my topology knows where to look to find storm.py when I run super("python", "dummypythonbolt.py")? I noticed an error in the stack trace claiming that it could not run python (python is definitely on my path and I use it every day), and that it is looking in a resources folder that does not exist. Here is the line in question:

Caused by: java.io.IOException: Cannot run program "python" (in directory "C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources"): CreateProcess error=267, The directory name is invalid

A more extensive stack trace is here: http://pastebin.com/6yx97m0M So once again: what is the configuration step that I am missing to allow my topology to see storm.py and be able to run multilang spouts/bolts in my topology? Thanks!
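For context (my note, hedged): in storm-starter the multilang scripts are packaged so that they end up under a top-level resources/ directory inside the topology jar, which is the directory Storm extracts and uses as the working directory for the shell subprocess; a missing resources/ entry in the jar matches the error above. A sketch of a layout that should produce that, assuming a standard Maven build (project and file names are illustrative):

```
my-topology/
  src/main/java/...                 # Java spouts/bolts and the topology class
  src/main/resources/resources/     # becomes resources/ at the root of the jar
    storm.py
    dummypythonbolt.py
```

After packaging, `jar tf` on the topology jar should show resources/storm.py; if it does not, the scripts were not picked up by the build.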
Re: [RELEASE] Apache Storm 0.9.1-incubating released (defaults.yaml)
The defaults.yaml file is part of the source distribution and is packaged into storm's jar when deployed. In a storm cluster deployment, it is not meant to be on the file system in ${storm.home}/conf. Perhaps you are pointing to your source working tree as storm home? -- Derek On 2/26/14, 5:59, Lajos wrote: Quick question on this: defaults.yaml is in both conf and storm-core.jar, so the first time you start nimbus 0.9.1 you get this message: java.lang.RuntimeException: Found multiple defaults.yaml resources. You're probably bundling the Storm jars with your topology jar. [file:/scratch/projects/apache-storm-0.9.1-incubating/conf/defaults.yaml, jar:file:/scratch/projects/apache-storm-0.9.1-incubating/lib/storm-core-0.9.1-incubating.jar!/defaults.yaml] at backtype.storm.utils.Utils.findAndReadConfigFile(Utils.java:133) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating] ... Shouldn't conf/defaults.yaml be called like conf/defaults.yaml.copy or something? I like that it is in the conf directory, because now I can easily see all the config options instead of having to go to the source directory. But it shouldn't prevent startup ... Thanks, Lajos On 22/02/2014 21:09, P. Taylor Goetz wrote: The Storm team is pleased to announce the release of Apache Storm version 0.9.1-incubating. This is our first Apache release. Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. You can read more about Storm on the project website: http://storm.incubator.apache.org Downloads of source and binary distributions are listed in our download section: http://storm.incubator.apache.org/downloads.html Distribution artifacts are available in Maven Central at the following coordinates: groupId: org.apache.storm artifactId: storm-core version: 0.9.1-incubating The full list of changes is available here[1]. Please let us know [2] if you encounter any problems. Enjoy! 
[1]: http://s.apache.org/Ki0 (CHANGELOG) [2]: https://issues.apache.org/jira/browse/STORM
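The duplicate-resource check that produced Lajos's exception can be reproduced with a small classpath probe. A minimal sketch, assuming only standard ClassLoader behavior (the class name FindConfig is mine, not Storm's; Storm's actual check lives in backtype.storm.utils.Utils.findAndReadConfigFile):

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FindConfig {
    // Mirror the check Storm performs at startup: scan the classpath for
    // every copy of a named resource and fail fast if more than one is found.
    public static List<URL> findResources(String name) {
        try {
            return Collections.list(
                    Thread.currentThread().getContextClassLoader().getResources(name));
        } catch (IOException e) {
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        List<URL> found = findResources("defaults.yaml");
        if (found.size() > 1) {
            // This is the situation behind the "Found multiple defaults.yaml
            // resources" RuntimeException quoted above.
            throw new RuntimeException("Found multiple defaults.yaml resources: " + found);
        }
        System.out.println(found.isEmpty()
                ? "defaults.yaml not on classpath"
                : "single copy at " + found.get(0));
    }
}
```

Running this inside a deployment where both ${storm.home}/conf and storm-core.jar are on the classpath should show both URLs, which is why the stray conf copy prevents startup.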
Re: How to specify worker.childopts for a specified topology?
Try this:

conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, WORKER_OPTS);

Your WORKER_OPTS should be appended to WORKER_CHILDOPTS. -- Derek

On 2/18/14, 1:47, Link Wang wrote:
Dear all, I want to specify some worker.childopts for my topology inside its code, so I used: conf.put(Config.WORKER_CHILDOPTS, WORKER_OPTS); but I found it doesn't work. I don't use the storm.yaml file to set worker.childopts, because the memory requirements of my topologies differ widely. Has anyone encountered the same problem?
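The distinction Derek draws is between the cluster-wide key "worker.childopts" (Config.WORKER_CHILDOPTS, set in storm.yaml) and the per-topology key "topology.worker.childopts" (Config.TOPOLOGY_WORKER_CHILDOPTS, set on the conf you submit). A Storm Config is essentially a Map<String, Object>, so a plain Map can stand in for it in this sketch (a hedged illustration, not Storm's own code; the -Xmx value is a made-up example):

```java
import java.util.HashMap;
import java.util.Map;

public class ChildOptsConf {
    // String value of Config.TOPOLOGY_WORKER_CHILDOPTS: JVM options added
    // for this topology's workers, appended after the cluster-wide
    // worker.childopts from storm.yaml.
    static final String TOPOLOGY_WORKER_CHILDOPTS = "topology.worker.childopts";

    public static Map<String, Object> confWithWorkerOpts(String opts) {
        Map<String, Object> conf = new HashMap<>();
        conf.put(TOPOLOGY_WORKER_CHILDOPTS, opts);
        return conf;
    }

    public static void main(String[] args) {
        // e.g. give only this topology's workers a larger heap
        Map<String, Object> conf = confWithWorkerOpts("-Xmx2048m");
        System.out.println(conf.get(TOPOLOGY_WORKER_CHILDOPTS)); // prints "-Xmx2048m"
    }
}
```

With the real API you would pass a backtype.storm.Config built this way to StormSubmitter.submitTopology, so each topology carries its own memory settings without touching storm.yaml.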
Re: Storm 0.9.0.1 and Zookeeper 3.4.5 hung issue.
Some changes to storm code are necessary for this. See https://github.com/apache/incubator-storm/pull/29/files -- Derek

On 2/14/14, 11:50, Saurabh Agarwal (BLOOMBERG/ 731 LEXIN) wrote:
Thanks Bijoy for the reply. We can't downgrade to 3.3.3, as our system has a zookeeper 3.4.5 server running, and we would like to keep the same version of the zookeeper client to avoid any incompatibility issues. The error we are getting with 3.4.5 is:

Caused by: java.lang.ClassNotFoundException: org.apache.zookeeper.server.NIOServerCnxn$Factory

Looking at the zookeeper code, the static Factory class within NIOServerCnxn was removed in the 3.4.5 version. Zookeeper version 3.3.3 is 3 years old. Shouldn't Storm's code be updated to run with the latest zookeeper version? Should I create a jira for this?

----- Original Message -----
From: user@storm.incubator.apache.org To: SAURABH AGARWAL (BLOOMBERG/ 731 LEXIN), user@storm.incubator.apache.org At: Feb 14 2014 11:45:50

Hi, We had also downgraded zookeeper from 3.4.5 to 3.3.3 due to issues with Storm, but we are not facing any issues related to Kafka after the downgrade. We are using Storm 0.9.0-rc2 and Kafka 0.8.0. Thanks, Bijoy

On Fri, Feb 14, 2014 at 9:57 PM, Saurabh Agarwal (BLOOMBERG/ 731 LEXIN) sagarwal...@bloomberg.net wrote:
Hi, A Storm 0.9.0.1 client linked with the zookeeper 3.4.5 library hangs on zookeeper initialization. Is this a known issue?

453 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - init - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir /tmp/7b520ac7-ff87-4eb6-9fc5-3a16deec0272/version-2 snapdir /tmp/7b520ac7-ff87-4eb6-9fc5-3a16deec0272/version-2

The client works fine with zookeeper 3.3.3. As we are using storm with kafka: kafka does not work with zookeeper 3.3.3 but works with 3.4.5. Any help is appreciated... Thanks, Saurabh.
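The incompatibility above comes from ZooKeeper renaming its in-process server factory: 3.3.x exposed the inner class NIOServerCnxn$Factory, while 3.4.x replaced it with the top-level class NIOServerCnxnFactory, so code compiled against one name throws ClassNotFoundException on the other. A hedged sketch of a reflective probe to report which API the classpath provides (the class name ZkCompatCheck is mine):

```java
public class ZkCompatCheck {
    // Check whether a class is loadable without initializing it; a false
    // result corresponds to the ClassNotFoundException in the stack trace.
    static boolean classPresent(String name) {
        try {
            Class.forName(name, false, ZkCompatCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        boolean pre34 = classPresent("org.apache.zookeeper.server.NIOServerCnxn$Factory");
        boolean post34 = classPresent("org.apache.zookeeper.server.NIOServerCnxnFactory");
        System.out.println(pre34 ? "ZooKeeper 3.3.x-style API on classpath"
                : post34 ? "ZooKeeper 3.4.x-style API on classpath"
                : "no ZooKeeper server classes on classpath");
    }
}
```

The linked pull request resolves this on Storm's side by moving to the 3.4.x class name, which is why patching Storm (rather than downgrading the ZooKeeper client) is the viable route here.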
Re: Need to set worker environment variables or system properties
Yeah, use topology.worker.childopts when submitting. I believe it is appended to the cluster's worker.childopts. -- Derek

On 11/20/13 15:28, Tom Brown wrote:
Is there a way to manage the environment variables or system properties of each worker on a topology-by-topology basis? I would like to include a library with my code, but the library only supports configuration through environment variables or system properties. Thanks in advance! --Tom
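Since topology.worker.childopts contributes JVM options to each worker's command line, system properties (-D flags) are the natural fit for Tom's library: a flag set at submit time becomes visible to System.getProperty inside the worker. A sketch of the worker-side half, assuming a topology submitted with something like conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Dmy.lib.endpoint=http://localhost:8080") — the property name my.lib.endpoint is a made-up example:

```java
public class WorkerPropDemo {
    // Library-style lookup of a system property that the worker JVM would
    // have been launched with via topology.worker.childopts.
    public static String endpoint() {
        return System.getProperty("my.lib.endpoint", "unset");
    }

    public static void main(String[] args) {
        // Simulate the -D flag the worker JVM would carry.
        System.setProperty("my.lib.endpoint", "http://localhost:8080");
        System.out.println(endpoint()); // prints "http://localhost:8080"
    }
}
```

Environment variables, by contrast, cannot be set through childopts (it only shapes the JVM's arguments), so a library that strictly requires env vars would need cluster-level configuration instead.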