Re: What's the best way to guarantee external delivery of messages with Storm

2014-09-26 Thread Derek Dagit
Will the HTTP event sink respond with some acknowledgement that it 
received whatever was sent?


If so, could this be as simple as telling your bolt not to ack the tuple 
until this response is received from the HTTP service?
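A minimal sketch of that pattern (the bolt class, endpoint URL, and payload handling below are purely illustrative; it assumes the spout replays failed tuples):

```
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical sink bolt: ack only after the HTTP service confirms receipt,
// fail otherwise so the spout can replay the tuple.
public class HttpSinkBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // placeholder endpoint; the payload is assumed to be the tuple's first field
            URL url = new URL("http://event-sink.example.com/events");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(tuple.getString(0).getBytes("UTF-8"));
            out.close();
            if (conn.getResponseCode() / 100 == 2) {
                collector.ack(tuple);   // sink acknowledged: safe to ack
            } else {
                collector.fail(tuple);  // non-2xx response: let the spout replay
            }
        } catch (Exception e) {
            collector.fail(tuple);      // connection error: let the spout replay
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output streams
    }
}
```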


--
Derek

On 9/26/14 10:10, Peter Neumark wrote:

Thanks for the quick response!
Unfortunately, we're forced to use HTTP.
Any ideas?

On Fri, Sep 26, 2014 at 5:07 PM, Supun Kamburugamuva supu...@gmail.com
wrote:


On Fri, Sep 26, 2014 at 10:49 AM, Peter Neumark peter.neum...@prezi.com
wrote:


Hi all,

We want to replace a legacy custom app with storm, but -being storm
newbies- we're not sure what's the best way to solve the following problem:

An HTTP endpoint returns the list of events which occurred between two
timestamps. The task is to continuously poll this event source for new
events, optionally perform some transformation and aggregation operations
on them, and finally make an HTTP request to an endpoint with some events.

We thought of a simple topology:
1. A clock-spout determines which time interval to process.
2. A bolt takes the time interval as input, and fetches the event list
for that interval from the event source, emitting them as individual tuples.
3. After some processing of the tuples, we aggregate them into fixed size
groups, which we send in HTTP requests to an event sink.

The big question is how to make sure that all events are successfully
delivered to the event sink. I know storm guarantees the delivery of tuples
within the topology, but how could I guarantee that the HTTP requests to
the event sink are also successful (and retried if necessary)?
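Note that Storm's replay guarantee only kicks in if the spout emits tuples with message ids and re-emits on fail(). A sketch of how the clock spout in step 1 could do that, assuming downstream bolts anchor their emits and the sink bolt acks or fails based on the HTTP response (class name and the fixed one-minute interval are illustrative):

```
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical clock spout: emits [from, to) intervals tagged with a message id,
// and re-queues an interval if any tuple derived from it fails downstream.
public class ClockSpout extends BaseRichSpout {
    private static final long INTERVAL_MS = 60 * 1000L;  // illustrative polling interval
    private SpoutOutputCollector collector;
    private long lastEnd;
    private final Queue<Long> retries = new LinkedList<Long>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.lastEnd = System.currentTimeMillis() - INTERVAL_MS;
    }

    @Override
    public void nextTuple() {
        Long start = retries.poll();   // replay failed intervals first
        if (start == null && System.currentTimeMillis() - lastEnd >= INTERVAL_MS) {
            start = lastEnd;
            lastEnd += INTERVAL_MS;
        }
        if (start != null) {
            // the interval start doubles as the message id, so fail() knows what to replay
            collector.emit(new Values(start, start + INTERVAL_MS), start);
        }
    }

    @Override
    public void ack(Object msgId) {
        // interval fully processed and delivered; nothing more to do
    }

    @Override
    public void fail(Object msgId) {
        retries.add((Long) msgId);     // something downstream failed: replay the interval
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("from", "to"));
    }
}
```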



I think this is not so much a question about Storm as a question about how
to deliver a message reliably to some sink. From my experience it is a bit
hard to achieve something like this with HTTP. This functionality is built
in to message brokers like RabbitMQ, ActiveMQ, Kafka, etc., and if you use a
broker to send your events to the sink you can get a delivery guarantee.

Thanks,
Supun..




All help, suggestions and pointers welcome!
Peter

--

*Peter Neumark*
DevOps guy @Prezi http://prezi.com





--
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supu...@gmail.com;  Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com







Re: netty reconnects

2014-09-26 Thread Derek Dagit

This could be https://issues.apache.org/jira/browse/STORM-510

The send thread is blocked on a connection attempt, and so no messages 
get sent out until the connection is re-established or it times out.


--
Derek

On 9/26/14 13:47, Varun Vijayaraghavan wrote:

I first tried increasing the max_retries to a much higher number (300)
but that did not make a difference.

On Fri, Sep 26, 2014 at 2:46 PM, Varun Vijayaraghavan varun@gmail.com wrote:

Hey,

I've been facing the same issues in my topologies. It seems like a
crash in a single worker would trigger a reconnect from other
workers for x amount of time (30 x 10s = ~300 seconds in your case)
before crashing themselves - thus leading to a catastrophic failure
in the topology.

There is a patch in 0.9.3 related to exponential backoff for netty
connections - which may address the issue - but until then I did two
things - a) increase the max_wait_ms to 15000 and b) decrease
supervisor.worker.start.timeout.secs to 30 - so that workers restart
earlier.

On Fri, Sep 26, 2014 at 2:06 PM, Tyson Norris tnor...@adobe.com wrote:

Hi -
We are seeing workers dying and restarting quite a bit,
apparently from netty connection issues.

For example, the log below shows:
* Reconnect for worker at 121:6700
* connection established to 121:6700
* closing connection to 121:6700
* Reconnect started to 121:6700

all within 1 second.

We have netty config updated to:
storm.messaging.netty.max_retries: 30
storm.messaging.netty.max_wait_ms: 1
storm.messaging.netty.min_wait_ms: 1000

And the workers die pretty quickly because often 30 retries do
not end up with a connection.

Any suggestions for how to prevent netty from closing a
connection immediately? I could not see any obvious reason in
the code that this would happen.

Thanks
Tyson

2014-09-26 09:32:03 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6700... [5]
2014-09-26 09:32:04 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6701... [6]
2014-09-26 09:32:11 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.10.180:6701... [6]
2014-09-26 09:32:12 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.10.180:6702... [6]
2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6700... [6]
2014-09-26 09:32:14 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6701... [7]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6700... [7]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established
to a remote host Netty-Client-/10.27.13.121:6700
[id: 0xb8b33bef, /10.27.10.180:33880 => /10.27.13.121:6700]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-/10.27.13.121:6700
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Waiting for pending
batchs to be sent with Netty-Client-/10.27.13.121:6700...,
timeout: 60ms, pendings: 0
2014-09-26 09:32:19 b.s.m.n.Client [INFO] New Netty Client,
connect to 10.27.13.121, 6700, config: , buffer_size: 5242880
2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6700... [0]
2014-09-26 09:32:19 b.s.m.n.Client [INFO] connection established
to a remote host Netty-Client-/10.27.13.121:6700
[id: 0x9dc224e6, /10.27.10.180:33881 => /10.27.13.121:6700]




--
- varun :)




--
- varun :)


Re: secure storm UI

2014-09-26 Thread Derek Dagit
This is available in the security branch.  See 
https://github.com/apache/storm/blob/security/SECURITY.md


You do not need to enable all of the security features to get UI auth.

For authentication, look at ui.filter and ui.filter.params.

For authorization, nimbus.admins, ui.users, logs.users, and topology.users
--
Derek

On 9/26/14 14:34, Kushan Maskey wrote:

Is there a way to secure the Storm UI page, e.g. require a login to
access the page so that only authorized people can access it?

--
Kushan Maskey
817.403.7500
M. Miller & Associates http://mmillerassociates.com/
kushan.mas...@mmillerassociates.com


Re: Please fix the code samples in the documentation

2014-09-02 Thread Derek Dagit

I think this has been pointed out before.  It is being tracked:

https://issues.apache.org/jira/browse/STORM-385
--
Derek

On 9/2/14, 15:34, Andras Hatvani wrote:

Hi,

To the Storm developers: Please fix the code samples in the documentation,
because currently every single one is unformatted, without syntax highlighting,
and collapsed onto a single line.

Thanks in advance,
Andras



Re: [DISCUSS] Apache Storm Release 0.9.3/0.10.0

2014-08-28 Thread Derek Dagit

I am supportive.

I think it makes sense to move to 0.10.0 because of the significance of the 
changes.
--
Derek

On 8/28/14, 15:34, P.Taylor Goetz wrote:

I’d like to gather community feedback for the next two releases of Apache Storm.

0.9.3-incubating will be our next release. Please indicate (by JIRA ticket ID) 
which bug fixes and/or new features you would like to be considered for 
inclusion in the next release. If there is not an existing ticket for a particular
issue or feature, please consider adding one.

For the next and subsequent releases, we will be using a slightly different 
approach than what we did in the past. Instead of voting right away on a build, 
we will make one or more “unofficial” release candidate builds available prior 
to voting on an official release. This will give the Apache Storm community 
more time to discuss, evaluate, identify and fix potential issues before the 
official release. This should enable us to ensure the final release is as bug 
free as possible.

Apache Storm 0.10.0 (STORM-216)
As some of you are aware, the engineering team at Yahoo! has done a lot of work 
to bring security and multi-tenancy to Storm, and has contributed that work 
back to the community. Over the past few months we have been in the process of 
enhancing and syncing that work  with the master branch in a separate branch 
labeled “security.” That work is now nearing completion, and I would like us to 
consider merging it into master after the 0.9.3 release. Since the security 
work includes a large number of changes and enhancements, I propose we bump the 
version number to 0.10.0 for the first release to include those features.

More information about the security branch can be found in this pull request 
[1], as well as the SECURITY.md file in the security branch [2]. I also 
discussed it in a blog post [3] on the Hortonworks website. Please feel free to 
direct any comments or questions about the security branch to the mailing list.

Similar to the process we’ll follow for 0.9.3, we plan to make several 
unofficial “development” builds available for those who would like to help with 
testing the new security features.


-Taylor

[1] https://github.com/apache/incubator-storm/pull/121
[2] https://github.com/apache/incubator-storm/blob/security/SECURITY.md
[3] http://hortonworks.com/blog/the-future-of-apache-storm/



Re: Create multiple supervisors on same node

2014-08-22 Thread Derek Dagit

I also tried another scenario: instead of copying the entire storm home 
directory, I only use one storm home, but different storm-local dir and ports, 
which both are specified in storm.yaml, I can still create multiple 
supervisors. (Of course, every time before I start a new supervisor, I have to 
update the storm.yaml for different storm-local dir and ports).


You will have two supervisors writing to the same log.

I recommend creating two distinct storm home directories unless you have a good 
reason to have them shared.  I think the code assumes it is the only supervisor 
writing in storm home.
--
Derek

On 8/22/14, 14:08, Yu, Tao wrote:

Thanks Harsha!

Just cleaned zookeeper data (stop and re-start zookeeper) and tried again, now 
I can create multiple supervisors successfully!

I also tried another scenario: instead of copying the entire storm home 
directory, I only use one storm home, but different storm-local dir and ports, 
which both are specified in storm.yaml, I can still create multiple 
supervisors. (Of course, every time before I start a new supervisor, I have to 
update the storm.yaml for different storm-local dir and ports).

So my new questions are:

1) What's the best approach to create multiple supervisors on the same node:

 a) each supervisor has its own storm home directory; or
 b) all supervisors share a common storm home directory.

 In both approaches, each supervisor has its own storm-local dir and ports.

2) When starting a supervisor, can we tell storm to use a custom configuration
(.yaml)? For example:

   $bin/storm supervisor   --config  conf/myConfig.yaml

  It seems like storm will always use conf/storm.yaml, and I do not see any
documentation that mentions specifying a custom config file.

Thanks,
-Tao

-Original Message-
From: Harsha [mailto:st...@harsha.io]
Sent: Friday, August 22, 2014 12:57 PM
To: user@storm.incubator.apache.org
Subject: Re: Create multiple supervisors on same node

Tao,
I tried the above steps and I am able to run two supervisors on the
same node. Did you check the logs for the supervisor under storm2? If
it didn't create a local_dir/storm dir then your supervisor
daemon might not be running. Check the logs for any
errors.
-Harsha

On Fri, Aug 22, 2014, at 09:20 AM, Yu, Tao wrote:

Thanks Harsha!

I tried your way, and here is what I have (major parts) in my storm.yaml:

  storm.local.dir: /opt/grid/tao/storm/storm-0.8.2/local_data/storm
  supervisor.slots.ports:
 - 6700
 - 6701

1) I created the 1st supervisor, and I can see the specified sub-folder
local_data/storm/supervisor was created under
/opt/grid/tao/storm/storm-0.8.2.  That's OK!

2) then I copied the entire storm-0.8.2 folder to a new storm2
(/opt/grid/tao/storm/storm2)

3) delete the sub-folder local_data under storm2

4) updated the storm.yaml under storm2 with below change:

  storm.local.dir: /opt/grid/tao/storm/storm2/local_data/storm
  supervisor.slots.ports:
 - 8700
 - 8701

5) under storm2, create a new supervisor.

Then the new supervisor still has the 1st supervisor's ID.  And under
storm2, the sub-folder local_data/storm was not created.

Does storm still use the 1st storm home directory
(storm/storm-0.8.2) local_data folder?

Thanks,
-Tao

-Original Message-
From: Harsha [mailto:st...@harsha.io]
Sent: Friday, August 22, 2014 11:28 AM
To: user@storm.incubator.apache.org
Subject: Re: Create multiple supervisors on same node

Tao,
  you need to delete the storm-local dir under your copied over storm
  dir ( storm2). Otherwise it will still pick up the same
  supervisor-id.
-Harsha

On Fri, Aug 22, 2014, at 08:16 AM, Yu, Tao wrote:

Thanks Derek!

I tried your suggestion, copied the entire storm home directory
(which, in my case, is storm-0.8.2) to a new directory storm2,
then in storm2 directory, I changed the conf/storm.yaml with
different ports, and tried to create a new supervisor. Still, got
the same supervisor ID as the 1st one (which I created from storm-0.8.2 
directory).

Did I do anything incorrectly?

-Tao

-Original Message-
From: Derek Dagit [mailto:der...@yahoo-inc.com]
Sent: Friday, August 22, 2014 11:01 AM
To: user@storm.incubator.apache.org
Subject: Re: Create multiple supervisors on same node

The two supervisors are sharing the same state, and that is how they
get the same randomly-generated ID.

If I recall correctly, the default state directory is created in the
current working directory of the process, so that is whatever
directory you happen to be in when you start the supervisor.

I think probably a good thing to do is copy the entire storm home
directory, change the storm.yaml in the copy to be configured with
different ports as you tried, and make sure to cd into the
appropriate directory when you launch the supervisor.

--
Derek

On 8/22/14, 9:49, Yu, Tao wrote:

Hi all,

Anyone knows what's the requirement

Re: worker always timeout to heartbeat and was restarted by supervisor

2014-08-20 Thread Derek Dagit

1) does it related to GC problem?


This is usually the cause.  As the worker consumes more and more of the heap, 
gc takes longer and longer each time.  Eventually it takes so long that the 
heartbeats to the supervisor do not happen.

There could be a spike or skew in your data such that one or more workers 
cannot handle it with their heap settings.

--
Derek

On 8/20/14, 5:26, DashengJu wrote:

hi, all

In our production environment, we have a topology named
logparser_mobile_nginx; it has 50 workers, the spout has 48 executors,
bolt_parser has 1000 executors, and bolt_saver has 50 executors.

The topology runs normally most of the time, but 1~5 workers restart every
1~2 hours. When we look at the supervisor and worker logs, we find 1) the worker
has no error or exception; 2) the supervisor says the worker did not
heartbeat and a timeout happened.

Because the worker has no log, I do not know why the worker did not heartbeat.
Does anyone have any ideas on how to investigate?
0) Is it caused by the worker process exiting?
1) Is it related to a GC problem?
2) Is it related to a memory problem? If so, I think the JVM would report
a memory exception in the worker log.

By the way, some small topologies work well in the same environment.

below is the supervisor log:
--
2014-08-20 15:51:33 b.s.d.supervisor [INFO]
90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started

2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing
state for id c7e8d375-db76-4e2
9-8019-e783ab3cd6de. Current supervisor time: 1408521676. State:
:timed-out, Heartbeat: #backtype.sto
rm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id
logparser_mobile_nginx-259-1408518
662, :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]},
:port 9714}
2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d3
75-db76-4e29-8019-e783ab3cd6de
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901.
Process is probably already dead
.
2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921.
Process is probably already dead
.
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-d
b76-4e29-8019-e783ab3cd6de

2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing
state for id d5a8d578-89ff-4a5
0-a906-75e847ac63a1. Current supervisor time: 1408521676. State:
:timed-out, Heartbeat: #backtype.sto
rm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id
logparser_nginx-265-1408521077, :
executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d5
78-89ff-4a50-a906-75e847ac63a1
2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068.
Process is probably already dead
.
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-8
9ff-4a50-a906-75e847ac63a1

2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 5154f643-cd79-411
9-9368-153f1bede757. Current supervisor time: 1408521676. State:
:timed-out, Heartbeat: #backtype.sto
rm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id
logparser_mobile_nginx-259-1408518
662, :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]},
:port 9720}
2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f6
43-cd79-4119-9368-153f1bede757
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976.
Process is probably already dead.
2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986.
Process is probably already dead.
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757

2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing
state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time:
1408521676. State: :timed-out, Heartbeat:
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644,
:storm-id app_upload_urls-218-1408503096, :executors #{[8 8] [40 40] [24
24] [-1 -1]}, :port 9713}
2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177.
Process is probably already dead.
2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down
6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba



Re: NoSuchMethorError

2014-08-04 Thread Derek Dagit

I skimmed grepcode, and found that Yaml(BaseConstructor) was available from 
snakeyaml version 1.7 onward.

I would check if a version of snakeyaml <= 1.6 is in your classpath somehow.
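One quick way to see which snakeyaml jar actually wins on the classpath is to print where the Yaml class was loaded from (a small diagnostic sketch, not part of the topology; the class name is illustrative):

```
import org.yaml.snakeyaml.Yaml;

public class WhichSnakeYaml {
    public static void main(String[] args) {
        // Prints the jar (or directory) that Yaml.class was loaded from,
        // which exposes an older snakeyaml shadowing snakeyaml-1.11 on the classpath.
        System.out.println(Yaml.class.getProtectionDomain()
                                     .getCodeSource()
                                     .getLocation());
    }
}
```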

--
Derek

On 8/4/14, 14:34, Ratay, Steve wrote:

I am trying to run a local cluster using Storm 0.9.2, and getting a 
NoSuchMethodError.  I am using Eclipse and have pulled all the Storm 
dependencies into my project.  Most notably, I have the snakeyaml-1.11.jar 
file.  Anyone else seeing this error or know where I've gone wrong?


java.lang.NoSuchMethodError: 
org.yaml.snakeyaml.Yaml.<init>(Lorg/yaml/snakeyaml/constructor/BaseConstructor;)V

at backtype.storm.utils.Utils.findAndReadConfigFile(Utils.java:144) 
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]

at backtype.storm.utils.Utils.readDefaultConfig(Utils.java:167) 
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]

at backtype.storm.utils.Utils.readStormConfig(Utils.java:191) 
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]

at backtype.storm.config$read_storm_config.invoke(config.clj:121) 
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]

at backtype.storm.testing$mk_local_storm_cluster.doInvoke(testing.clj:123) 
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]

at clojure.lang.RestFn.invoke(RestFn.java:421) ~[clojure-1.5.1.jar:na]

at backtype.storm.LocalCluster$_init.invoke(LocalCluster.clj:28) 
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]

at backtype.storm.LocalCluster.<init>(Unknown Source) 
~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]

at analytics.AnalyticsTopology.main(AnalyticsTopology.java:38) ~[classes/:na]

Thanks, Steve



Re: intra-topology SSL transport

2014-07-23 Thread Derek Dagit

In the security branch of storm, worker-worker communication is encrypted 
(blowfish) with a shared secret.

STORM-348 will add authentication to worker-worker.

For thrift (nimbus & drpc), the security branch has SASL/kerberos 
authentication, and you should be able to configure encryption via SASL as well.  
We have not tried enabling encryption with SASL.
--
Derek

On 7/23/14, 14:05, Isaac Councill wrote:

Hi,

I've been working with storm on mesos but I need to make sure all workers
are messaging over SSL since streams may contain sensitive information for
almost all of my use cases.

stunnel seems like a viable option but I dislike having complex port
forwarding arrangements and would prefer code to config in this case.

As an exercise to see how much work it would be, I forked storm and
modified the storm-netty package to use SSL with the existing nio. Not so
bad, and lein tests pass.

Still wrapping my head around the storm codebase. Would using my modified
storm-netty Context as storm.messaging.transport be enough to ensure
streams are encrypted, or would I need to also attack the thrift transport
plugin?

Also, is anyone else interested in locking storm down with SSL?



Re: Measuring a topology performance

2014-07-14 Thread Derek Dagit
 What's the recommended way to measure the avg. time a tuple spends in 
 the topology until it is fully processed?

You can do this with acking enabled.  In the UI, go to a spout and look for 
Complete Latency.

-- 
Derek

On 7/13/14, 7:03, 唐 思成 wrote:
 The UI has a metric called latency, which means how long a bolt takes to process a tuple.
 On Jul 13, 2014, at 5:49 PM, Vladi Feigin vladi...@gmail.com wrote:
 
 Hi All,

 What's the recommended way to measure the avg. time a tuple spends in 
 the topology until it is fully processed?
 We use Storm version 0.8.2 and have the topologies with acks and without.

 Thank you,
 Vladi
 


Re: Storm UI: not displayed Executors

2014-07-11 Thread Derek Dagit

This should be fixed with either STORM-370 (merged) or STORM-379 (pull request 
open).

The script that sorts the table did not check that the list was of non-zero 
size before it attempted to sort, and that resulted in an exception that halted 
subsequent rendering on the page.

You can:
- checkout the latest storm and use that
- cherry-pick commit 31c786c into the version you are using.
--
Derek

On 7/11/14, 5:54, 川原駿 wrote:

Hello.

Recently I upgraded storm to 0.9.2.

In the Component summary page of the Storm UI, the Executors table is not displayed
when the emitted count of the spout/bolt is 0.

Please tell me the solution.

Thanks.



Re: Spout process latency

2014-07-09 Thread Derek Dagit

It should be a windowed average measure of the time between when the component 
receives a tuple and when it acks the tuple.

This can be slower if there is batching, aggregating, or joining happening (the 
component must wait for a number of other tuples to arrive before it can ack).
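To illustrate with the plain Storm API (not Trident; the class name and batch size are only illustrative): a bolt that batches its input cannot ack a tuple until the whole batch is flushed, so every tuple's process latency includes the wait for the rest of the batch.

```
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Illustrative batching bolt: acks are deferred until the batch is emitted,
// which inflates the measured process latency of every tuple in the batch.
public class BatchingBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 100;
    private OutputCollector collector;
    private List<Tuple> pending;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.pending = new ArrayList<Tuple>();
    }

    @Override
    public void execute(Tuple tuple) {
        pending.add(tuple);
        if (pending.size() >= BATCH_SIZE) {
            List<Object> batch = new ArrayList<Object>(pending.size());
            for (Tuple t : pending) {
                batch.add(t.getValue(0));
            }
            // anchor the batch emit to every input tuple it came from
            collector.emit(pending, new Values(batch));
            for (Tuple t : pending) {
                collector.ack(t);   // only now does each input's process latency stop
            }
            pending.clear();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("batch"));
    }
}
```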

On the UI, there are tool tips that explain the measurements.  They appear 
after hovering over the label.
--
Derek

On 7/9/14, 15:22, Raphael Hsieh wrote:

Can somebody explain to me what might cause the spout to have a large
process latency? Currently my spout0 and $spoutcoord-spout0 have latencies
higher than I would like.
I'm consuming data from a Kafka stream.

How is this process latency measured ?
Is it measuring the amount of time it takes to fill a batch with data and
send it to the first bolt in the topology?

Thanks



Re: what does each field of storm UI mean?

2014-06-18 Thread Derek Dagit

Adrian,

If you hover over the title of the field, there should appear a pop-up to 
explain what it means.

--
Derek

On 6/17/14, 21:05, 이승진 wrote:

Dear storm users

I want to see the performance of each bolt and decide on the degree of parallelism.
In the Storm UI there are several fields which are confusing, so I would be glad if 
you could explain them.
Capacity (last 10m) - is this the average capacity per second over the last 10 minutes for a 
single executor? For example, if Capacity is 1.2, does that mean a single executor 
processed 1.2 messages per second on average?
Execute latency and Process latency - are these average values or the values of the last 
processed message? What is the difference between them? And what is the 
difference between them and Capacity?

Sincerly,
Adrian SJ Lee





Re: Supervisor kills *all* workers for topology due to heartbeat :timed-out state

2014-06-16 Thread Derek Dagit

:timed-out means that the worker did not heartbeat to the supervisor in time. 
(This heartbeat happens via the local disk.)

Check that your workers have enough jvm heap space.  If not, garbage collection 
for the JVM will cause progressively slower heartbeats until the supervisor 
thinks they are dead and kills them.

topology.worker.childopts=-Xmx{{VALUE}} e.g. 2048m or 2g
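If you would rather set this per topology at submit time than in storm.yaml, a sketch (the class name and the 2g value are only examples):

```
import backtype.storm.Config;

public class WorkerHeapExample {
    // Sketch: raise the worker heap for a single topology at submission time,
    // as an alternative to setting topology.worker.childopts in storm.yaml.
    public static Config withLargerWorkerHeap() {
        Config conf = new Config();
        // 2g is just an example; size the heap to the topology's actual needs.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");
        return conf;
    }
}
```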

--
Derek

On 6/14/14, 22:39, Justin Workman wrote:

 From what I have seen, if nimbus kills and reassigns the worker process,
the supervisor logs will report that the worker is in a disallowed state.

I have seen the supervisor report the worker in a timed out state and
restart the worker processes, generally when the system is under heavy CPU
load. We recently ran into this issue while running a topology on virtual
machines. Increasing the number of virtual cores assigned to the vm's
resolved the restart issues.

Thanks
Justin

Sent from my iPhone

On Jun 14, 2014, at 11:32 AM, Andrew Montalenti and...@parsely.com wrote:

I am trying to understand why for a topology I am trying to run on
0.9.1-incubating, the supervisor on the machine is killing *all* of the
topology's Storm workers periodically.

Whether I use topology.workers=1,2,4, or 8, I always get logs like this:

https://gist.github.com/amontalenti/cd7f380f716f1fd17e1b

Which basically indicates that the supervisor thinks all the workers timed
out at exactly the same time, and then it kills them all.

I've tried tweaking the worker timeout seconds, bumping it up to e.g. 120
secs, but this hasn't helped at all. No matter what, periodically, the
workers just get whacked by the supervisor and the whole topology has to
restart.

I notice that this does happen less frequently if the machine is under less
load, e.g. if I drop topology.max.spout.pending *way* down, to e.g. 100 or
200, then it runs for awhile without crashing. But I've even seen it crash
in this state.

I saw on some other threads that people indicated that the supervisor will
kill all workers if the nimbus fails to see a heartbeat from zookeeper.
Could someone walk me through how I could figure out if this is the case?
Nothing in the logs seems to point me in this direction.

Thanks!

Andrew



Re: [VOTE] Storm Logo Contest - Final Round

2014-06-09 Thread Derek Dagit

#6 - 2pt.
#9 - 2pts.
#10 - 1pt.

--
Derek

On 6/9/14, 13:38, P. Taylor Goetz wrote:

This is a call to vote on selecting the winning Storm logo from the 3 finalists.

The three candidates are:

  * [No. 6 - Alec 
Bartos](http://storm.incubator.apache.org/2014/04/23/logo-abartos.html)
  * [No. 9 - Jennifer 
Lee](http://storm.incubator.apache.org/2014/04/29/logo-jlee1.html)
  * [No. 10 - Jennifer 
Lee](http://storm.incubator.apache.org/2014/04/29/logo-jlee2.html)

VOTING

Each person can cast a single vote. A vote consists of 5 points that can be 
divided among multiple entries. To vote, list the entry number, followed by the 
number of points assigned. For example:

#1 - 2 pts.
#2 - 1 pt.
#3 - 2 pts.

Votes cast by PPMC members are considered binding, but voting is open to 
anyone. In the event of a tie vote from the PPMC, votes from the community will 
be used to break the tie.

This vote will be open until Monday, June 16 11:59 PM UTC.

- Taylor



Re: Workers constantly restarted due to session timeout

2014-06-03 Thread Derek Dagit

1) Is it appropriate to run Zookeeper in parallel on the same node with the 
storm services?


I recommend keeping them separate, and even then pointing ZK storage to a path on its own disk 
device if possible.  ZK is a bottleneck for storm, and when it is too slow lots 
of bad things can happen.

Some folks use shared hosts (with or without VMs) in which to run ZK.  In those 
situations, VMs or processes owned by other users doing unrelated things can 
put load on the disk, and that will dramatically slow down ZK.



2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its 
client. Should we downgrade our installation?


I am not sure about that, since we've been running with ZK 3.4.5 in storm (and 
on the server).  It might work very well, but I have not tried it.  I do not 
remember if anyone on this list has identified any issues with this 3.3.3 + 
3.4.5 combo.


One setting we changed to dramatically improve performance with ZK was setting 
the system property '-Dzookeeper.forceSync=no' on the server.

Normally, ZK will sync to disk on every write, and that causes two seeks: one 
for the data and one for the data log.  It gets really expensive with all of 
the workers heartbeating in through ZK.  Be warned that with only on ZK server, 
an outage could leave you in an inconsistent state.

You might check to see if the ZK server is keeping up.  There are tools like 
iotop that can give information about disk load.

--
Derek

On 6/3/14, 13:14, Michael Dev wrote:




Thank you Derek for the explanation between :disallowed and :timed-out. That was 
extremely helpful in understanding what decisions Storm is making. I increased the 
timeouts for both messages to 5 minutes and returned the zookeeper session timeouts to 
their default values. This made it plain to see periods in time where the 
Uptime column for the busiest component's Worker would not update (1-2 
minutes, potentially never resulting in a worker restart).

ZK logs report constant disconnects and reconnects while the Uptime is not 
updating:
16:28:30,440 - INFO NIOServerCnxn@1001 - Closed socket connection for client 
/10.49.21.151:54004 which has sessionid 0x1464f1fddc1018f
16:31:18,364 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from 
/10.49.21.151:34419
16.31:18,365 - WARN ZookeeperServer@793 - Connection request from old client 
/10.49.21.151:34419; will be dropped if server is in r-o mode
16:31:18,365 - INFO ZookeeperServer@832 - Client attempting to renew session 
0x264f1fddc4021e at /10.49.21.151:34419
16:31:18,365 - INFO Learner@107 - Revalidating client: 0x264f1fddc4021e
16:31:18,366 - INFO ZooKeeperServer@588 - Invalid session 0x264f1fddc4021e for 
client /10.49.21.151:34419, probably expired
16:31:18,366 - NIOServerCnxn@1001 - Closed socket connection for client 
/10.49.21.151:34419 which had sessionid 0x264f1fddc4021e
16:31:18,378 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from  
/10.49.21.151:34420
16:31:18,391 - WARN ZookeeperServer@793 - Connection request from old
client /10.49.21.151:34420; will be dropped if server is in r-o mode
16:31:18,392 - INFO ZookeeperServer@839 - Client attempting to establish new 
session at /10.49.21.151:34420
16:31:18,394 - INFO ZookeeperServer@595 - Established session 0x1464fafddc10218 
with negotiated timeout 2 for client /10.49.21.151:34420
16.31.44,002 - INFO NIOServerCnxn@1001 - Closed socket connection for 
/10.49.21.151:34420 which had sessionid 0x1464fafddc10218
16.32.48,055 - INFO NIOServerCxnFactory@197 - Accepted socket connection from 
/10.49.21.151:34432
16:32:48,056 - WARN ZookeeperServer@793 - Connection request from old
client /10.49.21.151:34432; will be dropped if server is in r-o mode
16.32.48,056 - INFO ZookeeperServer@832 - Client attempting to renew session 
0x2464fafddc4021f at /10.49.21.151:34432
16:32:48,056 - INFO Learner@107 - Revalidating client: 0x2464fafddc4021f
16:32:48,057 - INFO ZooKeeperServer@588 - Invalid session 0x2464fafddc4021f for 
client /10.49.21.151:34432, probably expired
16:32:48,057 - NIOServerCnxn@1001 - Closed socket connection for client
/10.49.21.151:34432 which had sessionid 0x2464fafddc4021f
...etc until Storm has had enough and restarts the worker resulting in this
16:47:20,706 - NIOServerCnxn@349 - Caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 
0x3464f20777e01cf, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)

1) Is it appropriate to run Zookeeper in parallel on the same node with the 
storm services?
2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its 
client. Should we downgrade our installation?


Date: Sat, 31 May 2014 13:50:57 -0500
From: der...@yahoo-inc.com
To: user@storm.incubator.apache.org
Subject: Re: Workers 

Re: Workers constantly restarted due to session timeout

2014-05-31 Thread Derek Dagit

Are you certain that nimbus.task.timeout.secs is the correct config?


That config controls the length of time before nimbus thinks a worker has timed 
out.

https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L369-L372

Its default is 30 seconds.

https://github.com/apache/incubator-storm/blob/master/conf/defaults.yaml#L45



storm.zookeeper.connection.timeout: 30
storm.zookeeper.session.timeout: 30


So these will make the situation worse while workers are losing connections to ZK, 
since they will cause the workers to wait longer before reconnecting.  They could 
wait until nimbus thinks the worker is dead before trying to reconnect.



supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for 
id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: 
:disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 
1400876249, :storm-id test-46-1400863199, :executors #{[-1 -1]}, :port 6700}


Here if the State is :disallowed, then that means it is Nimbus that de-scheduled the 
worker on that node--very probably in this case because it thought it was dead.  When the supervisor sees 
this, it will kill the worker.  (A state of :timed-out means instead that the worker did not 
heartbeat to its supervisor in time.)

If the CPU load on the worker was high enough to prevent heartbeats, then I 
would expect to see :timed-out state above instead of :disallowed.  The reason 
is that the worker has only 5 seconds to do those heartbeats, while it has 30 
seconds to heartbeat to nimbus (via ZK).  (More often what happens to cause 
this is memory has run out and garbage collection stops everything just long 
enough.)

The real question is why connections from the worker to ZK are timing out in 
the first place.

What about the ZK servers?  Sometimes ZooKeeper servers cannot keep up, and 
that causes pretty severe problems with timeouts.
--
Derek

On 5/30/14, 17:51, Michael Dev wrote:




Michael R,

We don't have GC logging enabled yet. I lean towards agreeing with Derek that I 
don't think it's the issue but I will take a look at logging on Monday just to 
verify.

Derek D,

Are you certain that nimbus.task.timeout.secs is the correct config? Tracing 
through the github code it would seem to me that this is used as the timeout 
value when making a Thrift connection to the Nimbus node. I thought the logs 
indicated the timeout was occurring in the session connection to zookeeper as 
evidenced by ClientCxn being a Zookeeper class.

I discovered that we were running with the default maxSessionTimeout zookeeper 
config of 40 seconds. This would explain why our storm config of 5 minutes was 
not being picked up (but obviously not the root problem nor why timeout 
messages report 14-20 second timeout values). Typically we saw losses in 
connection occur when our cluster becomes super busy with a burst of data 
pushing workers to near 100% CPU. I'm testing the following configs over the 
weekend to see if they at least allow us to prevent chronic worker restarting 
during the brief high CPU periods.

Our current setup is as follows:
Storm 0.9.0.1
3 Storm node cluster
1 Supervisor running per Storm node
1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
3 Workers per topology
Variable number of executors per component depending on how slow that component 
is. Example file i/o has many executors (say 12) while in memory validation has 
only 3 executors. Always maintaining a multiple of the number of workers for 
even distribution.
Kyro serialization with Java Serialization failover disabled to ensure we're 
using 100% kryo between bolts.

zoo.cfg
tickTime=2000
dataDir=/srv/zookeeper/data
clientPort=2182
initLimit=5
syncLimit=2
skipACL=true
maxClientCnxns=1000
maxSessionTimeout=30
server.1=node1
server.2=node2
server.3=node3

storm yaml
storm.zookeeper.port: 2182
storm.local.dir=/srv/storm/data
nimbus.host: node1
storm.zookeeper.servers:
  - node1
  - node2
  - node3
supervisor.slot.ports:
  - 6700
  - 6701
  - 6702
  - 6703
  - 6704
java.library.path: /usr/lib:/srv/storm/lib
#Storm 0.9 netty support
storm.messaging.transport: backtype.storm.messaging.netty.Context
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
# Timeout band-aids in testing
topology.receiver.buffer.size: 2
storm.zookeeper.connection.timeout: 30
storm.zookeeper.session.timeout: 30



Date: Thu, 29 May 2014 12:56:19 -0500
From: der...@yahoo-inc.com
To: user@storm.incubator.apache.org
Subject: Re: Workers constantly restarted due to session timeout

OK, so GC is probably not the issue.


Specifically, this is a connection timeout to ZK from the worker, and it is 
resulting in nimbus 

Re: Workers constantly restarted due to session timeout

2014-05-23 Thread Derek Dagit

2) Is this expected behavior for Storm to be unable to keep up with heartbeat 
threads under high CPU or is our theory incorrect?


Check your JVM max heap size (-Xmx).  If you use too much, the JVM will 
garbage-collect, and that will stop everything--including the thread whose job 
it is to do the heartbeating.



--
Derek

On 5/23/14, 15:38, Michael Dev wrote:

Hi all,

We are seeing our workers constantly being killed by Storm with the 
following logs:
worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, 
have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, 
closing socket and attempting reconnect
supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for 
id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: 
:disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 
1400876249, :storm-id test-46-1400863199, :executors #{[-1 -1]}, :port 6700}

Eventually Storm decides to just kill the worker and restart it as you see in 
the supervisor log. We theorize this is the Zookeeper heartbeat thread and it 
is being choked out due to very high CPU load on the machine (near 100%).

I have increased the connection timeouts in the storm.yaml config file yet 
Storm seems to continue to use some unknown value for the above client session 
timeout messages:
storm.zookeeper.connection.timeout: 30
storm.zookeeper.session.timeout: 30

1) What timeout config is appropriate for the above timeout  message?
2) Is this expected behavior for Storm to be unable to keep up with heartbeat 
threads under high CPU or is our theory incorrect?

Thanks,
Michael




Re: Test timed out (5000ms)

2014-05-19 Thread Derek Dagit

https://git.corp.yahoo.com/storm/storm/blob/master-security/storm-core/src/clj/backtype/storm/testing.clj#L167

Try changing this.  The time-out was added to prevent the case when a test 
would hang indefinitely.  Five seconds was thought to be more than enough time 
to let tests pass.  If it needs to be longer we could increase it.

If you continue to see the time-out, it could be that the test really is 
hanging somehow.

--
Derek

On 5/19/14, 4:57, Sergey Pichkurov wrote:

Hello, Storm community.



I am trying to write a unit test with

storm.version :0.9.1-incubating

storm-kafka-0.8-plus: 0.4.0



My topology has one Kafka spout and one storing bolt which has Spring
inside (the context is initialized in the prepare() method).

When I run the test with Testing.completeTopology(), I get this error:


java.lang.AssertionError: Test timed out (5000ms)

 at
backtype.storm.testing$complete_topology.doInvoke(testing.clj:475)

 at clojure.lang.RestFn.invoke(RestFn.java:826)

 at
backtype.storm.testing4j$_completeTopology.invoke(testing4j.clj:61)

 at backtype.storm.Testing.completeTopology(Unknown Source)



This error does not always arise; sometimes the test passes successfully.



Where can I change this timeout parameter? Or how can I disable this timeout?



Re: Test timed out (5000ms)

2014-05-19 Thread Derek Dagit

https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/testing.clj#L187

Corrected link.

--
Derek

On 5/19/14, 10:10, Sergey Pichkurov wrote:

I think that 5 sec is not always enough for Spring init.

Your link is not resolving.

*Pichkurov Sergey, Java Developer*



On Mon, May 19, 2014 at 5:18 PM, Derek Dagit der...@yahoo-inc.com wrote:





Try changing this.  The time-out was added to prevent the case when a test
would hang indefinitely.  Five seconds was thought to be more than enough
time to let tests pass.  If it needs to be longer we could increase it.

If you continue to see the time-out, it could be that the test really is
hanging somehow.

--
Derek


On 5/19/14, 4:57, Sergey Pichkurov wrote:


Hello, Storm community.



I am trying to write a unit test with

storm.version :0.9.1-incubating

storm-kafka-0.8-plus: 0.4.0



My topology has one Kafka spout and one storing bolt which has Spring
inside (the context is initialized in the prepare() method).

When I run the test with Testing.completeTopology(), I get this error:


java.lang.AssertionError: Test timed out (5000ms)

  at
backtype.storm.testing$complete_topology.doInvoke(testing.clj:475)

  at clojure.lang.RestFn.invoke(RestFn.java:826)

  at
backtype.storm.testing4j$_completeTopology.invoke(testing4j.clj:61)

  at backtype.storm.Testing.completeTopology(Unknown
Source)



This error does not always arise; sometimes the test passes successfully.



Where can I change this timeout parameter? Or how can I disable this
timeout?






Re: Weirdness running topology on multiple nodes

2014-05-16 Thread Derek Dagit

That is odd.  I have seen things like this happen when there are DNS 
configuration issues, but you have even updated /etc/hosts.


* What does /etc/nsswitch.conf have for the hosts entry?

This is what mine has:
hosts:  files dns

I think that the java resolver code honors this setting, and this will cause it 
to look at /etc/hosts first for resolution.


* Firewall settings could also cause this.  (Pings would work while 
worker-worker communications might not.)


* Failing that, maybe watch network packets to discover what the workers 
are really trying to communicate with?
--
Derek

On 5/7/14, 10:11, Justin Workman wrote:

We have spent the better part of 2 weeks now trying to get a pretty basic
topology running across multiple nodes. I am sure I am missing something
simple but for the life of me I cannot figure it out.

Here is the situation, I have 1 nimbus server and 5 supervisor servers,
with Zookeeper running on the nimbus server and 2 supervisor nodes. These
hosts are all virtual machines 4 CPU's 8GB RAM, running in a OpenStack
deployment. If all of the guests are running on the same physical hypervisor
then the topology starts up just fine and runs without any issues. However,
if we take the guests and spread them out over multiple hypervisors ( in
the same OpenStack cluster ), the topology never really completely starts
up. Things start to run, some messages are pulled off the spout, but
nothing ever makes it all the way through the topology and nothing is ever
ack'd.

In the worker logs we get messages about reconnecting and eventually a
Remote host unreachable error, and Async Loop Died. This used to result in
a NumberFormat exception; reducing the netty retries from 30 to 10 resolved
the NumberFormat error, and now we get the following:

2014-05-07 09:00:51 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:53 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Client is being
closed, and does not take requests any more
 at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:107)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
 at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:78)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
 at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:77)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
 at
backtype.storm.disruptor$consume_loop_STAR_$fn__1577.invoke(disruptor.clj:89)
~[na:na]
 at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433)
~[na:na]
 at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
 at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
Caused by: java.lang.RuntimeException: Client is being closed, and does not
take requests any more
 at backtype.storm.messaging.netty.Client.send(Client.java:125)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
 at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398$fn__4399.invoke(worker.clj:319)
~[na:na]
 at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398.invoke(worker.clj:308)
~[na:na]
 at
backtype.storm.disruptor$clojure_handler$reify__1560.onEvent(disruptor.clj:58)
~[na:na]

And in the supervisor logs we see errors about the workers timing out and
not starting up all the way, we also see executor timeouts in the nimbus
logs. But we do not see any errors in the Zookeeper logs and the Zookeeper
stats look fine.

There do not appear to be any real network issues; I can run a continuous
flood ping between the hosts, with varying packet sizes, with minimal
latency and no dropped packets. I have also attempted to add all hosts to
the local hosts files on each machine without any difference.

I have also played with adjusting the different heartbeat timeouts and
intervals without any luck, and I have also deployed this same setup to a
5 node cluster on physical hardware (24 cores, 64GB RAM and a lot of local
disks), and we had the same issue. The topology would start, but no data ever
made it through the topology.

The only way I have ever been able to get the topology to work 

Re: Topologies are disappearing??? How to debug?

2014-05-01 Thread Derek Dagit

Make sure you do not have a second nimbus daemon running by accident.

I saw this one time after someone had launched nimbus on a different host, yet 
the file system on which nimbus was storing its state was an NFS mount.

It took a comically long time to figure out that the second remote nimbus 
daemon was clearing state as soon as the first local daemon was writing it.
--
Derek

On 5/1/14, 11:59, Software Dev wrote:

Over the last several days some/all of our topologies are disappearing
from Nimbus. What are some possible explanations for this? Where
should I look to debug this problem?

Thanks



Re: Is it a must to have /etc/hosts mapping or a DNS in a multinode setup?

2014-04-29 Thread Derek Dagit

I have not tried it, but there is a config for this purpose:

https://github.com/apache/incubator-storm/blob/dc4de425eef5701ccafe0805f08eeb011baae0fb/storm-core/src/jvm/backtype/storm/Config.java#L122-L131
--
Derek

On 4/29/14, 0:41, Sajith wrote:

Hi all,

Is it a must to have a /etc/hosts mapping or a DNS in a multinode storm
cluster? Can't supervisors talk to each other through ZooKeeper or nimbus
using IP addresses directly?



Re: storm starter ExclamationTopology

2014-04-24 Thread Derek Dagit

In your storm cluster, you first need to verify whether nimbus is running
properly or not. Check nimbus.log in the $STORM_HOME/logs directory for error
logs.
Also check the nimbus.host parameter in ~/.storm/storm.yaml.


Yeah, that's what I was writing in my reply.  I'll go ahead and add below:


Start nimbus again, and make sure it is up.

If your nimbus host is the same host, try (assuming from here nimbus port is 
6627):

```
$ telnet localhost 6627
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
```

If you see that, then nimbus is up-and-running (accepting connections at least).
Check:
- storm.yaml files have correct nimbus.host and nimbus.thrift.port
- firewall settings
- routing (What interface did nimbus open a port on? `netstat -lnt | grep 6627`)


If not, check:
- Make sure likewise ZooKeeper is running.
- logs/nimbus.log (Is there some other issue?)


--
Derek

On 4/24/14, 11:48, Nishu wrote:

In your storm cluster, you first need to verify whether nimbus is running
properly or not. Check nimbus.log in the $STORM_HOME/logs directory for error
logs.
Also check the nimbus.host parameter in ~/.storm/storm.yaml.


On Thu, Apr 24, 2014 at 9:56 PM, Bilal Al Fartakh
alfartaj.bi...@gmail.comwrote:


And the question is, what should I fix, dear experts? :)


2014-04-24 16:23 GMT+00:00 Bilal Al Fartakh alfartaj.bi...@gmail.com:

  ~/src/storm-0.8.1/bin/storm jar

/root/src/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar
storm.starter.ExclamationTopology demo

I tried to run this and it said that the problem is with the nimbus
connection, but my storm client (which is at the same time a supervisor) is
connected to my nimbus (shown in the Storm UI).

Running: java -client -Dstorm.options= -Dstorm.home=/root/src/storm-0.8.1
-Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -cp
/root/src/storm-0.8.1/storm-0.8.1.jar:/root/src/storm-0.8.1/lib/asm-4.0.jar:/root/src/storm-0.8.1/lib/commons-codec-1.4.jar:/root/src/storm-0.8.1/lib/carbonite-1.5.0.jar:/root/src/storm-0.8.1/lib/kryo-2.17.jar:/root/src/storm-0.8.1/lib/clout-0.4.1.jar:/root/src/storm-0.8.1/lib/clojure-1.4.0.jar:/root/src/storm-0.8.1/lib/ring-servlet-0.3.11.jar:/root/src/storm-0.8.1/lib/hiccup-0.3.6.jar:/root/src/storm-0.8.1/lib/disruptor-2.10.1.jar:/root/src/storm-0.8.1/lib/tools.cli-0.2.2.jar:/root/src/storm-0.8.1/lib/snakeyaml-1.9.jar:/root/src/storm-0.8.1/lib/joda-time-2.0.jar:/root/src/storm-0.8.1/lib/jetty-util-6.1.26.jar:/root/src/storm-0.8.1/lib/commons-exec-1.1.jar:/root/src/storm-0.8.1/lib/jetty-6.1.26.jar:/root/src/storm-0.8.1/lib/servlet-api-2.5.jar:/root/src/storm-0.8.1/lib/jzmq-2.1.0.jar:/root/src/storm-0.8.1/lib/curator-framework-1.0.1.jar:/root/src/storm-0.8.1/lib/httpclient-4.1.1.jar:/root/src/storm-0.8.1/lib/slf4j-log4j12-1.5.8.jar:/root/src/storm-0.8.1/lib/clj-time-0.4.1.jar:/roo

t/src/storm-0.8.1/lib/commons-lang-2.5.jar:/root/src/storm-0.8.1/lib/libthrift7-0.7.0.jar:/root/src/storm-0.8.1/lib/log4j-1.2.16.jar:/root/src/storm-0.8.1/lib/servlet-api-2.5-20081211.jar:/root/src/storm-0.8.1/lib/tools.logging-0.2.3.jar:/root/src/storm-0.8.1/lib/ring-core-0.3.10.jar:/root/src/storm-0.8.1/lib/minlog-1.2.jar:/root/src/storm-0.8.1/lib/objenesis-1.2.jar:/root/src/storm-0.8.1/lib/jline-0.9.94.jar:/root/src/storm-0.8.1/lib/commons-io-1.4.jar:/root/src/storm-0.8.1/lib/ring-jetty-adapter-0.3.11.jar:/root/src/storm-0.8.1/lib/jgrapht-0.8.3.jar:/root/src/storm-0.8.1/lib/json-simple-1.1.jar:/root/src/storm-0.8.1/lib/tools.macro-0.1.0.jar:/root/src/storm-0.8.1/lib/commons-fileupload-1.2.1.jar:/root/src/storm-0.8.1/lib/compojure-0.6.4.jar:/root/src/storm-0.8.1/lib/httpcore-4.1.jar:/root/src/storm-0.8.1/lib/commons-logging-1.1.1.jar:/root/src/storm-0.8.1/lib/guava-13.0.jar:/root/src/storm-0.8.1/lib/curator-client-1.0.1.jar:/root/src/storm-0.8.1/lib/math.numeric-tower-0.0.1.jar:/roo
t/src/storm-0.8.1/lib/junit-3.8.1.jar:/root/src/storm-0.8.1/lib/slf4j-api-1.5.8.jar:/root/src/storm-0.8.1/lib/reflectasm-1.07-shaded.jar:/root/src/storm-0.8.1/lib/core.incubator-0.1.0.jar:/root/src/storm-0.8.1/lib/zookeeper-3.3.3.jar:/root/src/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar:/root/.storm:/root/src/storm-0.8.1/bin

-Dstorm.jar=/root/src/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar
storm.starter.ExclamationTopology demo

Exception in thread "main" java.lang.RuntimeException:
org.apache.thrift7.transport.TTransportException:
java.net.ConnectException: Connection refused
at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:36)
at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:17)
at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:53)
at storm.starter.ExclamationTopology.main(ExclamationTopology.java:59)
Caused by: org.apache.thrift7.transport.TTransportException:
java.net.ConnectException: Connection refused
at org.apache.thrift7.transport.TSocket.open(TSocket.java:183) at

Re: Storm UI

2014-04-02 Thread Derek Dagit

it displays different stats depending on whether or not I show/hide system 
stats.


This is expected.  There should be a tool tip on that button that says something like 
"toggle inclusion of system components".



How much the stats change seems to differ between Chrome and IE.


That button is setting a cookie's value to true/false.  The math is done on the 
server side and not in the browser, so the difference in the browser used 
should not matter beyond setting the cookie.

If the toggle is not taking effect at all for some browsers, then we should 
create a new Issue to take a look.

--
Derek

On 4/2/14, 8:04, David Crossland wrote:

I have a curious issue with the UI, it displays different stats depending on 
whether or not I show/hide system stats.  How much the stats change seems to 
differ between Chrome and IE.  Is this a known issue?

Thanks
David



Re: Error when trying to use multilang in a project built from scratch (not storm-starter)

2014-03-11 Thread Derek Dagit

Would you check your supervisor.log for a message like:

Could not extract <dir> from <jarpath>

--
Derek

On 3/10/14, 17:26, Chris James wrote:

Derek: No I cannot cd into that directory, but I can cd into the directory one 
up from it (dummy-topology-1-1394418571).  That directory contains 
stormcode.ser and stormconf.ser files.  The topology is running locally for 
testing, so I'm not launching any separate supervisor daemon.  It just seems 
like it never even attempted to create the resources directory (but it 
successfully created all the ancestor directories), as the folder isn't really 
locked down at all.

P. Taylor: I get what you're implying, but Eclipse is being run as an 
administrator already, and I am debugging the topology locally straight out of 
eclipse.  It seems bizarre that there would be permissions issues on a folder 
that the project itself created.


On Mon, Mar 10, 2014 at 6:16 PM, Derek Dagit der...@yahoo-inc.com wrote:

Two quick thoughts:

- Can you cd to
'C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources'
 from the shell as yourself?

- What are the permissions on that directory?  Is the supervisor daemon 
running as another user?

--
Derek


On 3/10/14, 17:05, P. Taylor Goetz wrote:

I don't have access to a windows machine at the moment, but does this 
help?

http://support.microsoft.com/kb/832434

On Mar 10, 2014, at 4:51 PM, Chris James chris.james.cont...@gmail.com wrote:

Reposting since I posted this before at a poor time and got no 
response.

I'm trying out a storm project built from scratch in Java, but with 
a Python bolt.  I have everything running with all Java spouts/bolts just fine, 
but when I try to incorporate a python bolt I am running into issues.

I have my project separated into a /storm/ for topologies, 
/storm/bolts/ for bolts, /storm/spouts for spouts, and /storm/multilang/ for 
the multilang wrappers. Right now the only thing in /storm/multilang/ is 
storm.py, copied and pasted from the storm-starter project.  In my bolts 
folder, I have a dummy bolt set up that just prints the tuple.  I've virtually 
mimicked the storm-starter WordCountTopology example for using a python bolt, 
so I think the code is OK and the configuration is the issue.

So my question is simple.  What configuration steps do I have to set up so that my
topology knows where to look to find storm.py when I run super("python",
"dummypythonbolt.py")?  I noticed an error in the stack trace claiming that it could not
run "python" (python is definitely on my path and I use it every day), and that it is looking in a
resources folder that does not exist.  Here is the line in question:

Caused by: java.io.IOException: Cannot run program "python" (in directory
"C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources"):
 CreateProcess error=267, The directory name is invalid

A more extensive stack trace is here: http://pastebin.com/6yx97m0M

So once again: what is the configuration step that I am missing to 
allow my topology to see storm.py and be able to run multilang spouts/bolts in 
my topology?

Thanks!
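
For what it's worth, the multilang scripts are resolved against the resources directory that the supervisor extracts from the topology jar, which is exactly the missing ...\resources directory named in the error. A minimal sketch of the Java side, modeled on the storm-starter pattern, is shown below; the class name is made up, and the Maven layout mentioned in the comments is an assumption about a typical setup rather than something confirmed in this thread:

    import java.util.Map;

    import backtype.storm.task.ShellBolt;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.tuple.Fields;

    // Java-side wrapper that runs the Python script over the multilang protocol.
    // Both storm.py and dummypythonbolt.py need to end up under resources/ inside the
    // topology jar (for example by placing them in src/main/resources/resources/ of a
    // Maven project), because the supervisor extracts that directory and launches the
    // script from there.
    public class DummyPythonBolt extends ShellBolt implements IRichBolt {
        public DummyPythonBolt() {
            super("python", "dummypythonbolt.py");
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));   // illustrative output field
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }

With the scripts packaged that way, the supervisor extracts resources/ as the working directory for the shell process, and the super("python", "dummypythonbolt.py") call can find the script, which is consistent with the directory shown in the IOException above.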




Re: java.lang.OutOfMemoryError: Java heap space in Nimbus

2014-03-10 Thread Derek Dagit

Yes, set 'nimbus.childopts: -Xmx?' in your storm.yaml, and restart nimbus. If 
unset, I believe the default is -Xmx1024m, for a max of 1024 MB heap.

You can set it to -Xmx2048m, for example, to have a max heap size of 2048 MB.

Set this on the node that runs nimbus, not in your topology conf.
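
As a concrete illustration (the 2048 MB figure is just an example value, not a recommendation), the line in storm.yaml on the nimbus node would look like:

    nimbus.childopts: "-Xmx2048m"

Nimbus then has to be restarted for the new heap limit to take effect.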

--
Derek

On 3/10/14, 14:19, shahab wrote:

Hi,

I am facing an "OutOfMemoryError: Java heap space" exception in Nimbus while
running in cluster mode. I wonder what JVM or Storm options I can set to
overcome this problem?

I am running a storm topology in cluster mode where all servers (zookeeper,
nimbus, supervisor and worker) are on one machine. The topology configuration that I use is as
follows:


conf.setMaxSpoutPending(2000); // maximum number of pending messages at spout
conf.setNumWorkers(4);
conf.put(Config.STORM_ZOOKEEPER_CONNECTION_TIMEOUT, 12);
conf.setMaxTaskParallelism(2);


but I get the following Exception in Nimbus log file:
java.lang.OutOfMemoryError: Java heap space
 at 
org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:271)
 at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:219)
 at 
org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1136)
 at 
backtype.storm.daemon.nimbus$read_storm_topology.invoke(nimbus.clj:305)
 at 
backtype.storm.daemon.nimbus$compute_executors.invoke(nimbus.clj:407)
 at 
backtype.storm.daemon.nimbus$compute_executor__GT_component.invoke(nimbus.clj:420)
 at 
backtype.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:315)
 at 
backtype.storm.daemon.nimbus$mk_assignments$iter__3416__3420$fn__3421.invoke(nimbus.clj:636)
 at clojure.lang.LazySeq.sval(LazySeq.java:42)
 at clojure.lang.LazySeq.seq(LazySeq.java:60)
 at clojure.lang.RT.seq(RT.java:473)
 at clojure.core$seq.invoke(core.clj:133)
 at clojure.core.protocols$seq_reduce.invoke(protocols.clj:30)
 at clojure.core.protocols$fn__5875.invoke(protocols.clj:54)
 at 
clojure.core.protocols$fn__5828$G__5823__5841.invoke(protocols.clj:13)
 at clojure.core$reduce.invoke(core.clj:6030)
 at clojure.core$into.invoke(core.clj:6077)
 at backtype.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:635)
 at clojure.lang.RestFn.invoke(RestFn.java:410)
 at 
backtype.storm.daemon.nimbus$fn__3592$exec_fn__1228__auto3593$fn__3598$fn__3599.invoke(nimbus.clj:872)
 at 
backtype.storm.daemon.nimbus$fn__3592$exec_fn__1228__auto3593$fn__3598.invoke(nimbus.clj:871)
 at 
backtype.storm.timer$schedule_recurring$this__1776.invoke(timer.clj:69)
 at backtype.storm.timer$mk_timer$fn__1759$fn__1760.invoke(timer.clj:33)
 at backtype.storm.timer$mk_timer$fn__1759.invoke(timer.clj:26)
 at clojure.lang.AFn.run(AFn.java:24)
 at java.lang.Thread.run(Thread.java:744)
2014-03-10 20:10:02 NIOServerCnxn [ERROR] Thread 
Thread[pool-4-thread-40,5,main] died
java.lang.OutOfMemoryError: Java heap space
 at 
org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:271)
 at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:219)
 at 
org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1136)
 at 
backtype.storm.daemon.nimbus$read_storm_topology.invoke(nimbus.clj:305)
 at 
backtype.storm.daemon.nimbus$fn__3592$exec_fn__1228__auto__$reify__3605.getTopologyInfo(nimbus.clj:1066)
 at 
backtype.storm.generated.Nimbus$Processor$getTopologyInfo.getResult(Nimbus.java:1481)
 at 
backtype.storm.generated.Nimbus$Processor$getTopologyInfo.getResult(Nimbus.java:1469)
 at org.apache.thrift7.ProcessFunction.process(ProcessFunction.java:32)
 at org.apache.thrift7.TBaseProcessor.process(TBaseProcessor.java:34)
 at 
org.apache.thrift7.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:632)
 at 
org.apache.thrift7.server.THsHaServer$Invocation.run(THsHaServer.java:201)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
2014-03-10 20:10:02 util [INFO] Halting process: (Error when processing an 
event)


best ,
/Shahab



Re: Error when trying to use multilang in a project built from scratch (not storm-starter)

2014-03-10 Thread Derek Dagit

Two quick thoughts:

- Can you cd to 
'C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources'
 from the shell as yourself?

- What are the permissions on that directory?  Is the supervisor daemon running 
as another user?

--
Derek

On 3/10/14, 17:05, P. Taylor Goetz wrote:

I don't have access to a windows machine at the moment, but does this help?

http://support.microsoft.com/kb/832434

On Mar 10, 2014, at 4:51 PM, Chris James chris.james.cont...@gmail.com wrote:


Reposting since I posted this before at a poor time and got no response.

I'm trying out a storm project built from scratch in Java, but with a Python 
bolt.  I have everything running with all Java spouts/bolts just fine, but when 
I try to incorporate a python bolt I am running into issues.

I have my project separated into a /storm/ for topologies, /storm/bolts/ for 
bolts, /storm/spouts for spouts, and /storm/multilang/ for the multilang 
wrappers. Right now the only thing in /storm/multilang/ is storm.py, copied and 
pasted from the storm-starter project.  In my bolts folder, I have a dummy bolt 
set up that just prints the tuple.  I've virtually mimicked the storm-starter 
WordCountTopology example for using a python bolt, so I think the code is OK 
and the configuration is the issue.

So my question is simple.  What configuration steps do I have to set up so that my topology knows
where to look to find storm.py when I run super("python",
"dummypythonbolt.py")?  I noticed an error in the stack trace claiming that it could not
run "python" (python is definitely on my path and I use it every day), and that it is looking in a
resources folder that does not exist.  Here is the line in question:

Caused by: java.io.IOException: Cannot run program "python" (in directory
"C:\Users\chris\AppData\Local\Temp\67daff0e-7348-46ee-9b62-83f8ee4e431c\supervisor\stormdist\dummy-topology-1-1394418571\resources"):
 CreateProcess error=267, The directory name is invalid

A more extensive stack trace is here: http://pastebin.com/6yx97m0M

So once again: what is the configuration step that I am missing to allow my 
topology to see storm.py and be able to run multilang spouts/bolts in my 
topology?

Thanks!


Re: [RELEASE] Apache Storm 0.9.1-incubating released (defaults.yaml)

2014-02-26 Thread Derek Dagit

The defaults.yaml file is part of the source distribution and is packaged into 
storm's jar when deployed.

In a storm cluster deployment, it is not meant to be on the file system in 
${storm.home}/conf.

Perhaps you are pointing to your source working tree as storm home?
--
Derek

On 2/26/14, 5:59, Lajos wrote:

Quick question on this: defaults.yaml is in both conf and storm-core.jar, so 
the first time you start nimbus 0.9.1 you get this message:

java.lang.RuntimeException: Found multiple defaults.yaml resources. You're 
probably bundling the Storm jars with your topology jar. 
[file:/scratch/projects/apache-storm-0.9.1-incubating/conf/defaults.yaml, 
jar:file:/scratch/projects/apache-storm-0.9.1-incubating/lib/storm-core-0.9.1-incubating.jar!/defaults.yaml]
 at backtype.storm.utils.Utils.findAndReadConfigFile(Utils.java:133) 
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
...

Shouldn't conf/defaults.yaml be called like conf/defaults.yaml.copy or 
something? I like that it is in the conf directory, because now I can easily 
see all the config options instead of having to go to the source directory. But 
it shouldn't prevent startup ...

Thanks,

Lajos



On 22/02/2014 21:09, P. Taylor Goetz wrote:

The Storm team is pleased to announce the release of Apache Storm version 
0.9.1-incubating. This is our first Apache release.

Storm is a distributed, fault-tolerant, and high-performance realtime 
computation system that provides strong guarantees on the processing of data. 
You can read more about Storm on the project website:

http://storm.incubator.apache.org

Downloads of source and binary distributions are listed in our download
section:

http://storm.incubator.apache.org/downloads.html

Distribution artifacts are available in Maven Central at the following 
coordinates:

groupId: org.apache.storm
artifactId: storm-core
version: 0.9.1-incubating

The full list of changes is available here[1]. Please let us know [2] if you 
encounter any problems.

Enjoy!

[1]: http://s.apache.org/Ki0 (CHANGELOG)
[2]: https://issues.apache.org/jira/browse/STORM



Re: How to specify worker.childopts for a specified topology?

2014-02-18 Thread Derek Dagit

Try this:

conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, WORKER_OPTS);

Your WORKER_OPTS should be appended to WORKER_CHILDOPTS.
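
Spelled out with a literal value in place of the WORKER_OPTS constant (the heap size and topology name below are only illustrative), that looks roughly like:

    Config conf = new Config();
    // Per-topology worker JVM options; Storm appends these to the cluster-wide
    // worker.childopts instead of replacing them.
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2048m");
    StormSubmitter.submitTopology("memory-hungry-topology", conf, builder.createTopology());

Here builder is assumed to be the TopologyBuilder that wires up the spouts and bolts.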
--
Derek

On 2/18/14, 1:47, Link Wang wrote:

Dear all,
I want to specify some worker.childopts for my topology inside its code, and I
tried this:
conf.put(Config.WORKER_CHILDOPTS, WORKER_OPTS);
but I found it doesn't work.

I don't use the storm.yaml file to set worker.childopts, because the memory
requirements of my topologies differ widely.

Has anyone else encountered the same problem?


Re: Storm 0.9.0.1 and Zookeeper 3.4.5 hung issue.

2014-02-14 Thread Derek Dagit

Some changes to storm code are necessary for this.

See https://github.com/apache/incubator-storm/pull/29/files
--
Derek

On 2/14/14, 11:50, Saurabh Agarwal (BLOOMBERG/ 731 LEXIN) wrote:

Thanks Bijoy for reply.

We can't downgrade to 3.3.3, as our system has a zookeeper 3.4.5 server running,
and we would like to keep the same version of the zookeeper client to avoid any
incompatibility issues.

The error we are getting with 3.4.5 is.
Caused by: java.lang.ClassNotFoundException: 
org.apache.zookeeper.server.NIOServerCnxn$Factory

After looking at zookeeper code, static Factory class within NIOSeverCnxn class 
has been removed in 3.4.5 version.

zookeeper version 3.3.3 is 3 years old. Shouldn't Storm's code be updated to
run with the latest zookeeper version? Should I create a JIRA for this?


- Original Message -
From: user@storm.incubator.apache.org
To: SAURABH AGARWAL (BLOOMBERG/ 731 LEXIN), user@storm.incubator.apache.org
At: Feb 14 2014 11:45:50

Hi,

We had also downgraded zookeeper from 3.4.5 to 3.3.3 due to issues with
Storm. But we are not facing any issues related to Kafka after the
downgrade. We are using Storm 0.9.0-rc2 and Kafka 0.8.0.

Thanks
Bijoy

On Fri, Feb 14, 2014 at 9:57 PM, Saurabh Agarwal (BLOOMBERG/ 731 LEXIN) 
sagarwal...@bloomberg.net wrote:


Hi,

The Storm 0.9.0.1 client linked with the zookeeper 3.4.5 library hung on zookeeper
initialization. Is this a known issue?

453  [main] INFO  org.apache.zookeeper.server.ZooKeeperServer - init  -
Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout
4 datadir /tmp/7b520ac7-ff87-4eb6-9fc5-3a16deec0272/version-2 snapdir
/tmp/7b520ac7-ff87-4eb6-9fc5-3a16deec0272/version-2

The client works fine with zookeeper 3.3.3. However, we are using storm with
kafka, and kafka does not work with zookeeper 3.3.3 but does work with 3.4.5.

any help is appreciated...
Thanks,
Saurabh.




Re: Need to set worker environment variables or system properties

2013-11-20 Thread Derek Dagit

Yeah, use topology.worker.childopts when submitting.  I believe it is appended 
to the cluster's worker.childopts.
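
For a library that reads a system property, a per-topology setting at submit time might look like this (the property name and path are made up for illustration):

    Config conf = new Config();
    // Each -D option here becomes a JVM system property in every worker of this topology.
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Dmylib.config=/etc/mylib.conf");
    StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());

Note that child opts cover JVM arguments and -D system properties rather than true environment variables.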
--
Derek

On 11/20/13 15:28, Tom Brown wrote:

Is there a way to manage the environment variables or system properties of each 
worker on a topology-by-topology basis?

I would like to include a library with my code, but the library only supports 
configuration through environment variables or system properties.

Thanks in advance!

--Tom