https://issues.apache.org/jira/browse/STORM-406


From: Devang Shah [mailto:[email protected]]
Sent: 04 November 2014 10:23
To: [email protected]
Subject: Re: Storm failing all the tuples post worker restart


Thanks Sean.

We are using 0.9.2

Have not tried the issue with 0.9.3. Will try with this and confirm back.

Any idea when Storm 0.9.3 will be available for use in production 
environments?

Thanks and Regards,
Devang
On 3 Nov 2014 11:38, "Sean Zhong" <[email protected]> wrote:
Hi Devang,

Which storm version are you using?
You may want to check STORM-404 and STORM-329.

Sean


On Mon, Nov 3, 2014 at 9:27 AM, Devang Shah <[email protected]> wrote:

Thanks much for notifying.

Would you know the bug id? I referred to the change log of 0.9.3 but could 
not find it. Incidentally, I too have raised a JIRA and would like to close 
it with a reference to the previously raised one.
Thanks.
On 31 Oct 2014 21:49, "M.Tarkeshwar Rao" <[email protected]> wrote:

Yes, it is the bug raised by Denigel, fixed in 0.9.3; please use that release. 
Or use ZeroMQ in place of Netty and your problem will be resolved.
On 27 Oct 2014 20:52, "Devang Shah" <[email protected]> wrote:

It seems to be a bug in Storm unless someone confirms otherwise.

How can I file a bug for Storm?
On 25 Oct 2014 07:51, "Devang Shah" <[email protected]> wrote:

You are correct, Taylor. Sorry, I missed mentioning all the details.

We have topology.max.spout.pending set to 1000 and we have not modified 
topology.message.timeout.secs (default 30 secs).
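For reference, these two settings can be applied programmatically when submitting the topology. This is a sketch against the Storm 0.9.x `backtype.storm.Config` API; the class name is illustrative:

```java
import backtype.storm.Config;

// Sketch: the topology settings described above (Storm 0.9.x API).
public class TopologySettings {
    public static Config build() {
        Config conf = new Config();
        conf.setMaxSpoutPending(1000);   // topology.max.spout.pending
        conf.setMessageTimeoutSecs(30);  // topology.message.timeout.secs (the default)
        return conf;
    }
}
```

The same keys can also be set in storm.yaml or passed on the command line with `-c`; the programmatic form shown here takes effect per-topology at submission time.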

Another observation,
When I deliberately bring down the worker (kill -9) and the worker is brought 
back up on the same port it was previously running on, Storm starts failing 
all the messages despite their being successfully processed. If the worker is 
brought up on a different supervisor port, the issue doesn't seem to occur.

Eg steps,
1. Worker running on supervisor slot 6703 (this worker runs the single spout 
instance of our topology) and everything runs fine. Messages get processed and 
acked back to the message provider. If I let it run in this state it can 
process any number of messages.
2. I bring down the java process with kill -9.
3. The supervisor brings up the worker on the same slot 6703, along with the 
spout task instance on it.
4. All the messages get processed fine, but the ackers fail every message the 
topology processed after the default 30 secs timeout. This happens even when 
the topology is idle and I push a single message into it. So my guess is that 
increasing the timeout will not help (though I have not tried it).
5. If the supervisor brings up the worker on a different slot, say 6700, the 
issue doesn't seem to occur. Probably a bug in Storm.

Steps to simulate the behaviour,
1. Run a topology (spout as a single instance and multiple instances of bolts) 
with multiple workers.
2. Identify the slot on which the single spout instance is running and kill it.
3. See if the supervisor restarted the worker on the same port. If not, repeat 
step 2 until the supervisor assigns the same slot as before.
4. Pump a message into the topology.
5. You will see the message being processed successfully and yet the ackers 
failing it. This can be verified with logging statements in the ack and fail 
methods of the spout.
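The logging hooks mentioned in step 5 might look like the following. This is a sketch against the Storm 0.9.x spout API (`backtype.storm.topology.base.BaseRichSpout`); the class name and logger setup are illustrative, and only the ack/fail overrides are shown:

```java
import backtype.storm.topology.base.BaseRichSpout;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: log acks and fails in the spout to observe the behaviour in step 5.
// Only the ack/fail overrides are shown; nextTuple/open/declareOutputFields
// would be implemented by the concrete spout.
public abstract class LoggingSpout extends BaseRichSpout {
    private static final Logger LOG = LoggerFactory.getLogger(LoggingSpout.class);

    @Override
    public void ack(Object msgId) {
        LOG.info("ACK received for message {}", msgId);
    }

    @Override
    public void fail(Object msgId) {
        LOG.warn("FAIL received for message {} (likely a timeout)", msgId);
    }
}
```

With this in place, a message that is processed end-to-end but still reported failed will show up as a FAIL log entry without a preceding ACK for the same msgId.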

Thanks and Regards,
Devang
On 25 Oct 2014 04:34, "P. Taylor Goetz" <[email protected]> wrote:
My guess is that you are getting timeouts.

Do you have topology.max.spout.pending set? If so, what is the value?
Have you overridden topology.message.timeout.secs (the default is 30 seconds)?

Look in Storm UI for the complete latency of the topology. Is it close to or 
greater than topology.message.timeout.secs?


-Taylor


On Oct 23, 2014, at 12:44 PM, Devang Shah <[email protected]> wrote:



Hi Team,

I am facing an issue with one of our failover tests. Storm fails all the 
messages after a worker restart.

Steps done,
0. 1 spout, 3 bolts, 5 ackers
1. Pre-load tibems with 50k messages
2. Start the topology
3. Let it run for a brief time and then kill the worker where the spout is 
executing (the spout in our topology is a single instance)
4. The worker is brought up by the supervisor automatically
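The setup in steps 0-2 corresponds roughly to the following submission code. This is a sketch against the Storm 0.9.x API; the spout/bolt class names, the topology name, and the worker count are placeholders, not details from the original test:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

// Sketch of the failover test topology: 1 spout, 3 bolts, 5 ackers.
// MySpout, BoltA/B/C, "failover-test", and the worker count are placeholders.
public class FailoverTestTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 1);  // single spout instance
        builder.setBolt("boltA", new BoltA()).shuffleGrouping("spout");
        builder.setBolt("boltB", new BoltB()).shuffleGrouping("boltA");
        builder.setBolt("boltC", new BoltC()).shuffleGrouping("boltB");

        Config conf = new Config();
        conf.setNumAckers(5);    // topology.acker.executors
        conf.setNumWorkers(2);   // multiple workers, as in the test

        StormSubmitter.submitTopology("failover-test", conf,
                builder.createTopology());
    }
}
```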

Observation/query,
When the spout starts pumping data into the topology again, Storm starts 
failing the messages even though they are successfully processed (I have 
verified this, as our last bolt pushes data to Kafka and the incoming/Kafka 
message counts match). I have checked the tuple anchoring and that seems to be 
fine; without the worker restart the topology acks and processes messages fine.

Anything else I should check?

