Re: Storm failing all the tuples post worker restart

Devang Shah Sun, 02 Nov 2014 17:30:07 -0800

Thanks for letting me know.

Would you know the bug ID? I checked the 0.9.3 changelog but could not
find it. Incidentally, I have also raised a JIRA and would like to close
it with a reference to the one raised earlier.
Thanks.
On 31 Oct 2014 21:49, "M.Tarkeshwar Rao" <[email protected]> wrote:


> Yes, it is the bug that was raised by Denigel. It is fixed in 0.9.3;
> please use it. Or use ZeroMQ in place of Netty and your problem will be
> resolved.
> On 27 Oct 2014 20:52, "Devang Shah" <[email protected]> wrote:
>
>> It seems to be a bug in Storm unless someone confirms otherwise.
>>
>> How can I file a bug for Storm?
>>  On 25 Oct 2014 07:51, "Devang Shah" <[email protected]> wrote:
>>
>>> You are correct, Taylor. Sorry, I forgot to mention all the details.
>>>
>>> We have topology.max.spout.pending set to 1000 and we have not modified
>>> topology.message.timeout.secs (default 30 secs).
>>>
>>> Another observation:
>>> When I deliberately bring down the worker (kill -9) and the worker is
>>> brought up on the same port it was previously running on, Storm starts
>>> failing all the messages despite their being successfully processed. If
>>> the worker is brought up on a different supervisor port, the issue does
>>> not seem to occur.
>>>
>>> Example steps:
>>> 1. A worker runs on supervisor slot 6703 (this worker runs the single
>>> spout instance of our topology) and everything runs fine. Messages get
>>> processed and are acked back to the message provider. If I let it run in
>>> this state it can process any number of messages.
>>> 2. I bring down the Java process with kill -9.
>>> 3. The supervisor brings up the worker on the same slot, 6703, along
>>> with the spout task instance on it.
>>> 4. All the messages get processed fine, but the ackers fail all the
>>> messages the topology processed after the default 30 secs timeout. This
>>> happens even when the topology is idle and I push a single message into
>>> it. So my guess is that increasing the timeout will not help (though I
>>> have not tried it).
>>> 5. If the supervisor brings up the worker on a different slot, say 6700,
>>> then the issue doesn't seem to occur. Probably a bug in Storm.
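
The reasoning in step 4 can be illustrated with a small model. This is a simplified sketch, assuming a spout-side table of pending message ids with a fixed timeout. The `PendingTracker` class and its method names are hypothetical and are not Storm's implementation. It only shows why a message whose ack never arrives is failed once the timeout elapses, regardless of load, so a longer timeout would delay the failure rather than prevent it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of pending-message tracking; not Storm's code.
class PendingTracker {
    private final long timeoutMs;
    // message id -> emit time in milliseconds
    private final Map<String, Long> pending = new LinkedHashMap<>();

    PendingTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Record a message as emitted at the given time.
    void emit(String msgId, long nowMs) {
        pending.put(msgId, nowMs);
    }

    // Returns true if the ack arrived while the message was still pending.
    boolean ack(String msgId) {
        return pending.remove(msgId) != null;
    }

    // Removes and returns every message whose ack has not arrived within the timeout.
    List<String> expire(long nowMs) {
        List<String> failed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                failed.add(entry.getKey());
                it.remove();
            }
        }
        return failed;
    }
}
```

With a 30000 ms timeout, a single message emitted at time 0 and never acked is returned by `expire(30000)` even though nothing else is in flight, which matches the idle-topology observation above.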
>>>
>>> Steps to reproduce the behaviour:
>>> 1. Run a topology (a single spout instance and multiple bolt instances)
>>> with multiple workers.
>>> 2. Identify the slot on which the single spout instance is running and
>>> kill that worker.
>>> 3. See if the supervisor started the worker on the same port. If not,
>>> repeat step 2 until the worker comes back on the same slot as before.
>>> 4. Pump a message into the topology.
>>> 5. You will see the message being processed successfully and also the
>>> ackers failing the message. This can be verified with logging statements
>>> in the ack and fail methods of the spout.
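
The verification in step 5 could look roughly like the sketch below. It is a dependency-free stand-in, not the spout from this thread: `AckFailTally`, `onAck` and `onFail` are hypothetical names. In a real spout the two methods would be called from the spout's `ack(Object msgId)` and `fail(Object msgId)` callbacks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical helper that logs and counts ack/fail callbacks per message id.
class AckFailTally {
    private static final Logger LOG = Logger.getLogger(AckFailTally.class.getName());

    // message id -> last outcome seen ("ACK" or "FAIL")
    private final Map<Object, String> lastOutcome = new HashMap<>();
    private int acked;
    private int failed;

    // Call from the spout's ack(Object msgId).
    synchronized void onAck(Object msgId) {
        acked++;
        lastOutcome.put(msgId, "ACK");
        LOG.info("ACK  msgId=" + msgId + " totalAcked=" + acked);
    }

    // Call from the spout's fail(Object msgId).
    synchronized void onFail(Object msgId) {
        failed++;
        lastOutcome.put(msgId, "FAIL");
        LOG.info("FAIL msgId=" + msgId + " totalFailed=" + failed);
    }

    synchronized int ackedCount() {
        return acked;
    }

    synchronized int failedCount() {
        return failed;
    }

    // Returns "ACK", "FAIL", or null if the id was never reported.
    synchronized String outcomeOf(Object msgId) {
        return lastOutcome.get(msgId);
    }
}
```

Comparing the logged `FAIL` ids against the records that actually reached the downstream sink is one way to confirm that the failed messages were in fact processed.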
>>>
>>> Thanks and Regards,
>>> Devang
>>> On 25 Oct 2014 04:34, "P. Taylor Goetz" <[email protected]> wrote:
>>>
>>>> My guess is that you are getting timeouts.
>>>>
>>>> Do you have topology.max.spout.pending set? If so, what is the value?
>>>> Have you overridden topology.message.timeout.secs (default is 30
>>>> seconds)?
>>>>
>>>> Look in Storm UI for the complete latency of the topology. Is it close
>>>> to or greater than topology.message.timeout.secs?
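
The check suggested here can be sketched as follows. The key names are the ones defined in Storm's `Config` class (`topology.max.spout.pending`, `topology.message.timeout.secs`). The values 1000 and 30 are the ones reported elsewhere in this thread. The `TimeoutCheck` class, the plain `Map` standing in for the topology configuration, and the 0.8 "close to" threshold in the usage note are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper comparing complete latency against the message timeout.
class TimeoutCheck {
    // Key names as defined in Storm's Config class.
    static final String MAX_SPOUT_PENDING = "topology.max.spout.pending";
    static final String MESSAGE_TIMEOUT_SECS = "topology.message.timeout.secs";

    // Settings as reported in this thread: 1000 pending, default 30 s timeout.
    static Map<String, Object> exampleConf() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(MAX_SPOUT_PENDING, 1000);
        conf.put(MESSAGE_TIMEOUT_SECS, 30);
        return conf;
    }

    // True when the complete latency (in ms, as shown in Storm UI) reaches
    // the given fraction of the configured message timeout.
    static boolean nearTimeout(Map<String, Object> conf, double completeLatencyMs, double fraction) {
        int timeoutSecs = ((Number) conf.get(MESSAGE_TIMEOUT_SECS)).intValue();
        return completeLatencyMs >= fraction * timeoutSecs * 1000.0;
    }
}
```

For example, `TimeoutCheck.nearTimeout(TimeoutCheck.exampleConf(), 500.0, 0.8)` is false, which would point away from ordinary timeouts, while a complete latency of 25000 ms would make it true.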
>>>>
>>>>
>>>> -Taylor
>>>>
>>>>
>>>> On Oct 23, 2014, at 12:44 PM, Devang Shah <[email protected]>
>>>> wrote:
>>>>
>>>> Hi Team,
>>>>
>>>> I am facing an issue with one of our failover tests. Storm fails all
>>>> the messages after a worker restart.
>>>>
>>>> Steps done,
>>>> 0. 1 spout, 3 bolts, 5 ackers
>>>> 1. Pre-load tibems with 50k messages
>>>> 2. Start the topology
>>>> 3. Let it run for a brief time and then kill the worker where the spout
>>>> is executing (the spout in our topology is a single instance)
>>>> 4. The worker is brought up by the supervisor automatically
>>>>
>>>> Observation/query,
>>>> When the spout starts pumping data into the topology again, Storm starts
>>>> failing the messages even though they are successfully processed (I have
>>>> verified this, as our last bolt pushes data to Kafka and the incoming and
>>>> Kafka message counts match). I have checked the tuple anchoring and it
>>>> seems to be fine, since without the worker restart the topology acks and
>>>> processes messages fine.
>>>>
>>>> Anything else I should check?
>>>>
>>>>
>>>>
