Hi Devang,

Which Storm version are you using? You may want to check STORM-404 and STORM-329.
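The two suggestions further down the thread (upgrade to 0.9.3, or swap Netty for ZeroMQ) both come down to the storm.messaging.transport setting. A minimal sketch of that override in Java follows; note that the setting normally lives in storm.yaml on every node, and the ZeroMQ transport class name shown is an assumption based on 0.9.x defaults, so verify it against your release.

import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;

public class TransportSketch {
    // Returns a conf override selecting the worker messaging transport.
    // Normally this is set cluster-wide in storm.yaml, e.g.:
    //   storm.messaging.transport: "backtype.storm.messaging.netty.Context"
    public static Map<String, Object> transportOverride(boolean useZeroMq) {
        Map<String, Object> conf = new HashMap<String, Object>();
        // "backtype.storm.messaging.zmq" is assumed from the 0.9.x defaults; verify for your release.
        conf.put(Config.STORM_MESSAGING_TRANSPORT,
                 useZeroMq ? "backtype.storm.messaging.zmq"
                           : "backtype.storm.messaging.netty.Context");
        return conf;
    }
}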
Sean

On Mon, Nov 3, 2014 at 9:27 AM, Devang Shah <[email protected]> wrote:

> Thanks much for notifying.
>
> Would you know the bug id? I did refer to the change log of 0.9.3 but
> could not get hold of the bug id. Incidentally, I too have raised a JIRA
> and would like to close it with a reference to the previously raised JIRA.
> Thanks.
>
> On 31 Oct 2014 21:49, "M.Tarkeshwar Rao" <[email protected]> wrote:
>
>> Yes, it is the bug which was raised by Denigel. It is fixed in 0.9.3,
>> please use it. Or use ZeroMQ in place of Netty and your problem will be
>> resolved.
>>
>> On 27 Oct 2014 20:52, "Devang Shah" <[email protected]> wrote:
>>
>>> It seems to be a bug in Storm unless someone confirms otherwise.
>>>
>>> How can I file a bug for Storm?
>>>
>>> On 25 Oct 2014 07:51, "Devang Shah" <[email protected]> wrote:
>>>
>>>> You are correct, Taylor. Sorry, I missed mentioning all the details.
>>>>
>>>> We have topology.max.spout.pending set to 1000 and we have not
>>>> modified topology.message.timeout.secs (default 30 secs).
>>>>
>>>> Another observation:
>>>> When I deliberately bring down the worker (kill -9) and the worker is
>>>> brought up on the same port it was running on previously, Storm starts
>>>> failing all the messages despite them being successfully processed. If
>>>> the worker is brought up on a different supervisor port, the issue
>>>> doesn't seem to occur.
>>>>
>>>> Example steps:
>>>> 1. Worker running on supervisor slot 6703 (this worker runs the single
>>>> spout instance of our topology) and everything runs fine. Messages get
>>>> processed and acked back to the message provider. If I let it run in
>>>> this state it can process any number of messages.
>>>> 2. I bring down the java process with kill -9.
>>>> 3. The supervisor brings up the worker on the same slot 6703, and also
>>>> the spout task instance on it.
>>>> 4. All the messages get processed fine, but the ackers fail all the
>>>> messages the topology processed after the default 30 secs timeout. This
>>>> even happens when the topology is idle and I push a single message into
>>>> the topology, so my guess is that increasing the timeout will not help
>>>> (though I have not tried it).
>>>> 5. If the supervisor brings up the worker on a different slot, say
>>>> 6700, then the issue doesn't seem to occur. Probably a bug in Storm.
>>>>
>>>> Steps to simulate the behaviour:
>>>> 1. Run the topology (spout as a single instance and multiple instances
>>>> of bolts) with multiple workers.
>>>> 2. Identify the slot on which the single spout instance is running and
>>>> kill it.
>>>> 3. See if the supervisor started the worker on the same port. If not,
>>>> repeat step 2 until you get a worker on the same slot as the previous
>>>> one.
>>>> 4. Pump a message into the topology.
>>>> 5. You will see the message being processed successfully and also the
>>>> ackers failing the message. This can be verified by logging statements
>>>> in the ack and fail methods of the spout (see the sketch just below
>>>> this mail).
>>>>
>>>> Thanks and Regards,
>>>> Devang
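As Devang notes in step 5 above, logging in the spout's ack and fail callbacks is the quickest way to see which side is misbehaving. A minimal sketch of such a spout, using the 0.9.x backtype.storm API; the class and field names are made up for illustration, and the important detail is that tuples are emitted with an explicit message id, since tuples emitted without one are never tracked by the ackers.

import java.util.Map;
import java.util.UUID;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical spout used only to observe ack/fail behaviour after a worker restart.
public class TestSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String msgId = UUID.randomUUID().toString();
        // Emitting with a message id registers the tuple tree with the ackers.
        collector.emit(new Values("payload-" + msgId), msgId);
    }

    @Override
    public void ack(Object msgId) {
        System.out.println("ACKED  " + msgId + " at " + System.currentTimeMillis());
    }

    @Override
    public void fail(Object msgId) {
        System.out.println("FAILED " + msgId + " at " + System.currentTimeMillis());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));
    }
}

If FAILED lines show up here after the worker restart even though the downstream counts (e.g. in Kafka) still match, that suggests the acks are going missing somewhere between the ackers and the restarted spout worker rather than anything being wrong with the bolts.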
>>>> On 25 Oct 2014 04:34, "P. Taylor Goetz" <[email protected]> wrote:
>>>>
>>>>> My guess is that you are getting timeouts.
>>>>>
>>>>> Do you have topology.max.spout.pending set? If so, what is the value?
>>>>> Have you overridden topology.message.timeout.secs (default is 30
>>>>> seconds)?
>>>>>
>>>>> Look in Storm UI for the complete latency of the topology. Is it
>>>>> close to or greater than topology.message.timeout.secs?
>>>>>
>>>>> -Taylor
>>>>>
>>>>> On Oct 23, 2014, at 12:44 PM, Devang Shah <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Team,
>>>>>
>>>>> I am facing an issue with one of our failover tests. Storm fails all
>>>>> the messages after a worker restart.
>>>>>
>>>>> Steps done:
>>>>> 0. 1 spout, 3 bolts, 5 ackers
>>>>> 1. Pre-load tibems with 50k messages
>>>>> 2. Start the topology
>>>>> 3. Let it run for a brief time and then kill the worker where the
>>>>> spout is executing (the spout in our topology is a single instance)
>>>>> 4. The worker is brought up by the supervisor automatically
>>>>>
>>>>> Observation/query:
>>>>> When the spout starts pumping data into the topology again, Storm
>>>>> starts failing the messages even though they are successfully
>>>>> processed (I have verified this as our last bolt pushes data to Kafka
>>>>> and the incoming/Kafka data number matches). I have checked the tuple
>>>>> anchoring and that seems to be fine, as without the worker restarts
>>>>> the topology acks and processes messages fine.
>>>>>
>>>>> Anything I should check again?
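For reference, this is the anchoring pattern the original report says was checked. A minimal sketch of a hypothetical pass-through bolt: the outgoing tuple is anchored by passing the input tuple to emit, and the input is acked explicitly once handled.

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical pass-through bolt illustrating anchored emits and explicit acks.
public class PassThroughBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Anchor the outgoing tuple to the input so the tuple tree stays intact...
        collector.emit(input, new Values(input.getValue(0)));
        // ...and ack the input once it has been handled.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));
    }
}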

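The settings Taylor asks about are normally applied to the topology Config at submission time. A hedged sketch using the values mentioned in the thread (max spout pending of 1000, the default 30 second timeout, 5 ackers); the topology name and the TestSpout/PassThroughBolt classes are the hypothetical ones from the sketches above.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new TestSpout(), 1);      // single spout instance, as in the thread
        builder.setBolt("pass", new PassThroughBolt(), 3)   // hypothetical bolt from the sketch above
               .shuffleGrouping("spout");

        Config conf = new Config();
        conf.setMaxSpoutPending(1000);    // topology.max.spout.pending
        conf.setMessageTimeoutSecs(30);   // topology.message.timeout.secs
        conf.setNumAckers(5);             // topology.acker.executors
        conf.setNumWorkers(3);

        StormSubmitter.submitTopology("failover-test", conf, builder.createTopology());
    }
}

If Storm UI shows the complete latency creeping up towards topology.message.timeout.secs, raising the timeout or lowering the max spout pending value is the usual first adjustment.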