The issue is https://issues.apache.org/jira/browse/STORM-406
On Tue, Nov 4, 2014 at 10:22 AM, Devang Shah <[email protected]> wrote:

> Thanks Sean.
>
> We are using 0.9.2.
>
> We have not tried reproducing the issue with 0.9.3. We will try it and confirm back.
>
> Any idea when Storm 0.9.3 will be available for use in production environments?
>
> Thanks and Regards,
> Devang
>
> On 3 Nov 2014 11:38, "Sean Zhong" <[email protected]> wrote:
>
>> Hi Devang,
>>
>> Which Storm version are you using?
>> You may want to check STORM-404 and STORM-329.
>>
>> Sean
>>
>> On Mon, Nov 3, 2014 at 9:27 AM, Devang Shah <[email protected]> wrote:
>>
>>> Thanks much for notifying.
>>>
>>> Would you know the bug id? I did refer to the change log of 0.9.3 but could not get hold of the bug id. Incidentally, I too have raised a JIRA and would like to close it with a reference to the previously raised one. Thanks.
>>>
>>> On 31 Oct 2014 21:49, "M.Tarkeshwar Rao" <[email protected]> wrote:
>>>
>>>> Yes, it is the bug raised by Denigel. It is fixed in 0.9.3; please use that. Alternatively, use ZeroMQ in place of Netty and your problem will be resolved.
>>>>
>>>> On 27 Oct 2014 20:52, "Devang Shah" <[email protected]> wrote:
>>>>
>>>>> It seems to be a bug in Storm unless someone confirms otherwise.
>>>>>
>>>>> How can I file a bug for Storm?
>>>>>
>>>>> On 25 Oct 2014 07:51, "Devang Shah" <[email protected]> wrote:
>>>>>
>>>>>> You are correct, Taylor. Sorry, I missed mentioning all the details.
>>>>>>
>>>>>> We have topology.max.spout.pending set to 1000 and we have not modified topology.message.timeout.secs (default 30 secs).
>>>>>>
>>>>>> Another observation:
>>>>>> When I deliberately bring down the worker (kill -9) and the worker is brought back up on the same port it was previously running on, Storm starts failing all the messages despite them being processed successfully. If the worker is brought up on a different supervisor port, the issue doesn't seem to occur.
>>>>>>
>>>>>> Example steps:
>>>>>> 1. A worker is running on supervisor slot 6703 (this worker runs the single spout instance of our topology) and everything runs fine. Messages get processed and acked back to the message provider. If I let it run in this state it can process any number of messages.
>>>>>> 2. I bring down the Java process with kill -9.
>>>>>> 3. The supervisor brings the worker up on the same slot, 6703, along with the spout task instance.
>>>>>> 4. All the messages get processed fine, but the ackers fail every message the topology processed after the default 30-second timeout. This happens even when the topology is idle and I push a single message into it, so my guess is that increasing the timeout will not help (though I have not tried it).
>>>>>> 5. If the supervisor brings the worker up on a different slot, say 6700, the issue doesn't seem to occur. Probably a bug in Storm.
>>>>>>
>>>>>> Steps to simulate the behaviour:
>>>>>> 1. Run the topology (a single spout instance and multiple instances of the bolts) with multiple workers.
>>>>>> 2. Identify the slot on which the single spout instance is running and kill it.
>>>>>> 3. Check whether the supervisor started the worker on the same port. If not, repeat step 2 until the worker comes up on the same slot as before.
>>>>>> 4. Pump a message into the topology.
>>>>>> 5. You will see the message being processed successfully and also the ackers failing it. This can be verified by logging statements in the ack and fail methods of the spout.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Devang
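[Editorial note: a minimal sketch of what the ack/fail logging mentioned above might look like, assuming a Storm 0.9.x spout extending BaseRichSpout. The class name, payload, and message-id scheme are illustrative, not the poster's actual code.]

    import java.util.Map;
    import java.util.UUID;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class LoggingSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // Emit with an explicit message id so the ack/fail callbacks fire.
            String msgId = UUID.randomUUID().toString();
            collector.emit(new Values("payload"), msgId);
        }

        @Override
        public void ack(Object msgId) {
            // In the healthy case these should line up with what the last bolt writes downstream.
            System.out.println("ACK  " + msgId + " at " + System.currentTimeMillis());
        }

        @Override
        public void fail(Object msgId) {
            // In the scenario described above, these appear roughly
            // topology.message.timeout.secs (default 30s) after the emit,
            // even though the tuple tree was fully processed.
            System.out.println("FAIL " + msgId + " at " + System.currentTimeMillis());
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("payload"));
        }
    }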
>>>>>> On 25 Oct 2014 04:34, "P. Taylor Goetz" <[email protected]> wrote:
>>>>>>
>>>>>>> My guess is that you are getting timeouts.
>>>>>>>
>>>>>>> Do you have topology.max.spout.pending set? If so, what is the value? Have you overridden topology.message.timeout.secs (the default is 30 seconds)?
>>>>>>>
>>>>>>> Look in Storm UI for the complete latency of the topology. Is it close to or greater than topology.message.timeout.secs?
>>>>>>>
>>>>>>> -Taylor
>>>>>>>
>>>>>>> On Oct 23, 2014, at 12:44 PM, Devang Shah <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> I am facing an issue with one of our failover tests. Storm fails all the messages after worker restarts.
>>>>>>>
>>>>>>> Steps done:
>>>>>>> 0. 1 spout, 3 bolts, 5 ackers
>>>>>>> 1. Pre-load tibems with 50k messages
>>>>>>> 2. Start the topology
>>>>>>> 3. Let it run for a brief time and then kill the worker where the spout is executing (the spout in our topology is a single instance)
>>>>>>> 4. The worker is brought back up by the supervisor automatically
>>>>>>>
>>>>>>> Observation/query:
>>>>>>> When the spout starts pumping data into the topology again, Storm starts failing the messages even though they are processed successfully (I have verified this, as our last bolt pushes data to Kafka and the incoming/Kafka message numbers match). I have checked the tuple anchoring and that seems to be fine, since without the worker restart the topology acks and processes messages fine.
>>>>>>>
>>>>>>> Anything I should check again?
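[Editorial note: for reference, a minimal sketch of how the two settings Taylor asks about can be set with the Storm 0.9.x Config API. The values shown (1000 pending tuples, 30 seconds, 5 ackers) are the ones mentioned in the thread; the class and method names wrapping them are illustrative only.]

    import backtype.storm.Config;

    public class TimeoutConfigSketch {
        // Builds the conf that would be passed to StormSubmitter.submitTopology(...).
        public static Config buildConf() {
            Config conf = new Config();
            // topology.max.spout.pending: cap on tuples emitted but not yet acked/failed, per spout task.
            conf.setMaxSpoutPending(1000);
            // topology.message.timeout.secs: how long a tuple tree may stay incomplete
            // before the spout's fail() is called (default 30).
            conf.setMessageTimeoutSecs(30);
            // The thread mentions 5 acker tasks.
            conf.setNumAckers(5);
            return conf;
        }
    }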
