Please find the bolt section of the Storm UI for the indexing topology: http://imgur.com/a/tFkmO
As you can see, an HDFS error has also appeared, which is not important right now.

On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]> wrote:

> What's curious is the enrichment topology showing the same issues, but my mind went to ES as well.

On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:

> Yes. Which bolt is reporting all those failures? My theory is that there is some ES tuning that needs to be done.

On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:

> Could I see a little more of that screen? Specifically, what the bolts look like.

On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:

> Please find the Storm UI screenshot here: http://imgur.com/FhIrGFd

On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:

> Hi Casey,
>
> - topology.message.timeout: It was 30s at first. I have increased it to 300s; no change.
> - It is a very basic geo-enrichment and a simple rule for threat triage.
> - No, not at all.
> - I have changed that to find the best value. It is 5000, which comes to about 5 MB.
> - I have changed the number of executors for the Storm acker thread, and I have also changed the value of topology.max.spout.pending; still no change.

On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:

> Also,
> * What's your setting for topology.message.timeout?
> * You said you're seeing this in indexing and enrichment; what enrichments do you have in place?
> * Is ES being taxed heavily?
> * What's your ES batch size for the sensor?

On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:

> So you're seeing failures in the Storm topology but no errors in the logs.
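For reference, the per-sensor Elasticsearch batch size discussed above lives in the sensor's indexing configuration, which Metron stores in ZooKeeper. A minimal sketch, with the sensor name and values purely illustrative (the exact layout differs between Metron versions):

```json
{
  "elasticsearch": {
    "index": "my_sensor",
    "batchSize": 100,
    "enabled": true
  }
}
```

A large batch interacts badly with a short tuple timeout: if a batch takes longer to fill and flush than topology.message.timeout.secs allows, the tuples waiting in it can be replayed even though they are eventually indexed, which would match the symptom here of high failure rates while the messages still show up in Elasticsearch.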
> Would you mind sending over a screenshot of the indexing topology from the Storm UI? You might not be able to paste the image on the mailing list, so maybe an imgur link would be in order.
>
> Thanks,
>
> Casey

On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:

> Hi Ryan,
>
> No, I cannot see any errors inside the indexing error topic. Also, the number of tuples emitted and transferred to the error indexing bolt is zero.

On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:

> Do you see any errors in the error* index in Elasticsearch? There are several catch blocks across the different topologies that transform errors into JSON objects and forward them on to the indexing topology. If you're not seeing anything in the worker logs, it's likely the errors were captured there instead.
>
> Ryan

On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:

> No, everything is fine at the log level. Also, when I checked resource consumption on the workers, there were still plenty of resources available.

On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:

> Seeing anything in the Storm logs for the workers?

On Fri, Apr 21, 2017 at 07:41, Ali Nazemian <[email protected]> wrote:

> Hi all,
>
> After trying to tune Metron performance, I have noticed that the failure rate for the indexing/enrichment topologies is very high (about 95%). However, I can see the messages in Elasticsearch.
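Ryan's suggestion of checking the error* index can be done from the command line; a sketch, assuming Elasticsearch is reachable on localhost:9200 (host, port, and index pattern are illustrative and should be adjusted to your cluster):

```shell
# Look at a few recent documents in the error* indices, if any exist
curl -s 'http://localhost:9200/error*/_search?size=5&pretty'

# Or just count the captured errors
curl -s 'http://localhost:9200/error*/_count?pretty'
```

An empty hit count here, combined with clean worker logs, suggests the tuples are failing on timeout rather than on an exception anywhere in the topology.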
> I have tried to increase the timeout value for the acknowledgement, but it didn't fix the problem. I can set the number of acker executors to 0 to fix the problem temporarily, which is not a good idea at all. Do you have any idea what could have caused this issue? The percentage of failures decreases when the parallelism is reduced, but even without any parallelism it is still high.
>
> Cheers,
> Ali

--
A.Nazemian
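For anyone following along, the Storm knobs mentioned in this thread are ordinary topology configuration settings; a minimal sketch of the relevant keys, with illustrative values rather than recommendations:

```yaml
# Storm settings discussed in this thread (values illustrative)
topology.message.timeout.secs: 300   # how long a tuple tree may stay un-acked before it is replayed
topology.max.spout.pending: 500      # cap on in-flight tuple trees per spout task
topology.acker.executors: 4          # 0 disables acking entirely
```

Note that setting topology.acker.executors to 0 turns off at-least-once tracking altogether, which is why the failure count drops: the failures are no longer being measured, not actually fixed.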
