What's curious is that the enrichment topology is showing the same issues, but my mind went to ES as well.
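(As an aside, one quick way to see whether Elasticsearch back-pressure is behind the failed tuples is to watch the bulk thread pool for queueing and rejections. Below is a minimal sketch, not an official Metron utility: it assumes a hypothetical node name "es-node-1:9200", and the exact _cat/thread_pool columns vary a little between ES versions.)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class EsThreadPoolCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical host/port; point this at one of your ES data nodes.
            URL url = new URL("http://es-node-1:9200/_cat/thread_pool?v");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Look for a growing bulk queue or non-zero rejected counts:
                    // rejected bulk requests usually show up in Storm as failed tuples.
                    System.out.println(line);
                }
            }
        }
    }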
On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:

> Yes. Which bolt is reporting all those failures? My theory is that there
> is some ES tuning that needs to be done.
>
> On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:
>
>> Could I see a little more of that screen? Specifically, what the bolts
>> look like.
>>
>> On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:
>>
>>> Please find the Storm UI screenshot here:
>>>
>>> http://imgur.com/FhIrGFd
>>>
>>> On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:
>>>
>>>> Hi Casey,
>>>>
>>>> - topology.message.timeout: it was 30s at first. I have increased it
>>>>   to 300s; no change.
>>>> - It is a very basic geo-enrichment and a simple rule for threat
>>>>   triage.
>>>> - No, not at all.
>>>> - I have changed that to find the best value; it is 5000, which is
>>>>   about 5 MB.
>>>> - I have changed the number of executors for the Storm acker thread,
>>>>   and I have also changed the value of topology.max.spout.pending;
>>>>   still no change.
>>>>
>>>> On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:
>>>>
>>>>> Also,
>>>>> * What's your setting for topology.message.timeout?
>>>>> * You said you're seeing this in indexing and enrichment; what
>>>>>   enrichments do you have in place?
>>>>> * Is ES being taxed heavily?
>>>>> * What's your ES batch size for the sensor?
>>>>>
>>>>> On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:
>>>>>
>>>>>> So you're seeing failures in the Storm topology but no errors in the
>>>>>> logs. Would you mind sending over a screenshot of the indexing
>>>>>> topology from the Storm UI? You might not be able to paste the image
>>>>>> on the mailing list, so an imgur link would be in order.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Casey
>>>>>>
>>>>>> On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> No, I cannot see any errors inside the indexing error topic. Also,
>>>>>>> the number of tuples emitted and transferred to the error indexing
>>>>>>> bolt is zero!
>>>>>>>
>>>>>>> On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:
>>>>>>>
>>>>>>>> Do you see any errors in the error* index in Elasticsearch? There
>>>>>>>> are several catch blocks across the different topologies that
>>>>>>>> transform errors into JSON objects and forward them on to the
>>>>>>>> indexing topology. If you're not seeing anything in the worker
>>>>>>>> logs, it's likely the errors were captured there instead.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> No, everything is fine at the log level. Also, when I checked
>>>>>>>>> resource consumption on the workers, there were still plenty of
>>>>>>>>> resources available!
>>>>>>>>>
>>>>>>>>> On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Seeing anything in the Storm logs for the workers?
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 21, 2017 at 07:41, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> After trying to tune Metron's performance, I have noticed that
>>>>>>>>>>> the failure rate for the indexing/enrichment topologies is very
>>>>>>>>>>> high (about 95%). However, I can see the messages in
>>>>>>>>>>> Elasticsearch. I have tried to increase the timeout value for
>>>>>>>>>>> the acknowledgement; it didn't fix the problem. I can set the
>>>>>>>>>>> number of acker executors to 0 to fix the problem temporarily,
>>>>>>>>>>> which is not a good idea at all. Do you have any idea what could
>>>>>>>>>>> have caused such an issue? The percentage of failures decreases
>>>>>>>>>>> when I reduce the parallelism, but even without any parallelism
>>>>>>>>>>> it is still high!
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ali
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> A.Nazemian
>>>>>>>
>>>>>>> --
>>>>>>> A.Nazemian
>>>>
>>>> --
>>>> A.Nazemian
>>>
>>> --
>>> A.Nazemian
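For reference, the knobs discussed in this thread map onto plain Storm Config settings. The snippet below is only a minimal sketch with illustrative values (not recommended settings); in Metron these are normally set through Ambari or the topology properties/flux files rather than in code.

    import org.apache.storm.Config;

    public class TuningSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            // topology.message.timeout.secs: how long a tuple may stay
            // un-acked before Storm marks it as failed and replays it.
            conf.setMessageTimeoutSecs(300);
            // topology.max.spout.pending: caps the number of un-acked
            // tuples per spout task, throttling the spout when the
            // downstream bolts (or ES) can't keep up.
            conf.setMaxSpoutPending(500);
            // topology.acker.executors: setting this to 0 disables acking
            // entirely, which hides the failures rather than fixing them.
            conf.setNumAckers(1);
            // The conf would then be passed to StormSubmitter.submitTopology(...).
            System.out.println(conf);
        }
    }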
