Ok, so ignoring the indexing topology, the fact that you're seeing failures
in the enrichment topology, which has no ES component, is telling.  It's
also telling that the enrichment topology stats are perfectly sensible
latency-wise (i.e. it's not sweating).

What's your storm configuration for topology.max.spout.pending?  If it's
not set, then try setting it to 1000 and bouncing the topologies.
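
To make those knobs concrete, here's a minimal sketch of how they map onto
Storm's Config API (purely illustrative; Metron normally feeds these values
in through its topology properties rather than code):

    import org.apache.storm.Config;

    public class FlowControlSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            // Cap on un-acked tuples in flight per spout task; the spout
            // stops emitting once the cap is hit.
            conf.setMaxSpoutPending(1000);    // topology.max.spout.pending
            // How long a tuple tree may remain un-acked before Storm fails
            // it and replays from the spout.
            conf.setMessageTimeoutSecs(300);  // topology.message.timeout.secs
            // Number of acker executors; 0 disables acking (and replay).
            conf.setNumAckers(1);             // topology.acker.executors
            // conf would then be passed to StormSubmitter.submitTopology(...)
        }
    }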

On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <[email protected]>
wrote:

> No, nothing ...
>
> On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]> wrote:
>
>> Anything going on in the kafka broker logs?
>>
>> On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]>
>> wrote:
>>
>>> Although this is a test platform with a much lower spec than production,
>>> it should be enough for indexing 600 docs per second. I have seen benchmark
>>> results of 150-200k docs per second on this kind of spec! I haven't played
>>> with tuning the template yet, but I still think the current rate does not
>>> make sense at all.
>>>
>>> I have changed the batch size to 100. Throughput has dropped, but there
>>> is still a very high rate of failure!
>>>
>>> Please find the screenshots for the enrichments:
>>> http://imgur.com/a/ceC8f
>>> http://imgur.com/a/sBQwM
>>>
>>> On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]>
>>> wrote:
>>>
>>>> Ok, yeah, those latencies are pretty high.  I think what's happening is
>>>> that the tuples aren't being acked fast enough and are timing out.  How
>>>> taxed is your ES box?  Can you drop the batch size down to maybe 100 and
>>>> see what happens?
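>>>>
>>>> To make the batch size concrete: in a recent Metron layout it lives in
>>>> the per-sensor indexing config in ZooKeeper, per writer, roughly like
>>>> this (the sensor/index name "yaf" is just a placeholder, and the exact
>>>> keys can differ between Metron versions):
>>>>
>>>>     {
>>>>       "elasticsearch": {
>>>>         "index": "yaf",
>>>>         "batchSize": 100,
>>>>         "enabled": true
>>>>       },
>>>>       "hdfs": {
>>>>         "index": "yaf",
>>>>         "batchSize": 100,
>>>>         "enabled": true
>>>>       }
>>>>     }
>>>>
>>>> which would then be pushed back to ZooKeeper with something like
>>>> $METRON_HOME/bin/zk_load_configs.sh.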
>>>>
>>>> On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <[email protected]>
>>>> wrote:
>>>>
>>>>> Please find the bolt section of the Storm UI for the indexing topology:
>>>>>
>>>>> http://imgur.com/a/tFkmO
>>>>>
>>>>> As you can see, an HDFS error has also appeared, which is not important
>>>>> right now.
>>>>>
>>>>> On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> What's curious is that the enrichment topology is showing the same
>>>>>> issues, but my mind went to ES as well.
>>>>>>
>>>>>> On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, which bolt is reporting all those failures?  My theory is that
>>>>>>> there is some ES tuning that needs to be done.
>>>>>>>
>>>>>>> On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Could I see a little more of that screen?  Specifically what the
>>>>>>>> bolts look like.
>>>>>>>>
>>>>>>>> On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Please find the storm-UI screenshot as follows.
>>>>>>>>>
>>>>>>>>> http://imgur.com/FhIrGFd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Casey,
>>>>>>>>>>
>>>>>>>>>> - topology.message.timeout: It was 30s at first. I have increased
>>>>>>>>>> it to 300s, with no change!
>>>>>>>>>> - It is a very basic geo enrichment and a simple rule for threat
>>>>>>>>>> triage.
>>>>>>>>>> - No, not at all.
>>>>>>>>>> - I have changed that while looking for the best value; it is
>>>>>>>>>> currently 5000, which comes to about 5 MB.
>>>>>>>>>> - I have changed the number of executors for the Storm acker
>>>>>>>>>> thread, and I have also changed the value of
>>>>>>>>>> topology.max.spout.pending, still with no change!
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Also,
>>>>>>>>>>> * what's your setting for topology.message.timeout?
>>>>>>>>>>> * You said you're seeing this in indexing and enrichment; what
>>>>>>>>>>> enrichments do you have in place?
>>>>>>>>>>> * Is ES being taxed heavily?
>>>>>>>>>>> * What's your ES batch size for the sensor?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> So you're seeing failures in the Storm topology but no errors
>>>>>>>>>>>> in the logs.  Would you mind sending over a screenshot of the
>>>>>>>>>>>> indexing topology from the Storm UI?  You might not be able to
>>>>>>>>>>>> paste the image on the mailing list, so maybe an imgur link would
>>>>>>>>>>>> be in order.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Casey
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, I cannot see any errors inside the indexing error topic.
>>>>>>>>>>>>> Also, the number of tuples emitted and transferred to the error
>>>>>>>>>>>>> indexing bolt is zero!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you see any errors in the error* index in Elasticsearch?
>>>>>>>>>>>>>> There are several catch blocks across the different topologies
>>>>>>>>>>>>>> that transform errors into JSON objects and forward them on to
>>>>>>>>>>>>>> the indexing topology.  If you're not seeing anything in the
>>>>>>>>>>>>>> worker logs, it's likely the errors were captured there instead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> No, everything is fine at the log level. Also, when I checked
>>>>>>>>>>>>>>> resource consumption on the workers, there were still plenty of
>>>>>>>>>>>>>>> resources available!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Seeing anything in the storm logs for the workers?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 07:41 Ali Nazemian <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After trying to tune Metron's performance, I have noticed
>>>>>>>>>>>>>>>>> that the failure rate for the indexing/enrichment topologies
>>>>>>>>>>>>>>>>> is very high (about 95%). However, I can see the messages in
>>>>>>>>>>>>>>>>> Elasticsearch. I have tried increasing the acknowledgement
>>>>>>>>>>>>>>>>> timeout, but it didn't fix the problem. I can set the number
>>>>>>>>>>>>>>>>> of acker executors to 0 to work around the problem
>>>>>>>>>>>>>>>>> temporarily, which is not a good idea at all. Do you have any
>>>>>>>>>>>>>>>>> idea what could have caused such an issue? The failure
>>>>>>>>>>>>>>>>> percentage decreases when I reduce parallelism, but even
>>>>>>>>>>>>>>>>> without any parallelism it is still high!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Ali
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> A.Nazemian
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> A.Nazemian
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> A.Nazemian
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> A.Nazemian
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> A.Nazemian
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> A.Nazemian
>>>
>>
>>
>
>
> --
> A.Nazemian
>
