No, nothing ...

On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]> wrote:
Anything going on in the kafka broker logs?

On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]> wrote:

Although this is a test platform with a much lower spec than production, it should be enough for indexing 600 docs per second. I have seen benchmark results of 150-200k docs per second with this spec! I haven't played with tuning the template yet, but I still think the current rate does not make sense at all.

I have changed the batch size to 100. Throughput has dropped, but there is still a very high rate of failure!

Please find the screenshots for the enrichments:
http://imgur.com/a/ceC8f
http://imgur.com/a/sBQwM

On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]> wrote:

Ok, yeah, those latencies are pretty high. I think what's happening is that the tuples aren't being acked fast enough and are timing out. How taxed is your ES box? Can you drop the batch size down to maybe 100 and see what happens?

On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <[email protected]> wrote:

Please find the bolt part of the Storm UI related to the indexing topology:

http://imgur.com/a/tFkmO

As you can see, an HDFS error has also appeared, which is not important right now.

On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]> wrote:

What's curious is the enrichment topology showing the same issues, but my mind went to ES as well.

On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:

Yes, which bolt is reporting all those failures? My theory is that there is some ES tuning that needs to be done.

On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:

Could I see a little more of that screen? Specifically what the bolts look like.

On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:

Please find the Storm UI screenshot as follows.

http://imgur.com/FhIrGFd

On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:

Hi Casey,

- topology.message.timeout: It was 30s at first. I have increased it to 300s, no changes!
- It is a very basic geo-enrichment and a simple rule for threat triage!
- No, not at all.
- I have changed that to find the best value. It is 5000, which is about 5 MB.
- I have changed the number of executors for the Storm acker thread, and I have also changed the value of topology.max.spout.pending, still no changes!
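As a back-of-envelope illustration of the timing-out hypothesis above, here is a minimal sketch using the figures quoted in this thread (the 5000 batch size, roughly 600 docs per second, and the original 30 s message timeout); the writer parallelism of 4 is a made-up number for illustration only:

    # Figures quoted in this thread; writer_parallelism is an assumption.
    docs_per_sec = 600        # overall ingest rate
    batch_size = 5000         # ES writer batch size before it was dropped to 100
    timeout_secs = 30         # original topology.message.timeout.secs
    writer_parallelism = 4    # hypothetical number of indexing writer executors

    # Each writer only sees its share of the stream, so one batch takes this long to fill:
    fill_secs = batch_size / (docs_per_sec / writer_parallelism)
    print(f"~{fill_secs:.0f}s to fill one batch per writer")  # ~33s on these numbers

    # Tuples sitting in an unflushed batch stay un-acked; once the fill time (plus the
    # ES bulk latency) exceeds the message timeout, the spout replays them and Storm
    # counts a failure even though the document is eventually indexed.
    if fill_secs > timeout_secs:
        print("tuples time out before the batch flushes")
    else:
        print("batch flushes within the timeout")

On those assumed numbers a tuple would be replayed before its batch is ever flushed, which would show up as Storm failures even though the documents still reach Elasticsearch. It does not explain why raising the timeout to 300 s changed nothing, so slow ES bulk responses remain the other suspect.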
On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:

Also,
* what's your setting for topology.message.timeout?
* You said you're seeing this in indexing and enrichment, what enrichments do you have in place?
* Is ES being taxed heavily?
* What's your ES batch size for the sensor?

On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:

So you're seeing failures in the storm topology but no errors in the logs. Would you mind sending over a screenshot of the indexing topology from the storm UI? You might not be able to paste the image on the mailing list, so maybe an imgur link would be in order.

Thanks,

Casey

On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:

Hi Ryan,

No, I cannot see any error inside the indexing error topic. Also, the number of tuples emitted and transferred to the error indexing bolt is zero!

On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:

Do you see any errors in the error* index in Elasticsearch? There are several catch blocks across the different topologies that transform errors into JSON objects and forward them on to the indexing topology. If you're not seeing anything in the worker logs, it's likely the errors were captured there instead.

Ryan

On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:

No, everything is fine at the log level. Also, when I checked resource consumption on the workers, there were still plenty of resources available!

On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:

Seeing anything in the storm logs for the workers?

On Fri, Apr 21, 2017 at 07:41 Ali Nazemian <[email protected]> wrote:

Hi all,

After trying to tune Metron performance, I have noticed the rate of failure for the indexing/enrichment topologies is very high (about 95%). However, I can see the messages in Elasticsearch. I have tried to increase the timeout value for the acknowledgement; it didn't fix the problem. I can set the number of acker executors to 0 to temporarily fix the problem, which is not a good idea at all. Do you have any idea what has caused this issue? The percentage of failures decreases by reducing the parallelism, but even without any parallelism, it is still high!

Cheers,
Ali
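For completeness, a minimal sketch of how to double-check the error* index mentioned above with a plain search request; the host and port are assumptions for a default Elasticsearch install, and the fields of the captured error documents vary by Metron version:

    import requests

    ES_URL = "http://localhost:9200"   # hypothetical ES endpoint; adjust for your cluster

    # Errors captured by the topologies are indexed as JSON documents, so a wildcard
    # search over the error indices shows whether anything was written there at all.
    resp = requests.get(f"{ES_URL}/error*/_search", params={"size": 5})
    resp.raise_for_status()
    hits = resp.json()["hits"]
    print("error documents found:", hits["total"])
    for hit in hits["hits"]:
        print(hit["_index"], hit["_source"])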

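Similarly, the per-bolt acked/failed counters being compared through Storm UI screenshots in this thread can be pulled from the Storm UI REST API; this is only a sketch, and the UI address and the 10-minute window are assumptions:

    import requests

    STORM_UI = "http://localhost:8744"   # hypothetical Storm UI address; stock Storm defaults to 8080

    summary = requests.get(f"{STORM_UI}/api/v1/topology/summary").json()
    for topo in summary["topologies"]:
        detail = requests.get(f"{STORM_UI}/api/v1/topology/{topo['id']}",
                              params={"window": "600"}).json()
        print(topo["name"])
        for bolt in detail.get("bolts", []):
            # Same acked/failed numbers the Storm UI shows per bolt.
            print("  ", bolt["boltId"], "acked:", bolt["acked"], "failed:", bolt["failed"])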