I have disabled the reliability retry by setting the number of
acker executors to zero. Based on the number of tuples emitted by the
indexing topologies and the number of documents in Elasticsearch, there
are almost no missing documents. It seems that, for some reason, the
acker executors cannot pick up the acknowledgements for the indexing and
enrichment topologies, even though the data does arrive at the
destination of those topologies.
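
For reference, at the Storm level the knob I changed boils down to the
following (a minimal sketch; Metron actually takes this value from its
topology properties rather than from code like this):

    import org.apache.storm.Config;

    public class AckerSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            // topology.acker.executors = 0: tuple tracking is disabled, so
            // nothing is ever reported as failed -- and nothing is replayed.
            conf.setNumAckers(0);
            // conf.setNumAckers(1); // what I would restore once the root cause is found
        }
    }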

I am also wondering what the best approach would be to find the failed
tuples. I thought I could find them in the corresponding error topics,
but that does not seem to be the case.
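
In case it helps, this is roughly how I am checking the error topic (a
sketch only; the broker address and the topic name "indexing_error" are
placeholders for whatever your setup actually uses):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ErrorTopicCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:6667"); // placeholder broker
            props.put("group.id", "error-topic-check");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("indexing_error")); // placeholder topic
                // A single poll is enough for a quick look; loop if you need more.
                ConsumerRecords<String, String> records = consumer.poll(10000);
                System.out.println("records found: " + records.count());
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }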

On Sat, Apr 22, 2017 at 2:36 PM, Ali Nazemian <[email protected]> wrote:

> Does the following fact ring any bell?
>
> There are no failures at the bolt-level acknowledgement, but according
> to the topology status the failure rate is very high! The same scenario
> holds for both the indexing and enrichment topologies.
>
> On Sat, Apr 22, 2017 at 2:29 PM, Ali Nazemian <[email protected]>
> wrote:
>
>> The value of topology.max.spout.pending is currently 1000. I did
>> decrease it previously to understand the effect of that value on my
>> problem. Throughput clearly dropped, but the failure rate is still very high!
>>
>> On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <[email protected]> wrote:
>>
>>> Ok, so ignoring the indexing topology, the fact that you're seeing
>>> failures in the enrichment topology, which has no ES component, is
>>> telling.  It's also telling that the enrichment topology stats are
>>> perfectly sensible latency-wise (i.e. it's not sweating).
>>>
>>> What's your storm configuration for topology.max.spout.pending?  If
>>> it's not set, then try setting it to 1000 and bouncing the topologies.
>>>
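For anyone following along, the Storm-level equivalent of what Casey is
suggesting looks roughly like this (a sketch; in Metron this value normally
comes from the topology properties rather than code):

    import org.apache.storm.Config;

    public class SpoutPendingSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            // topology.max.spout.pending: caps the number of un-acked tuples a
            // spout keeps in flight. Left unset it is unbounded, which can let
            // tuples queue up past topology.message.timeout.secs and get
            // marked as failed.
            conf.setMaxSpoutPending(1000);
        }
    }
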
>>> On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <[email protected]>
>>> wrote:
>>>
>>>> No, nothing ...
>>>>
>>>> On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]>
>>>> wrote:
>>>>
>>>>> Anything going on in the kafka broker logs?
>>>>>
>>>>> On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Although this is a test platform with a much lower spec than
>>>>>> production, it should be enough for indexing 600 docs per second. I have
>>>>>> seen benchmark results of 150-200k docs per second with this spec! I
>>>>>> haven't played with tuning the template yet, but I still think the
>>>>>> current rate does not make sense at all.
>>>>>>
>>>>>> I have changed the batch size to 100. Throughput has dropped,
>>>>>> but the failure rate is still very high!
>>>>>>
>>>>>> Please find the screenshots for the enrichments:
>>>>>> http://imgur.com/a/ceC8f
>>>>>> http://imgur.com/a/sBQwM
>>>>>>
>>>>>> On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok, yeah, those latencies are pretty high.  I think what's happening
>>>>>>> is that the tuples aren't being acked fast enough and are timing out.
>>>>>>> How taxed is your ES box?  Can you drop the batch size down to maybe
>>>>>>> 100 and see what happens?
>>>>>>>
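A rough back-of-envelope check on the batch-size angle, using the figures
quoted in this thread (600 docs/sec, batch size 5000, 30 s message timeout --
all assumptions, and it further assumes the ES writer flushes only when a
batch is full):

    public class BatchTimingSketch {
        public static void main(String[] args) {
            double docsPerSecond = 600;   // observed ingest rate (from this thread)
            int batchSize = 5000;         // original ES writer batch size
            int maxSpoutPending = 1000;   // topology.max.spout.pending
            int messageTimeoutSecs = 30;  // original topology.message.timeout.secs

            double secondsToFillBatch = batchSize / docsPerSecond;
            System.out.printf("A batch of %d fills in ~%.1f s (timeout: %d s)%n",
                    batchSize, secondsToFillBatch, messageTimeoutSecs);
            if (maxSpoutPending < batchSize) {
                // If the writer flushes only on a full batch, a pending cap below
                // the batch size means the batch may never fill at all, so every
                // tuple eventually hits the message timeout and is marked failed
                // even though it is indexed later.
                System.out.println("max.spout.pending < batch size: batch may never fill");
            }
        }
    }
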
>>>>>>> On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Please find the bolts section of the Storm UI for the indexing topology:
>>>>>>>>
>>>>>>>> http://imgur.com/a/tFkmO
>>>>>>>>
>>>>>>>> As you can see, an HDFS error has also appeared, which is not
>>>>>>>> important right now.
>>>>>>>>
>>>>>>>> On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What's curious is the enrichment topology showing the same issues,
>>>>>>>>> but my mind went to ES as well.
>>>>>>>>>
>>>>>>>>> On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, which bolt is reporting all those failures?  My theory is
>>>>>>>>>> that there is some ES tuning that needs to be done.
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Could I see a little more of that screen?  Specifically what the
>>>>>>>>>>> bolts look like.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Please find the Storm UI screenshot below.
>>>>>>>>>>>>
>>>>>>>>>>>> http://imgur.com/FhIrGFd
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Casey,
>>>>>>>>>>>>>
>>>>>>>>>>>>> - topology.message.timeout: It was 30s at first. I have
>>>>>>>>>>>>> increased it to 300s; no change!
>>>>>>>>>>>>> - It is a very basic geo-enrichment and a simple rule for
>>>>>>>>>>>>> threat triage!
>>>>>>>>>>>>> - No, not at all.
>>>>>>>>>>>>> - I have changed that to find the best value. It is currently
>>>>>>>>>>>>> 5000, which is about 5 MB.
>>>>>>>>>>>>> - I have changed the number of executors for the Storm acker
>>>>>>>>>>>>> thread, and I have also changed the value of
>>>>>>>>>>>>> topology.max.spout.pending; still no changes!
>>>>>>>>>>>>>
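The message-timeout knob mentioned in the list above, at the Storm level
(again just a sketch; Metron reads this from its topology properties):

    import org.apache.storm.Config;

    public class MessageTimeoutSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            // topology.message.timeout.secs: how long a tuple may stay
            // un-acked before Storm counts it as failed and, with ackers
            // enabled, replays it from the spout.
            conf.setMessageTimeoutSecs(300);
        }
    }
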
>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also,
>>>>>>>>>>>>>> * what's your setting for topology.message.timeout?
>>>>>>>>>>>>>> * You said you're seeing this in indexing and enrichment,
>>>>>>>>>>>>>> what enrichments do you have in place?
>>>>>>>>>>>>>> * Is ES being taxed heavily?
>>>>>>>>>>>>>> * What's your ES batch size for the sensor?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So you're seeing failures in the storm topology but no
>>>>>>>>>>>>>>> errors in the logs.  Would you mind sending over a screenshot
>>>>>>>>>>>>>>> of the indexing topology from the storm UI?  You might not be
>>>>>>>>>>>>>>> able to paste the image on the mailing list, so maybe an imgur
>>>>>>>>>>>>>>> link would be in order.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Casey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No, I cannot see any errors inside the indexing error topic.
>>>>>>>>>>>>>>>> Also, the number of tuples emitted and transferred to the
>>>>>>>>>>>>>>>> error indexing bolt is zero!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you see any errors in the error* index in
>>>>>>>>>>>>>>>>> Elasticsearch?  There are several catch blocks across the
>>>>>>>>>>>>>>>>> different topologies that transform errors into JSON objects
>>>>>>>>>>>>>>>>> and forward them on to the indexing topology.  If you're not
>>>>>>>>>>>>>>>>> seeing anything in the worker logs, it's likely the errors
>>>>>>>>>>>>>>>>> were captured there instead.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>
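For anyone else hunting for these, a quick way to peek at whatever landed
in the error* indices (a sketch; the host and port are assumptions, so
point it at your own ES node):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ErrorIndexCheck {
        public static void main(String[] args) throws Exception {
            // Search the error* indices and print the raw JSON response.
            URL url = new URL("http://localhost:9200/error*/_search?size=5&pretty");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
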
>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> No, everything is fine at the log level. Also, when I
>>>>>>>>>>>>>>>>>> checked resource consumption on the workers, there were
>>>>>>>>>>>>>>>>>> still plenty of resources available!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Seeing anything in the storm logs for the workers?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 07:41 Ali Nazemian <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> After trying to tune Metron's performance, I have
>>>>>>>>>>>>>>>>>>>> noticed that the failure rate for the indexing/enrichment
>>>>>>>>>>>>>>>>>>>> topologies is very high (about 95%). However, I can see
>>>>>>>>>>>>>>>>>>>> the messages in Elasticsearch. I have tried to increase
>>>>>>>>>>>>>>>>>>>> the timeout value for the acknowledgement, but it didn't
>>>>>>>>>>>>>>>>>>>> fix the problem. I can set the number of acker executors
>>>>>>>>>>>>>>>>>>>> to 0 to temporarily work around the problem, which is not
>>>>>>>>>>>>>>>>>>>> a good idea at all. Do you have any idea what has caused
>>>>>>>>>>>>>>>>>>>> this issue? The failure percentage decreases when I reduce
>>>>>>>>>>>>>>>>>>>> the parallelism, but even without any parallelism it is
>>>>>>>>>>>>>>>>>>>> still high!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Ali
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> A.Nazemian
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> A.Nazemian
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> A.Nazemian
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> A.Nazemian
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> A.Nazemian
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> A.Nazemian
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> A.Nazemian
>>>>
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian
