So what is perplexing is that the latency is low and the capacity for each bolt is less than 1, meaning the bolts are keeping up. I would have expected this kind of failure if the latency were high and timeouts were happening.
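(For context on why a capacity under 1 reads as "keeping up": the Storm UI computes a bolt's capacity as roughly the executed count times the average execute latency, divided by the length of the measurement window, i.e. the fraction of the window the bolt spent busy executing. A back-of-the-envelope sketch in Java with made-up numbers, not values taken from the screenshots in this thread:)

    public class CapacitySketch {
        public static void main(String[] args) {
            // Illustrative numbers only.
            long executed = 360_000;        // tuples executed in the window
            double executeLatencyMs = 1.2;  // average execute latency per tuple
            double windowMs = 600_000;      // Storm UI's default 10-minute window
            double capacity = executed * executeLatencyMs / windowMs;
            System.out.println(capacity);   // ~0.72: busy ~72% of the window, still keeping up
        }
    }

Anything approaching 1.0 would mean the bolt is saturated, which is not what we are seeing here.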
If you drop the spout pending config lower, do you get to a point with no errors (with obvious consequences to throughput)? Also, how many ackers are you running?

On Sat, Apr 22, 2017 at 00:50 Ali Nazemian <[email protected]> wrote:

> I have disabled the reliability retry by setting the number of acker executors to zero. Based on the number of tuples emitted by the indexing topologies and the number of documents in Elasticsearch, there are almost no missing documents. It seems that for some reason the acker executors cannot pick up the acknowledgements for the indexing and enrichment topologies, even though the data can be seen at the destination of those topologies.
>
> I am also wondering what the best approach would be for finding the failed tuples. I thought I could find them in the corresponding error topics, but that does not seem to be the case.
>
> On Sat, Apr 22, 2017 at 2:36 PM, Ali Nazemian <[email protected]> wrote:
>
>> Does the following ring any bell?
>>
>> There are no failures at the bolt-level acknowledgement, but according to the topology status the failure rate is very high! This is the same scenario for both the indexing and enrichment topologies.
>>
>> On Sat, Apr 22, 2017 at 2:29 PM, Ali Nazemian <[email protected]> wrote:
>>
>>> The value for topology.max.spout.pending is currently 1000. I did decrease it previously to understand the effect of that value on my problem. Clearly, throughput dropped, but the failure rate was still very high!
>>>
>>> On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <[email protected]> wrote:
>>>
>>>> Ok, so ignoring the indexing topology, the fact that you're seeing failures in the enrichment topology, which has no ES component, is telling. It's also telling that the enrichment topology stats are perfectly sensible latency-wise (i.e. it's not sweating).
>>>>
>>>> What's your storm configuration for topology.max.spout.pending? If it's not set, then try setting it to 1000 and bouncing the topologies.
>>>>
>>>> On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <[email protected]> wrote:
>>>>
>>>>> No, nothing ...
>>>>>
>>>>> On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]> wrote:
>>>>>
>>>>>> Anything going on in the kafka broker logs?
>>>>>>
>>>>>> On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>
>>>>>>> Although this is a test platform with a much lower spec than production, it should be enough for indexing 600 docs per second. I have seen benchmark results of 150-200k docs per second on this kind of spec! I haven't played with tuning the template yet, but I still think the current rate does not make sense at all.
>>>>>>>
>>>>>>> I have changed the batch size to 100. Throughput has dropped, but the failure rate is still very high!
>>>>>>>
>>>>>>> Please find the screenshots for the enrichments:
>>>>>>> http://imgur.com/a/ceC8f
>>>>>>> http://imgur.com/a/sBQwM
>>>>>>>
>>>>>>> On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ok, yeah, those latencies are pretty high. I think what's happening is that the tuples aren't being acked fast enough and are timing out. How taxed is your ES box? Can you drop the batch size down to maybe 100 and see what happens?
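Incidentally, for anyone following along with the tuning: the knobs this thread keeps coming back to map onto plain Storm Config settings. The snippet below is just a sketch against the vanilla Storm Java API with placeholder values; Metron normally drives these through its topology properties rather than code, so treat it as illustrative only.

    import org.apache.storm.Config;

    public class TuningKnobsSketch {
        public static Config tunedConfig() {
            Config conf = new Config();
            // Cap on un-acked tuples in flight per spout task
            // (the thread experiments with values around 1000).
            conf.setMaxSpoutPending(1000);
            // Tuples not fully acked within this window are reported as failed
            // and replayed (the thread tries 30s and then 300s).
            conf.setMessageTimeoutSecs(300);
            // Number of acker executors; 0 disables tracking entirely,
            // which hides failures rather than fixing them.
            conf.setNumAckers(4);
            return conf;
        }
    }

Whatever values you land on, the topology has to be resubmitted (bounced) for them to take effect.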
>>>>>>>> On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Please find the bolt section of the Storm UI for the indexing topology:
>>>>>>>>>
>>>>>>>>> http://imgur.com/a/tFkmO
>>>>>>>>>
>>>>>>>>> As you can see, an HDFS error has also appeared, which is not important right now.
>>>>>>>>>
>>>>>>>>> On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> What's curious is that the enrichment topology is showing the same issues, but my mind went to ES as well.
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, which bolt is reporting all those failures? My theory is that there is some ES tuning that needs to be done.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Could I see a little more of that screen? Specifically what the bolts look like.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Please find the Storm UI screenshot here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://imgur.com/FhIrGFd
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Casey,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - topology.message.timeout: it was 30s at first. I have increased it to 300s; no change!
>>>>>>>>>>>>>> - It is a very basic geo-enrichment and a simple rule for threat triage!
>>>>>>>>>>>>>> - No, not at all.
>>>>>>>>>>>>>> - I have changed that while looking for the best value. It is currently 5000, which comes to about 5MB.
>>>>>>>>>>>>>> - I have changed the number of executors for the Storm acker thread, and I have also changed the value of topology.max.spout.pending; still no change!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also,
>>>>>>>>>>>>>>> * What's your setting for topology.message.timeout?
>>>>>>>>>>>>>>> * You said you're seeing this in indexing and enrichment; what enrichments do you have in place?
>>>>>>>>>>>>>>> * Is ES being taxed heavily?
>>>>>>>>>>>>>>> * What's your ES batch size for the sensor?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So you're seeing failures in the storm topology but no errors in the logs. Would you mind sending over a screenshot of the indexing topology from the storm UI? You might not be able to paste the image on the mailing list, so maybe an imgur link would be in order.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Casey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No, I cannot see any errors in the indexing error topic. Also, the number of tuples emitted and transferred to the error indexing bolt is zero!
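A side note on hunting for the failed tuples mentioned earlier: one quick sanity check is to tail the error topic directly and see whether anything lands on it at all. Below is a minimal consumer sketch; the broker address and the topic name "indexing_error" are placeholders for whatever your error writer is actually configured with, and the poll(long) call assumes a kafka-clients 0.10.x-era client.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ErrorTopicTail {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:6667");  // placeholder broker
            props.put("group.id", "error-topic-check");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("indexing_error"));  // placeholder topic
                ConsumerRecords<String, String> records = consumer.poll(10_000L);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());  // error documents, if any, are JSON strings
                }
            }
        }
    }

If that comes back empty while the Storm UI still reports failures, it points at timeouts rather than genuine processing errors, which matches what we're seeing.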
>>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Do you see any errors in the error* index in Elasticsearch? There are several catch blocks across the different topologies that transform errors into JSON objects and forward them on to the indexing topology. If you're not seeing anything in the worker logs, it's likely the errors were captured there instead.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> No, everything is fine at the log level. Also, when I checked resource consumption on the workers, there were still plenty of resources available!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Seeing anything in the storm logs for the workers?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 07:41 Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> After trying to tune Metron's performance, I have noticed that the failure rate for the indexing/enrichment topologies is very high (about 95%). However, I can see the messages in Elasticsearch. I have tried to increase the acknowledgement timeout value; it didn't fix the problem. I can set the number of acker executors to 0 to temporarily hide the problem, which is not a good idea at all. Do you have any idea what could have caused such an issue? The failure percentage decreases when I reduce the parallelism, but even without any parallelism it is still high!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Ali
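P.S. On Ryan's point about the error* index: a quick way to peek at it without any extra tooling is a plain HTTP search against Elasticsearch. A rough sketch; the host and port are placeholders, and the error* pattern simply follows the index naming Ryan mentions above.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ErrorIndexPeek {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; point this at one of your Elasticsearch nodes.
            URL url = new URL("http://localhost:9200/error*/_search?size=5&pretty");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // any captured error documents show up here
                }
            } finally {
                conn.disconnect();
            }
        }
    }

If that search is empty and the worker logs are clean, the failures in the UI are almost certainly timeout-driven rather than real processing errors.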
