I have noticed that if I decrease the spout parallelism, the failure rate drops significantly! The interesting part is that throughput stays almost the same. I had been setting the spout parallelism equal to the number of partitions of the corresponding Kafka topic, which I thought was the best practice for spout parallelism. With the spout parallelism set to half the number of partitions, the failure rate dropped to zero! I suspect this happened because Kafka and Storm are collocated on the same host. I was using 1 acker executor per worker.
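For anyone following along, the setup described above maps to something like the following in Storm's Java API. This is only a minimal, illustrative sketch: the partition count, worker count, and component names are made up, and TestWordSpout stands in for the real Kafka spout; it is not Metron's actual topology code.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;

    public class SpoutParallelismSketch {
        public static void main(String[] args) throws Exception {
            int kafkaPartitions = 10;                    // partition count of the input topic (made up here)
            int spoutParallelism = kafkaPartitions / 2;  // half the partitions, the ratio that removed the failures
            int workers = 2;                             // hypothetical worker count

            TopologyBuilder builder = new TopologyBuilder();
            // TestWordSpout stands in for the real Kafka spout; only the parallelism hint matters for this sketch.
            builder.setSpout("kafkaSpout", new TestWordSpout(), spoutParallelism);

            Config conf = new Config();
            conf.setNumWorkers(workers);
            conf.setNumAckers(workers);   // topology.acker.executors: one acker executor per worker

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("parallelism-sketch", conf, builder.createTopology());
            Thread.sleep(10_000);
            cluster.shutdown();
        }
    }

Since each Kafka partition is consumed by at most one spout executor, parallelism above the partition count only adds idle executors; dropping below it, as above, makes each executor read several partitions, which can ease contention when Kafka and Storm share a host.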
On Sat, Apr 22, 2017 at 11:55 PM, Casey Stella <[email protected]> wrote:

So what is perplexing is that the latency is low and the capacity for each bolt is less than 1, so it's keeping up. I would have expected this kind of thing if the latency was high and timeouts were happening.

If you drop the spout pending config lower, do you get to a point with no errors (at an obvious cost to throughput)? Also, how many ackers are you running?

On Sat, Apr 22, 2017 at 00:50, Ali Nazemian <[email protected]> wrote:

I have disabled the reliability retry by setting the number of acker executors to zero. Based on the number of tuples emitted on the indexing topology and the number of documents in Elasticsearch, there are almost no missing documents. It seems that for some reason the acker executors cannot pick up the acknowledgements for the indexing and enrichment topologies, even though the data shows up at the destination of those topologies.

I am also wondering what the best approach would be to find the failed tuples. I thought I could find them in the corresponding error topics, but that does not seem to be the case.

On Sat, Apr 22, 2017 at 2:36 PM, Ali Nazemian <[email protected]> wrote:

Does the following fact ring any bell?

There are no failures at the bolt-level acknowledgement, but from the topology status the failure rate is very high! This is the same scenario for both the indexing and enrichment topologies.

On Sat, Apr 22, 2017 at 2:29 PM, Ali Nazemian <[email protected]> wrote:

The value for topology.max.spout.pending is currently 1000. I decreased it previously to understand the effect of that value on my problem. Throughput clearly dropped, but there was still a very high rate of failure!

On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <[email protected]> wrote:

Ok, so ignoring the indexing topology, the fact that you're seeing failures in the enrichment topology, which has no ES component, is telling. It's also telling that the enrichment topology stats are perfectly sensible latency-wise (i.e. it's not sweating).

What's your Storm configuration for topology.max.spout.pending? If it's not set, then try setting it to 1000 and bouncing the topologies.

On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <[email protected]> wrote:

No, nothing ...

On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]> wrote:

Anything going on in the Kafka broker logs?

On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]> wrote:

Although this is a test platform with a much lower spec than production, it should be enough for indexing 600 docs per second. I have seen benchmark results of 150-200k docs per second with this spec! I haven't played with tuning the template yet, but I still think the current rate does not make sense at all.

I have changed the batch size to 100. Throughput has dropped, but there is still a very high rate of failure!

Please find the screenshots for the enrichments:
http://imgur.com/a/ceC8f
http://imgur.com/a/sBQwM
On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]> wrote:

Ok, yeah, those latencies are pretty high. I think what's happening is that the tuples aren't being acked fast enough and are timing out. How taxed is your ES box? Can you drop the batch size down to maybe 100 and see what happens?

On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <[email protected]> wrote:

Please find the bolt part of the Storm UI related to the indexing topology:

http://imgur.com/a/tFkmO

As you can see, an HDFS error has also appeared, which is not important right now.

On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]> wrote:

What's curious is the enrichment topology showing the same issues, but my mind went to ES as well.

On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:

Yes, which bolt is reporting all those failures? My theory is that there is some ES tuning that needs to be done.

On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:

Could I see a little more of that screen? Specifically what the bolts look like.

On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:

Please find the Storm UI screenshot as follows.

http://imgur.com/FhIrGFd

On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:

Hi Casey,

- topology.message.timeout: it was 30s at first. I have increased it to 300s, no change!
- It is a very basic geo enrichment and a simple rule for threat triage!
- No, not at all.
- I have changed that to find the best value. It is 5000, which comes to about 5 MB.
- I have changed the number of executors for the Storm acker thread, and I have also changed the value of topology.max.spout.pending, still no change!

On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:

Also,
* What's your setting for topology.message.timeout?
* You said you're seeing this in indexing and enrichment; what enrichments do you have in place?
* Is ES being taxed heavily?
* What's your ES batch size for the sensor?

On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:

So you're seeing failures in the storm topology but no errors in the logs. Would you mind sending over a screenshot of the indexing topology from the Storm UI? You might not be able to paste the image on the mailing list, so maybe an imgur link would be in order.

Thanks,

Casey
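As a side note for readers, the knobs discussed above (topology.message.timeout, topology.max.spout.pending, the acker executors) can also be set programmatically through Storm's Config. A minimal sketch with the values reported in this thread, not Metron's actual submission code:

    import org.apache.storm.Config;

    public class TimeoutSketch {
        /** Builds a Config with the tuning values reported in this thread. */
        public static Config tunedConfig() {
            Config conf = new Config();
            conf.setMessageTimeoutSecs(300); // topology.message.timeout.secs; Storm's default is 30
            conf.setMaxSpoutPending(1000);   // topology.max.spout.pending; caps unacked tuples per spout task
            conf.setNumAckers(1);            // topology.acker.executors; 0 disables tuple tracking entirely
            return conf;
        }
    }

In Metron these values are typically supplied through the topology properties/Flux files it ships with rather than in Java code, so treat this purely as a reference for what the setting names mean.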
On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:

Hi Ryan,

No, I cannot see any errors inside the indexing error topic. Also, the number of tuples emitted and transferred to the error indexing bolt is zero!

On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:

Do you see any errors in the error* index in Elasticsearch? There are several catch blocks across the different topologies that transform errors into JSON objects and forward them on to the indexing topology. If you're not seeing anything in the worker logs, it's likely the errors were captured there instead.

Ryan

On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:

No, everything is fine at the log level. Also, when I checked resource consumption on the workers, there were still plenty of resources available!

On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:

Seeing anything in the storm logs for the workers?

On Fri, Apr 21, 2017 at 07:41, Ali Nazemian <[email protected]> wrote:

Hi all,

After I tried to tune Metron's performance, I noticed that the failure rate for the indexing/enrichment topologies is very high (about 95%). However, I can see the messages in Elasticsearch. I have tried increasing the timeout value for the acknowledgement; it didn't fix the problem. I can set the number of acker executors to 0 to temporarily fix the problem, which is not a good idea at all. Do you have any idea what could have caused such an issue? The percentage of failures decreases by reducing the parallelism, but even without any parallelism it is still high!
Cheers,
Ali

--
A.Nazemian
