Sorry, I have noticed that even with a single Kafka spout the number of failures is very high! The parameters I was using in the previous test were wrong, so please ignore what I said about using fewer Kafka spouts; the problem still exists. The only way I could decrease the failure rate was by disabling Storm reliability!! Another weird fact is that I have this problem only for the enrichment and indexing topologies. All of the parsers are fine!
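To be explicit, by "disabling Storm reliability" I mean setting the acker executor count to zero, so the spout treats every tuple as acked as soon as it is emitted and nothing is ever replayed. Roughly the following, if it were expressed with the plain Storm 1.x Java API (a sketch for illustration only; the class name is made up, and in practice these values come from the topology properties rather than code):

import org.apache.storm.Config;

public class ReliabilityConfigSketch {
    public static void main(String[] args) {
        Config conf = new Config();

        // Reliable mode: acker executors track each tuple tree, and anything
        // not fully acked within the message timeout is replayed by the spout.
        conf.setNumAckers(1);                // topology.acker.executors
        conf.setMessageTimeoutSecs(30);      // topology.message.timeout.secs

        // "Reliability disabled": with zero ackers the spout acks every tuple
        // immediately on emit, so nothing is replayed and nothing shows up as
        // failed in the Storm UI -- at the risk of silent data loss.
        conf.setNumAckers(0);

        System.out.println(conf.get(Config.TOPOLOGY_ACKER_EXECUTORS));
    }
}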
On Sun, Apr 23, 2017 at 12:39 AM, Ali Nazemian <[email protected]> wrote:
> In response to your question about decreasing the value of spout pending: no, even with a value of 10 the failure ratio was the same. However, throughput dropped significantly.
> On Sun, Apr 23, 2017 at 12:27 AM, Ali Nazemian <[email protected]> wrote:
>> I have noticed that if I decrease the parallelism of the spouts, the failure rate drops significantly! The interesting part is that throughput stays almost the same! I was using the same value for spout parallelism as the number of partitions of the corresponding Kafka topic, which I thought was a best practice for spout parallelism.
>> The failure rate dropped to zero with the spout parallelism set to half the number of partitions! I think this probably happened because Kafka and Storm are collocated on the same host. I was using 1 acker executor per worker.
>> On Sat, Apr 22, 2017 at 11:55 PM, Casey Stella <[email protected]> wrote:
>>> So what is perplexing is that the latency is low and the capacity for each bolt is less than 1, so it's keeping up. I would have expected this kind of thing if the latency was high and timeouts were happening.
>>> If you drop the spout pending config lower, do you get to a point with no errors (at obvious consequences to throughput)? Also, how many ackers are you running?
>>> On Sat, Apr 22, 2017 at 00:50 Ali Nazemian <[email protected]> wrote:
>>>> I have disabled the reliability retry by setting the number of acker executors to zero. Based on the number of tuples emitted on the indexing topology and the number of documents in Elasticsearch, there are almost no missing documents. It seems that for some reason the acker executors cannot pick up the acknowledgements for the indexing and enrichment topologies, even though the data can be seen at the destination of those topologies.
>>>> I am also wondering what the best approach would be for finding the failed tuples. I thought I could find them in the corresponding error topics, but that does not seem to be the case.
>>>> On Sat, Apr 22, 2017 at 2:36 PM, Ali Nazemian <[email protected]> wrote:
>>>>> Does the following fact ring any bell?
>>>>> There are no failures at the bolt-level acknowledgement, but in the topology status the rate of failure is very high! It is the same scenario for both the indexing and enrichment topologies.
>>>>> On Sat, Apr 22, 2017 at 2:29 PM, Ali Nazemian <[email protected]> wrote:
>>>>>> The value of topology.max.spout.pending is currently 1000. I did decrease it previously to understand the effect of that value on my problem. Clearly, throughput dropped, but there was still a very high rate of failure!
>>>>>> On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <[email protected]> wrote:
>>>>>>> Ok, so ignoring the indexing topology, the fact that you're seeing failures in the enrichment topology, which has no ES component, is telling. It's also telling that the enrichment topology stats are perfectly sensible latency-wise (i.e. it's not sweating).
>>>>>>> What's your storm configuration for topology.max.spout.pending? If it's not set, then try setting it to 1000 and bouncing the topologies.
>>>>>>> On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>>> No, nothing ...
>>>>>>>> On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>> Anything going on in the kafka broker logs?
>>>>>>>>> On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>> Although this is a test platform with a much lower spec than production, it should be enough for indexing 600 docs per second. I have seen benchmark results of 150-200k docs per second with this spec! I haven't played with tuning the template yet, but I still think the current rate does not make sense at all.
>>>>>>>>>> I have changed the batch size to 100. Throughput has dropped, but there is still a very high rate of failure!
>>>>>>>>>> Please find the screenshots for the enrichments:
>>>>>>>>>> http://imgur.com/a/ceC8f
>>>>>>>>>> http://imgur.com/a/sBQwM
>>>>>>>>>> On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>> Ok, yeah, those latencies are pretty high. I think what's happening is that the tuples aren't being acked fast enough and are timing out. How taxed is your ES box? Can you drop the batch size down to maybe 100 and see what happens?
>>>>>>>>>>> On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>> Please find the bolt section of the Storm UI for the indexing topology:
>>>>>>>>>>>> http://imgur.com/a/tFkmO
>>>>>>>>>>>> As you can see, an HDFS error has also appeared, which is not important right now.
>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>> What's curious is the enrichment topology showing the same issues, but my mind went to ES as well.
>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:
>>>>>>>>>>>>>> Yes, which bolt is reporting all those failures? My theory is that there is some ES tuning that needs to be done.
>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>> Could I see a little more of that screen? Specifically what the bolts look like.
>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>> Please find the Storm UI screenshot as follows.
>>>>>>>>>>>>>>>> http://imgur.com/FhIrGFd
>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>> Hi Casey,
>>>>>>>>>>>>>>>>> - topology.message.timeout: It was 30s at first. I have increased it to 300s; no change!
>>>>>>>>>>>>>>>>> - It is a very basic geo-enrichment and a simple rule for threat triage!
>>>>>>>>>>>>>>>>> - No, not at all.
>>>>>>>>>>>>>>>>> - I have changed that to find the best value. It is 5000, which is about 5MB.
>>>>>>>>>>>>>>>>> - I have changed the number of executors for the Storm acker thread, and I have also changed the value of topology.max.spout.pending; still no change!
>>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> Also,
>>>>>>>>>>>>>>>>>> * what's your setting for topology.message.timeout?
>>>>>>>>>>>>>>>>>> * You said you're seeing this in indexing and enrichment, what enrichments do you have in place?
>>>>>>>>>>>>>>>>>> * Is ES being taxed heavily?
>>>>>>>>>>>>>>>>>> * What's your ES batch size for the sensor?
>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> So you're seeing failures in the storm topology but no errors in the logs. Would you mind sending over a screenshot of the indexing topology from the storm UI? You might not be able to paste the image on the mailing list, so maybe an imgur link would be in order.
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Casey
>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>>> No, I cannot see any errors inside the indexing error topic. Also, the number of tuples emitted and transferred to the error indexing bolt is zero!
>>>>>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> Do you see any errors in the error* index in Elasticsearch? There are several catch blocks across the different topologies that transform errors into json objects and forward them on to the indexing topology. If you're not seeing anything in the worker logs, it's likely the errors were captured there instead.
>>>>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> No, everything is fine at the log level. Also, when I checked resource consumption on the workers, there were still plenty of resources available!
>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Seeing anything in the storm logs for the workers?
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 07:41 Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>> After trying to tune Metron's performance, I have noticed that the failure rate for the indexing/enrichment topologies is very high (about 95%). However, I can see the messages in Elasticsearch. I have tried increasing the timeout value for the acknowledgement; it didn't fix the problem. I can set the number of acker executors to 0 to temporarily fix the problem, which is not a good idea at all. Do you have any idea what could have caused such an issue? The percentage of failures decreases by reducing the parallelism, but even without any parallelism it is still high!
>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>> Ali
--
A.Nazemian
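The timeout hypothesis discussed above (tuples waiting too long to be acked and being replayed) can be sanity-checked with a quick back-of-envelope calculation. The sketch below is hypothetical: the class name, spout-task count, and ack rate are assumptions rather than values confirmed in this thread, and it treats batch flushing as a simple delay, so substitute the figures from your own Storm UI:

// Back-of-envelope check of the tuple-timeout hypothesis, using numbers
// quoted in this thread. spoutTasks and ackedPerSec are assumed values.
public class TupleTimeoutCheck {
    public static void main(String[] args) {
        int maxSpoutPending    = 1000; // topology.max.spout.pending (per spout task)
        int spoutTasks         = 4;    // e.g. one per Kafka partition (assumed)
        double ackedPerSec     = 600;  // docs/sec actually reaching Elasticsearch
        int esBatchSize        = 5000; // sensor batch size before it was dropped to 100
        int messageTimeoutSecs = 30;   // topology.message.timeout.secs

        // Tuples the spouts keep in flight before they throttle.
        double inFlight = (double) maxSpoutPending * spoutTasks;

        // Rough worst case for one tuple: wait for the queue ahead of it to be
        // acked, plus the time it sits in a partially filled indexing batch.
        double queueWaitSecs = inFlight / ackedPerSec;
        double batchFillSecs = (double) esBatchSize / ackedPerSec;
        double worstCaseSecs = queueWaitSecs + batchFillSecs;

        System.out.printf("~%.1f s queue wait + ~%.1f s batch fill = ~%.1f s worst case%n",
                queueWaitSecs, batchFillSecs, worstCaseSecs);
        System.out.println(worstCaseSecs > messageTimeoutSecs
                ? "Exceeds topology.message.timeout.secs: tuples will be replayed and counted as failures."
                : "Fits inside topology.message.timeout.secs.");
    }
}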
