Sorry, I have noticed that even with a single Kafka spout the number of failures is very high! The parameters I was using in the previous test were wrong, so please ignore what I said about using fewer Kafka spouts; the problem still exists. The only way I could decrease the failure rate was by disabling Storm reliability!! Another weird fact is that I have this problem only for the enrichment and indexing topologies. All of the parsers are fine!
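To be explicit, by "disabling Storm reliability" I mean setting the acker executor count to zero, so the spout treats every tuple as acked as soon as it is emitted and nothing is ever replayed. Roughly the following, if it were expressed with the plain Storm 1.x Java API (a sketch for illustration only; the class name is made up, and in practice these values come from the topology properties rather than code):

import org.apache.storm.Config;

public class ReliabilityConfigSketch {
    public static void main(String[] args) {
        Config conf = new Config();

        // Reliable mode: acker executors track each tuple tree, and anything
        // not fully acked within the message timeout is replayed by the spout.
        conf.setNumAckers(1);                // topology.acker.executors
        conf.setMessageTimeoutSecs(30);      // topology.message.timeout.secs

        // "Reliability disabled": with zero ackers the spout acks every tuple
        // immediately on emit, so nothing is replayed and nothing shows up as
        // failed in the Storm UI -- at the risk of silent data loss.
        conf.setNumAckers(0);

        System.out.println(conf.get(Config.TOPOLOGY_ACKER_EXECUTORS));
    }
}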
On Sun, Apr 23, 2017 at 12:39 AM, Ali Nazemian <[email protected]> wrote:
> In response to your question about decreasing the value of spout pending: no, even with a value of 10 the failure ratio was the same. However, throughput dropped significantly.
> On Sun, Apr 23, 2017 at 12:27 AM, Ali Nazemian <[email protected]> wrote:
>> I have noticed that if I decrease the parallelism of the spouts, the failure rate drops significantly! The interesting part is that throughput stays almost the same! I was using the same value for spout parallelism as the number of partitions of the corresponding Kafka topic, which I thought was a best practice for spout parallelism.
>> The failure rate dropped to zero with the spout parallelism set to half the number of partitions! I think this probably happened because Kafka and Storm are collocated on the same host. I was using 1 acker executor per worker.
>> On Sat, Apr 22, 2017 at 11:55 PM, Casey Stella <[email protected]> wrote:
>>> So what is perplexing is that the latency is low and the capacity for each bolt is less than 1, so it's keeping up. I would have expected this kind of thing if the latency was high and timeouts were happening.
>>> If you drop the spout pending config lower, do you get to a point with no errors (at obvious consequences to throughput)? Also, how many ackers are you running?
>>> On Sat, Apr 22, 2017 at 00:50 Ali Nazemian <[email protected]> wrote:
>>>> I have disabled the reliability retry by setting the number of acker executors to zero. Based on the number of tuples emitted on the indexing topology and the number of documents in Elasticsearch, there are almost no missing documents. It seems that for some reason the acker executors cannot pick up the acknowledgements for the indexing and enrichment topologies, even though the data can be seen at the destination of those topologies.
>>>> I am also wondering what the best approach would be for finding the failed tuples. I thought I could find them in the corresponding error topics, but that does not seem to be the case.
>>>> On Sat, Apr 22, 2017 at 2:36 PM, Ali Nazemian <[email protected]> wrote:
>>>>> Does the following fact ring any bell?
>>>>> There are no failures at the bolt-level acknowledgement, but in the topology status the rate of failure is very high! It is the same scenario for both the indexing and enrichment topologies.
>>>>> On Sat, Apr 22, 2017 at 2:29 PM, Ali Nazemian <[email protected]> wrote:
>>>>>> The value of topology.max.spout.pending is currently 1000. I did decrease it previously to understand the effect of that value on my problem. Clearly, throughput dropped, but there was still a very high rate of failure!
>>>>>> On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <[email protected]> wrote:
>>>>>>> Ok, so ignoring the indexing topology, the fact that you're seeing failures in the enrichment topology, which has no ES component, is telling. It's also telling that the enrichment topology stats are perfectly sensible latency-wise (i.e. it's not sweating).
>>>>>>> What's your storm configuration for topology.max.spout.pending? If it's not set, then try setting it to 1000 and bouncing the topologies.
>>>>>>> On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>>> No, nothing ...
>>>>>>>> On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>> Anything going on in the kafka broker logs?
>>>>>>>>> On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>> Although this is a test platform with a much lower spec than production, it should be enough for indexing 600 docs per second. I have seen benchmark results of 150-200k docs per second with this spec! I haven't played with tuning the template yet, but I still think the current rate does not make sense at all.
>>>>>>>>>> I have changed the batch size to 100. Throughput has dropped, but there is still a very high rate of failure!
>>>>>>>>>> Please find the screenshots for the enrichments:
>>>>>>>>>> http://imgur.com/a/ceC8f
>>>>>>>>>> http://imgur.com/a/sBQwM
>>>>>>>>>> On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>> Ok, yeah, those latencies are pretty high. I think what's happening is that the tuples aren't being acked fast enough and are timing out. How taxed is your ES box? Can you drop the batch size down to maybe 100 and see what happens?
>>>>>>>>>>> On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>> Please find the bolt section of the Storm UI for the indexing topology:
>>>>>>>>>>>> http://imgur.com/a/tFkmO
>>>>>>>>>>>> As you can see, an HDFS error has also appeared, which is not important right now.
>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>> What's curious is the enrichment topology showing the same issues, but my mind went to ES as well.
>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:
>>>>>>>>>>>>>> Yes, which bolt is reporting all those failures? My theory is that there is some ES tuning that needs to be done.
>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>> Could I see a little more of that screen? Specifically what the bolts look like.
>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>> Please find the Storm UI screenshot as follows.
>>>>>>>>>>>>>>>> http://imgur.com/FhIrGFd
>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>> Hi Casey,
>>>>>>>>>>>>>>>>> - topology.message.timeout: It was 30s at first. I have increased it to 300s; no change!
>>>>>>>>>>>>>>>>> - It is a very basic geo-enrichment and a simple rule for threat triage!
>>>>>>>>>>>>>>>>> - No, not at all.
>>>>>>>>>>>>>>>>> - I have changed that to find the best value. It is 5000, which is about 5MB.
>>>>>>>>>>>>>>>>> - I have changed the number of executors for the Storm acker thread, and I have also changed the value of topology.max.spout.pending; still no change!
>>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> Also,
>>>>>>>>>>>>>>>>>> * what's your setting for topology.message.timeout?
>>>>>>>>>>>>>>>>>> * You said you're seeing this in indexing and enrichment, what enrichments do you have in place?
>>>>>>>>>>>>>>>>>> * Is ES being taxed heavily?
>>>>>>>>>>>>>>>>>> * What's your ES batch size for the sensor?
>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> So you're seeing failures in the storm topology but no errors in the logs. Would you mind sending over a screenshot of the indexing topology from the storm UI? You might not be able to paste the image on the mailing list, so maybe an imgur link would be in order.
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Casey
>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>>> No, I cannot see any errors inside the indexing error topic. Also, the number of tuples emitted and transferred to the error indexing bolt is zero!
>>>>>>>>>>>>>>>>>>>> On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> Do you see any errors in the error* index in Elasticsearch? There are several catch blocks across the different topologies that transform errors into json objects and forward them on to the indexing topology. If you're not seeing anything in the worker logs, it's likely the errors were captured there instead.
>>>>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> No, everything is fine at the log level. Also, when I checked resource consumption on the workers, there were still plenty of resources available!
>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Seeing anything in the storm logs for the workers?
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 21, 2017 at 07:41 Ali Nazemian <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>> After trying to tune Metron's performance, I have noticed that the failure rate for the indexing/enrichment topologies is very high (about 95%). However, I can see the messages in Elasticsearch. I have tried increasing the timeout value for the acknowledgement; it didn't fix the problem. I can set the number of acker executors to 0 to temporarily fix the problem, which is not a good idea at all. Do you have any idea what could have caused such an issue? The percentage of failures decreases by reducing the parallelism, but even without any parallelism it is still high!
>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>> Ali
--
A.Nazemian
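The timeout hypothesis discussed above (tuples waiting too long to be acked and being replayed) can be sanity-checked with a quick back-of-envelope calculation. The sketch below is hypothetical: the class name, spout-task count, and ack rate are assumptions rather than values confirmed in this thread, and it treats batch flushing as a simple delay, so substitute the figures from your own Storm UI:

// Back-of-envelope check of the tuple-timeout hypothesis, using numbers
// quoted in this thread. spoutTasks and ackedPerSec are assumed values.
public class TupleTimeoutCheck {
    public static void main(String[] args) {
        int maxSpoutPending    = 1000; // topology.max.spout.pending (per spout task)
        int spoutTasks         = 4;    // e.g. one per Kafka partition (assumed)
        double ackedPerSec     = 600;  // docs/sec actually reaching Elasticsearch
        int esBatchSize        = 5000; // sensor batch size before it was dropped to 100
        int messageTimeoutSecs = 30;   // topology.message.timeout.secs

        // Tuples the spouts keep in flight before they throttle.
        double inFlight = (double) maxSpoutPending * spoutTasks;

        // Rough worst case for one tuple: wait for the queue ahead of it to be
        // acked, plus the time it sits in a partially filled indexing batch.
        double queueWaitSecs = inFlight / ackedPerSec;
        double batchFillSecs = (double) esBatchSize / ackedPerSec;
        double worstCaseSecs = queueWaitSecs + batchFillSecs;

        System.out.printf("~%.1f s queue wait + ~%.1f s batch fill = ~%.1f s worst case%n",
                queueWaitSecs, batchFillSecs, worstCaseSecs);
        System.out.println(worstCaseSecs > messageTimeoutSecs
                ? "Exceeds topology.message.timeout.secs: tuples will be replayed and counted as failures."
                : "Fits inside topology.message.timeout.secs.");
    }
}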
