I have noticed that if I decrease the spout parallelism, the failure rate drops significantly! The interesting part is that throughput stays almost the same. I had been setting the spout parallelism equal to the number of partitions of the corresponding Kafka topic, which I thought was the best practice for spout parallelism. With the spout parallelism set to half the number of partitions, the failure rate dropped to zero! I suspect this happened because Kafka and Storm are collocated on the same host. I was using 1 acker executor per worker.
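For anyone following along, the setup described above maps to something like the following in Storm's Java API. This is only a minimal, illustrative sketch: the partition count, worker count, and component names are made up, and TestWordSpout stands in for the real Kafka spout; it is not Metron's actual topology code.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;

    public class SpoutParallelismSketch {
        public static void main(String[] args) throws Exception {
            int kafkaPartitions = 10;                    // partition count of the input topic (made up here)
            int spoutParallelism = kafkaPartitions / 2;  // half the partitions, the ratio that removed the failures
            int workers = 2;                             // hypothetical worker count

            TopologyBuilder builder = new TopologyBuilder();
            // TestWordSpout stands in for the real Kafka spout; only the parallelism hint matters for this sketch.
            builder.setSpout("kafkaSpout", new TestWordSpout(), spoutParallelism);

            Config conf = new Config();
            conf.setNumWorkers(workers);
            conf.setNumAckers(workers);   // topology.acker.executors: one acker executor per worker

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("parallelism-sketch", conf, builder.createTopology());
            Thread.sleep(10_000);
            cluster.shutdown();
        }
    }

Since each Kafka partition is consumed by at most one spout executor, parallelism above the partition count only adds idle executors; dropping below it, as above, makes each executor read several partitions, which can ease contention when Kafka and Storm share a host.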
On Sat, Apr 22, 2017 at 11:55 PM, Casey Stella <[email protected]> wrote:

So what is perplexing is that the latency is low and the capacity for each bolt is less than 1, so it's keeping up. I would have expected this kind of thing if the latency was high and timeouts were happening.

If you drop the spout pending config lower, do you get to a point with no errors (at an obvious cost to throughput)? Also, how many ackers are you running?

On Sat, Apr 22, 2017 at 00:50, Ali Nazemian <[email protected]> wrote:

I have disabled the reliability retry by setting the number of acker executors to zero. Based on the number of tuples emitted on the indexing topology and the number of documents in Elasticsearch, there are almost no missing documents. It seems that for some reason the acker executors cannot pick up the acknowledgements for the indexing and enrichment topologies, even though the data shows up at the destination of those topologies.

I am also wondering what the best approach would be to find the failed tuples. I thought I could find them in the corresponding error topics, but that does not seem to be the case.

On Sat, Apr 22, 2017 at 2:36 PM, Ali Nazemian <[email protected]> wrote:

Does the following fact ring any bell?

There are no failures at the bolt-level acknowledgement, but from the topology status the failure rate is very high! This is the same scenario for both the indexing and enrichment topologies.

On Sat, Apr 22, 2017 at 2:29 PM, Ali Nazemian <[email protected]> wrote:

The value for topology.max.spout.pending is currently 1000. I decreased it previously to understand the effect of that value on my problem. Throughput clearly dropped, but there was still a very high rate of failure!

On Sat, Apr 22, 2017 at 3:12 AM, Casey Stella <[email protected]> wrote:

Ok, so ignoring the indexing topology, the fact that you're seeing failures in the enrichment topology, which has no ES component, is telling. It's also telling that the enrichment topology stats are perfectly sensible latency-wise (i.e. it's not sweating).

What's your Storm configuration for topology.max.spout.pending? If it's not set, then try setting it to 1000 and bouncing the topologies.

On Fri, Apr 21, 2017 at 12:54 PM, Ali Nazemian <[email protected]> wrote:

No, nothing ...

On Sat, Apr 22, 2017 at 2:46 AM, Casey Stella <[email protected]> wrote:

Anything going on in the Kafka broker logs?

On Fri, Apr 21, 2017 at 12:24 PM, Ali Nazemian <[email protected]> wrote:

Although this is a test platform with a much lower spec than production, it should be enough for indexing 600 docs per second. I have seen benchmark results of 150-200k docs per second with this spec! I haven't played with tuning the template yet, but I still think the current rate does not make sense at all.

I have changed the batch size to 100. Throughput has dropped, but there is still a very high rate of failure!

Please find the screenshots for the enrichments:
http://imgur.com/a/ceC8f
http://imgur.com/a/sBQwM
On Sat, Apr 22, 2017 at 2:08 AM, Casey Stella <[email protected]> wrote:

Ok, yeah, those latencies are pretty high. I think what's happening is that the tuples aren't being acked fast enough and are timing out. How taxed is your ES box? Can you drop the batch size down to maybe 100 and see what happens?

On Fri, Apr 21, 2017 at 12:05 PM, Ali Nazemian <[email protected]> wrote:

Please find the bolt part of the Storm UI related to the indexing topology:

http://imgur.com/a/tFkmO

As you can see, an HDFS error has also appeared, which is not important right now.

On Sat, Apr 22, 2017 at 1:59 AM, Casey Stella <[email protected]> wrote:

What's curious is the enrichment topology showing the same issues, but my mind went to ES as well.

On Fri, Apr 21, 2017 at 11:57 AM, Ryan Merriman <[email protected]> wrote:

Yes, which bolt is reporting all those failures? My theory is that there is some ES tuning that needs to be done.

On Fri, Apr 21, 2017 at 10:53 AM, Casey Stella <[email protected]> wrote:

Could I see a little more of that screen? Specifically what the bolts look like.

On Fri, Apr 21, 2017 at 11:51 AM, Ali Nazemian <[email protected]> wrote:

Please find the Storm UI screenshot as follows.

http://imgur.com/FhIrGFd

On Sat, Apr 22, 2017 at 1:41 AM, Ali Nazemian <[email protected]> wrote:

Hi Casey,

- topology.message.timeout: it was 30s at first. I have increased it to 300s, no change!
- It is a very basic geo enrichment and a simple rule for threat triage!
- No, not at all.
- I have changed that to find the best value. It is 5000, which comes to about 5 MB.
- I have changed the number of executors for the Storm acker thread, and I have also changed the value of topology.max.spout.pending, still no change!

On Sat, Apr 22, 2017 at 1:24 AM, Casey Stella <[email protected]> wrote:

Also,
* What's your setting for topology.message.timeout?
* You said you're seeing this in indexing and enrichment; what enrichments do you have in place?
* Is ES being taxed heavily?
* What's your ES batch size for the sensor?

On Fri, Apr 21, 2017 at 10:46 AM, Casey Stella <[email protected]> wrote:

So you're seeing failures in the storm topology but no errors in the logs. Would you mind sending over a screenshot of the indexing topology from the Storm UI? You might not be able to paste the image on the mailing list, so maybe an imgur link would be in order.

Thanks,

Casey
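As a side note for readers, the knobs discussed above (topology.message.timeout, topology.max.spout.pending, the acker executors) can also be set programmatically through Storm's Config. A minimal sketch with the values reported in this thread, not Metron's actual submission code:

    import org.apache.storm.Config;

    public class TimeoutSketch {
        /** Builds a Config with the tuning values reported in this thread. */
        public static Config tunedConfig() {
            Config conf = new Config();
            conf.setMessageTimeoutSecs(300); // topology.message.timeout.secs; Storm's default is 30
            conf.setMaxSpoutPending(1000);   // topology.max.spout.pending; caps unacked tuples per spout task
            conf.setNumAckers(1);            // topology.acker.executors; 0 disables tuple tracking entirely
            return conf;
        }
    }

In Metron these values are typically supplied through the topology properties/Flux files it ships with rather than in Java code, so treat this purely as a reference for what the setting names mean.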
On Fri, Apr 21, 2017 at 10:34 AM, Ali Nazemian <[email protected]> wrote:

Hi Ryan,

No, I cannot see any errors inside the indexing error topic. Also, the number of tuples emitted and transferred to the error indexing bolt is zero!

On Sat, Apr 22, 2017 at 12:29 AM, Ryan Merriman <[email protected]> wrote:

Do you see any errors in the error* index in Elasticsearch? There are several catch blocks across the different topologies that transform errors into JSON objects and forward them on to the indexing topology. If you're not seeing anything in the worker logs, it's likely the errors were captured there instead.

Ryan

On Fri, Apr 21, 2017 at 9:19 AM, Ali Nazemian <[email protected]> wrote:

No, everything is fine at the log level. Also, when I checked resource consumption on the workers, there were still plenty of resources available!

On Fri, Apr 21, 2017 at 10:04 PM, Casey Stella <[email protected]> wrote:

Seeing anything in the storm logs for the workers?

On Fri, Apr 21, 2017 at 07:41, Ali Nazemian <[email protected]> wrote:

Hi all,

After I tried to tune Metron's performance, I noticed that the failure rate for the indexing/enrichment topologies is very high (about 95%). However, I can see the messages in Elasticsearch. I have tried increasing the timeout value for the acknowledgement; it didn't fix the problem. I can set the number of acker executors to 0 to temporarily fix the problem, which is not a good idea at all. Do you have any idea what could have caused such an issue? The percentage of failures decreases by reducing the parallelism, but even without any parallelism it is still high!
Cheers,
Ali

--
A.Nazemian
