Thanks Nathan, that's exactly what I meant. :-)

2015-03-10 17:45 GMT+01:00 Nathan Leung <[email protected]>:
> Storm supports custom schedulers:
> http://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
>
> On Tue, Mar 10, 2015 at 12:37 PM, Martin Illecker <[email protected]> wrote:
>
>> Curtis, I have made exactly the same observations. I have decreased the
>> max spout pending to eliminate tuple timeouts.
>> But this actually means throttling the whole topology because of one bolt
>> with a high latency! (e.g., 5 bolts with 0.1 ms latency and 1 bolt with 1 ms)
>>
>> At some point, increasing the parallelism of the high-latency bolt will
>> impact the overall performance of a worker. There has to be a better way.
>>
>> A possible solution might be to assign a bolt to a specific worker.
>> Currently, if I assume correctly, each bolt is evenly distributed among
>> multiple workers.
>> (e.g., a bolt with parallelism 10 can be executed by 5 threads on 2
>> workers or 2 threads on 5 workers)
>>
>> If a bolt could be assigned to a specific worker type, then it would be
>> possible to add more workers / nodes which exclusively execute multiple
>> threads of a high-latency bolt.
>> For example, we could have one worker that executes a high-latency bolt
>> and another worker that executes the rest of the topology.
>> So the default behavior would be to evenly distribute the bolts, but it
>> should be possible to define different worker types and assign a bolt to
>> these worker types.
>>
>> Does this make any sense?
>> And could this be an additional feature of Storm?
>>
>> 2015-03-10 16:59 GMT+01:00 Curtis Allen <[email protected]>:
>>
>>> Idan, use the Config class:
>>> https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/Config.java#L1295
>>>
>>> On Tue, Mar 10, 2015 at 9:49 AM Idan Fridman <[email protected]> wrote:
>>>
>>>> Curtis, how do you set topology.message.timeout.secs?
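To make Idan's question above concrete: a minimal sketch of setting both values through the Config class Curtis links, assuming storm-core (0.9.x, backtype.storm package) is on the classpath. The worker count, pending limit, timeout, and topology name below are placeholder values, not recommendations from the thread:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
// ... set spout and bolts on the builder ...

Config conf = new Config();
conf.setNumWorkers(4);
conf.setMaxSpoutPending(500);     // topology.max.spout.pending
conf.setMessageTimeoutSecs(120);  // topology.message.timeout.secs

StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
```

The same keys can also be set cluster-wide in storm.yaml, but per-topology settings via Config are usually what you want when only one topology has a slow bolt.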
>>>>
>>>> 2015-03-10 17:07 GMT+02:00 Curtis Allen <[email protected]>:
>>>>
>>>>> Tuning a topology that contains bolts with an unpredictable execute
>>>>> latency is extremely difficult. I've had to slow down the entire
>>>>> topology by lowering topology.max.spout.pending and raising
>>>>> topology.message.timeout.secs; otherwise tuples queue up and time out.
>>>>>
>>>>
>>>> On Tue, Mar 10, 2015 at 8:53 AM Martin Illecker <[email protected]> wrote:
>>>>
>>>>> I would be interested in a solution for high-latency bolts as well.
>>>>>
>>>>> Maybe a custom scheduler that prioritizes high-latency bolts might
>>>>> help?
>>>>> (e.g., allowing a worker to run high-latency bolts exclusively)
>>>>>
>>>>> Does anyone have a working solution for a high-throughput topology
>>>>> (x0000 tuples/sec) including an HTTPClient bolt (latency around 100 ms)?
>>>>>
>>>>> 2015-03-08 20:35 GMT+01:00 Frank Jania <[email protected]>:
>>>>>
>>>>>> I've been running Storm successfully now for a while with a fairly
>>>>>> simple topology of this form:
>>>>>>
>>>>>> spout with a stream of tweets --> bolt to check the tweet's user
>>>>>> against a cache --> bolts to do some persistence based on tweet content.
>>>>>>
>>>>>> So far that's been humming along quite well, with execute latencies in
>>>>>> the low single-digit or sub-millisecond range. Other than setting the
>>>>>> parallelism for various bolts, I've been able to run it with the
>>>>>> default topology config pretty well.
>>>>>>
>>>>>> Now I'm trying a topology of the form:
>>>>>>
>>>>>> spout with a stream of tweets --> bolt to extract the URLs in the
>>>>>> tweet --> bolt to fetch the URL and get the page's title.
>>>>>>
>>>>>> For this topology the "fetch" portion can have a much longer latency;
>>>>>> I'm seeing execute latencies in the 300-500 ms range to accommodate
>>>>>> the fetch of any of these arbitrary URLs.
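The throttling behavior Curtis and Martin describe can be illustrated outside Storm. The sketch below is not Storm source code; it only mimics what topology.max.spout.pending enforces, with a semaphore standing in for the count of in-flight (emitted but not yet acked or failed) tuples:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Illustration of max-spout-pending back-pressure: the spout may have at
// most maxPending tuples in flight; further emits are skipped until an
// ack (or fail) returns a permit. Class and method names are made up.
class MaxSpoutPendingDemo {
    private final Semaphore permits;
    private final Queue<String> inFlight = new ArrayDeque<>();

    MaxSpoutPendingDemo(int maxPending) {
        this.permits = new Semaphore(maxPending);
    }

    /** Emit only if fewer than maxPending tuples are unacked. */
    boolean tryEmit(String tuple) {
        if (!permits.tryAcquire()) {
            return false; // back-pressure: spout sits idle this cycle
        }
        inFlight.add(tuple);
        return true;
    }

    /** An ack (or fail) frees a slot for the next emit. */
    void ack() {
        inFlight.poll();
        permits.release();
    }

    public static void main(String[] args) {
        MaxSpoutPendingDemo spout = new MaxSpoutPendingDemo(2);
        System.out.println(spout.tryEmit("t1")); // true
        System.out.println(spout.tryEmit("t2")); // true
        System.out.println(spout.tryEmit("t3")); // false: 2 already pending
        spout.ack();                             // a slow bolt finally acks
        System.out.println(spout.tryEmit("t3")); // true again
    }
}
```

This is why one slow bolt throttles everything: the spout's emit rate is capped by the slowest acker in the topology, so lowering max spout pending trades throughput for fewer timeouts.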
>>>>>> I've implemented caching to avoid fetching URLs I already have
>>>>>> titles for, and I'm using socket/connection timeouts to keep fetches
>>>>>> from hanging for too long, but even still, this is going to be a
>>>>>> bottleneck.
>>>>>>
>>>>>> I've set the parallelism for the fetch bolt fairly high already, but
>>>>>> are there any best practices for configuring a topology like this
>>>>>> where at least one bolt is going to take much more time to process
>>>>>> than the rest?
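The bounded fetch Frank describes (hard connect/read timeouts plus title extraction) can be sketched as below, so a bad URL costs at most a few seconds instead of hanging an executor. The class name, timeout values, and 64 KB read cap are illustrative choices, not from the thread (requires Java 11+ for InputStream.readNBytes):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical fetch-bolt core: strict timeouts bound the worst-case
// execute latency, and only the start of the page is read since the
// <title> element almost always appears early.
class TitleFetcher {
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Pull the <title> text out of an HTML snippet, or null if absent. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Fetch a page with strict timeouts and return its title (or null). */
    static String fetchTitle(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(1000); // ms: give up on slow handshakes
        conn.setReadTimeout(2000);    // ms: give up on slow servers
        try (InputStream in = conn.getInputStream()) {
            byte[] head = in.readNBytes(64 * 1024); // cap how much we read
            return extractTitle(new String(head, StandardCharsets.UTF_8));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        System.out.println(extractTitle(
                "<html><head><title>Hello</title></head></html>"));
    }
}
```

Inside a bolt's execute(), a failed or timed-out fetch would typically still ack the tuple (emitting no title) rather than fail it, so the topology's message timeout stays reserved for genuine congestion.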
