Thanks Martin and Nathan! Didn't know about the custom schedulers.

On Tue, Mar 10, 2015 at 10:54 AM Martin Illecker <[email protected]> wrote:
> Thanks Nathan, that's exactly what I meant. :-)
>
> 2015-03-10 17:45 GMT+01:00 Nathan Leung <[email protected]>:
>
>> Storm supports custom schedulers:
>> http://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
>>
>> On Tue, Mar 10, 2015 at 12:37 PM, Martin Illecker <[email protected]> wrote:
>>
>>> Curtis, I have made exactly the same observations. I have decreased max
>>> spout pending to eliminate tuple timeouts. But this effectively means
>>> throttling the whole topology because of one high-latency bolt (e.g.,
>>> five bolts with 0.1 ms latency and one bolt with 1 ms).
>>>
>>> At some point, increasing the parallelism of the high-latency bolt will
>>> hurt the overall performance of a worker. There has to be a better way.
>>>
>>> A possible solution might be to assign a bolt to a specific worker.
>>> Currently, if I understand correctly, each bolt is evenly distributed
>>> among multiple workers (e.g., a bolt with parallelism 10 can be executed
>>> by 5 threads on 2 workers or 2 threads on 5 workers).
>>>
>>> If a bolt could be assigned to a specific worker type, it would be
>>> possible to add more workers / nodes that exclusively execute multiple
>>> threads of a high-latency bolt. For example, we could have one worker
>>> that executes the high-latency bolt and another worker that executes the
>>> rest of the topology. So the default behavior would be to distribute the
>>> bolts evenly, but it should be possible to define different worker types
>>> and assign a bolt to them.
>>>
>>> Does this make any sense? And could this be an additional feature of Storm?
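The worker-pinning idea described above can already be prototyped with the pluggable scheduler API from the linked article. Below is a rough sketch against the Storm 0.9.x `backtype.storm.scheduler` API: it pins all executors of one bolt onto a supervisor tagged via `supervisor.scheduler.meta`, then hands the rest to the default `EvenScheduler`. The topology name `"url-title-topology"`, the component id `"fetch-bolt"`, and the `role: "fetch"` tag are assumptions for illustration; this is not a production-ready scheduler (no multi-slot spreading, no rebalancing).

```java
import java.util.List;
import java.util.Map;

import backtype.storm.scheduler.Cluster;
import backtype.storm.scheduler.EvenScheduler;
import backtype.storm.scheduler.ExecutorDetails;
import backtype.storm.scheduler.IScheduler;
import backtype.storm.scheduler.SupervisorDetails;
import backtype.storm.scheduler.Topologies;
import backtype.storm.scheduler.TopologyDetails;
import backtype.storm.scheduler.WorkerSlot;

// Sketch: place every executor of one high-latency bolt on a dedicated
// supervisor, then let the default even scheduler handle the remainder.
public class DedicatedBoltScheduler implements IScheduler {

    public void prepare(Map conf) {}

    public void schedule(Topologies topologies, Cluster cluster) {
        TopologyDetails topology = topologies.getByName("url-title-topology"); // assumed name
        if (topology != null && cluster.needsScheduling(topology)) {
            Map<String, List<ExecutorDetails>> byComponent =
                cluster.getNeedsSchedulingComponentToExecutors(topology);
            List<ExecutorDetails> fetchExecutors = byComponent.get("fetch-bolt"); // assumed id
            SupervisorDetails dedicated = findDedicatedSupervisor(cluster);
            if (fetchExecutors != null && dedicated != null) {
                List<WorkerSlot> slots = cluster.getAvailableSlots(dedicated);
                if (!slots.isEmpty()) {
                    // Bind the slow bolt's executors to one slot on the tagged node.
                    cluster.assign(slots.get(0), topology.getId(), fetchExecutors);
                }
            }
        }
        // Anything still unassigned is scheduled evenly, as Storm does by default.
        new EvenScheduler().schedule(topologies, cluster);
    }

    private SupervisorDetails findDedicatedSupervisor(Cluster cluster) {
        // The dedicated node is marked in its storm.yaml, e.g.:
        //   supervisor.scheduler.meta:
        //     role: "fetch"
        for (SupervisorDetails s : cluster.getSupervisors().values()) {
            Map meta = (Map) s.getSchedulerMeta();
            if (meta != null && "fetch".equals(meta.get("role"))) {
                return s;
            }
        }
        return null;
    }
}
```

The scheduler is activated on nimbus via `storm.scheduler: "DedicatedBoltScheduler"` in storm.yaml (the class must be on nimbus's classpath).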
>>>
>>> 2015-03-10 16:59 GMT+01:00 Curtis Allen <[email protected]>:
>>>
>>>> Idan, use the Config class:
>>>> https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/Config.java#L1295
>>>>
>>>> On Tue, Mar 10, 2015 at 9:49 AM Idan Fridman <[email protected]> wrote:
>>>>
>>>>> Curtis, how do you set topology.message.timeout.secs?
>>>>>
>>>>> 2015-03-10 17:07 GMT+02:00 Curtis Allen <[email protected]>:
>>>>>
>>>>>> Tuning a topology that contains bolts with an unpredictable execute
>>>>>> latency is extremely difficult. I've had to slow down the entire
>>>>>> topology by decreasing topology.max.spout.pending and increasing
>>>>>> topology.message.timeout.secs; otherwise tuples queue up and time out.
>>>>>>
>>>>> On Tue, Mar 10, 2015 at 8:53 AM Martin Illecker <[email protected]> wrote:
>>>>>
>>>>>> I would be interested in a solution for high-latency bolts as well.
>>>>>>
>>>>>> Maybe a custom scheduler that prioritizes high-latency bolts might
>>>>>> help? (e.g., allowing a worker to exclusively run high-latency bolts)
>>>>>>
>>>>>> Does anyone have a working solution for a high-throughput topology
>>>>>> (x0000 tuples / sec) including an HTTPClient bolt (latency around 100 ms)?
>>>>>>
>>>>>> 2015-03-08 20:35 GMT+01:00 Frank Jania <[email protected]>:
>>>>>>
>>>>>>> I've been running Storm successfully now for a while with a fairly
>>>>>>> simple topology of this form:
>>>>>>>
>>>>>>> spout with a stream of tweets --> bolt to check the tweet's user
>>>>>>> against a cache --> bolts to do some persistence based on tweet content.
>>>>>>>
>>>>>>> So far that's been humming along quite well, with execute latencies
>>>>>>> in the low single digits or sub-millisecond. Other than setting the
>>>>>>> parallelism for various bolts, I've been able to run it with the
>>>>>>> default topology config pretty well.
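To answer Idan's question concretely: both knobs that Curtis mentions have typed setters on `backtype.storm.Config` (they correspond to the `topology.max.spout.pending` and `topology.message.timeout.secs` keys). The values below are illustrative only, not recommendations; a config fragment might look like:

```java
import backtype.storm.Config;

// Illustrative values: cap in-flight tuples per spout task, and give
// slow tuples more time before they are failed and replayed.
Config conf = new Config();
conf.setMaxSpoutPending(500);     // topology.max.spout.pending
conf.setMessageTimeoutSecs(60);   // topology.message.timeout.secs

// Equivalent raw-key form, using the constants from Config:
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500);
conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 60);
```

The `Config` object is then passed to `StormSubmitter.submitTopology(...)` (or `LocalCluster.submitTopology(...)`) as usual.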
>>>>>>>
>>>>>>> Now I'm trying a topology of this form:
>>>>>>>
>>>>>>> spout with a stream of tweets --> bolt to extract the URLs in the
>>>>>>> tweet --> bolt to fetch each URL and get the page's title.
>>>>>>>
>>>>>>> For this topology the "fetch" portion can have a much longer
>>>>>>> latency: I'm seeing execute latencies in the 300-500 ms range to
>>>>>>> accommodate fetching these arbitrary URLs. I've implemented caching
>>>>>>> to avoid fetching URLs I already have titles for, and I'm using
>>>>>>> socket / connection timeouts to keep fetches from hanging for too
>>>>>>> long, but even still, this is going to be a bottleneck.
>>>>>>>
>>>>>>> I've already set the parallelism for the fetch bolt fairly high,
>>>>>>> but are there any best practices for configuring a topology like
>>>>>>> this where at least one bolt is going to take much more time to
>>>>>>> process than the rest?
>>>>>>
>>>>>>
>>>
>>
>
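One way to answer "how high is high enough" for the fetch bolt's parallelism is Little's law: the number of tuples in flight inside a bolt equals throughput times execute latency, so that product (divided by a target utilization) is a lower bound on the executor count. This is an editor's back-of-the-envelope sketch, not anything from Storm itself, and the utilization target is an assumption:

```java
// Rough bolt sizing via Little's law:
//   in-flight tuples = throughput (tuples/sec) x execute latency (sec)
// Executors needed is that in-flight count divided by a utilization target.
public class BoltSizing {

    static int executorsNeeded(double tuplesPerSec, double latencyMs, double utilization) {
        double inFlight = tuplesPerSec * (latencyMs / 1000.0);
        return (int) Math.ceil(inFlight / utilization);
    }

    public static void main(String[] args) {
        // e.g. 1,000 tuples/sec through a fetch bolt averaging 400 ms,
        // targeting 80% executor utilization:
        System.out.println(executorsNeeded(1000, 400, 0.8)); // prints 500
    }
}
```

Plugging in the earlier numbers from this thread (tens of thousands of tuples/sec at ~100 ms) gives executor counts in the thousands, which is why blocking HTTP fetch bolts are usually better served by a non-blocking/async HTTP client or by batching requests than by raw thread parallelism alone.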
