I would be interested in a solution for high-latency bolts as well. Maybe a custom scheduler that prioritizes high-latency bolts would help? (e.g., one that allows a worker to exclusively run high-latency bolts)
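Beyond scheduling, one mitigation is to keep the slow HTTP call off the calling thread entirely by handing it to a bounded pool, so a bolt's execute() returns immediately and acks from the completion callback (assuming your Storm version's OutputCollector is safe to call from another thread — check before relying on it). The sketch below shows only the pooling pattern, with no Storm dependency; the class name and fetchTitle() are hypothetical stand-ins, and the 100 ms sleep simulates the HTTP latency discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: a bounded thread pool absorbs the high-latency fetches so
// they overlap instead of serializing. In a real bolt, execute() would
// call submit() and the completion callback would emit and ack.
public class AsyncFetchSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(32);

    // Simulated slow fetch; a real implementation issues the HTTP request here.
    static String fetchTitle(String url) {
        try {
            Thread.sleep(100); // stand-in for ~100 ms of network latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "title-of-" + url;
    }

    public CompletableFuture<String> submit(String url) {
        return CompletableFuture.supplyAsync(() -> fetchTitle(url), pool);
    }

    public void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        AsyncFetchSketch s = new AsyncFetchSketch();
        long start = System.nanoTime();
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (int i = 0; i < 32; i++) {
            futures.add(s.submit("url" + i));
        }
        for (CompletableFuture<String> f : futures) {
            f.get();
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        // 32 fetches of ~100 ms each overlap on 32 threads, so the wall
        // time stays near one fetch's latency rather than 32x it.
        System.out.println("32 fetches in " + ms + " ms");
        s.shutdown();
    }
}
```

The pool size (32 here) is the knob that trades memory and open connections against throughput, much like the bolt's parallelism hint, but without paying a full executor per in-flight request.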
Does anyone have a working solution for a high-throughput topology (x0000 tuples/sec) that includes an HTTP client bolt (latency around 100 ms)?

2015-03-08 20:35 GMT+01:00 Frank Jania <[email protected]>:

> I've been running storm successfully now for a while with a fairly simple
> topology of this form:
>
> spout with a stream of tweets --> bolt to check tweet user against cache
> --> bolts to do some persistence based on tweet content.
>
> So far that's been humming along quite well with execute latencies in low
> single digit or sub millisecond. Other than setting the parallelism for
> various bolts, I've been able to run it with the default topology config
> pretty well.
>
> Now I'm trying a topology of the form:
>
> spout with a stream of tweets --> bolt to extract the urls in the tweet
> --> bolt to fetch the url and get the page's title.
>
> For this topology the "fetch" portion can have a much longer latency; I'm
> seeing execute latencies in the 300-500ms range to accommodate the fetch of
> any of these arbitrary urls. I've implemented caching to avoid fetching
> urls I already have titles for, and I'm using socket/connection timeouts to
> keep fetches from hanging for too long, but even so, this is going to be a
> bottleneck.
>
> I've set the parallelism for the fetch bolt fairly high already, but are
> there any best practices for configuring a topology like this where at
> least one bolt is going to take much more time to process than the rest?
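On the socket/connection timeouts mentioned above: for anyone following along, here is a minimal sketch of bounding both phases with java.net.HttpURLConnection. The class name and the timeout values are illustrative, not taken from the original poster's code.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch: bound both the connection setup and the read, so a
// slow or dead server cannot stall a fetch bolt indefinitely.
public class TimeoutFetch {
    public static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2000); // ms allowed to establish the TCP connection
        conn.setReadTimeout(3000);    // ms allowed to wait for data once connected
        return conn; // caller reads the body and handles SocketTimeoutException
    }
}
```

Both timeouts matter: a connect timeout alone still lets a server that accepts the connection but never responds hold the thread until the read timeout (or forever, if it is left at the default of 0, meaning infinite).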
