I've been running Storm successfully for a while now with a fairly simple
topology of this form:

spout with a stream of tweets --> bolt to check tweet user against cache
--> bolts to do some persistence based on tweet content.

So far that's been humming along quite well, with execute latencies in the
low single-digit or sub-millisecond range. Other than setting the
parallelism for the various bolts, I've been able to run it with the
default topology config pretty well.

Now I'm trying a topology of the form:

spout with a stream of tweets --> bolt to extract the urls in the tweet -->
bolt to fetch the url and get the page's title.

For this topology the "fetch" portion can have a much longer latency: I'm
seeing execute latencies in the 300-500ms range to accommodate the fetch of
these arbitrary urls. I've implemented caching to avoid fetching urls I
already have titles for, and I'm using socket/connection timeouts to keep
fetches from hanging for too long, but even so, this is going to be a
bottleneck.
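For concreteness, here's a minimal sketch of the caching and timeout approach described above. The class name, cache size, and timeout values are all illustrative assumptions, not details from my actual bolt:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper mirroring the approach in the fetch bolt:
// an LRU cache of url -> title, plus bounded connection/read timeouts.
public class TitleFetcher {
    private static final int MAX_ENTRIES = 10_000;

    // Access-ordered LinkedHashMap evicting the least-recently-used
    // entry once the cache exceeds MAX_ENTRIES.
    private final Map<String, String> titleCache =
        new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    String cachedTitle(String url) {
        return titleCache.get(url);
    }

    void cacheTitle(String url, String title) {
        titleCache.put(url, title);
    }

    // Configure a connection so one slow host can't hang the bolt;
    // the specific timeout values here are just examples.
    static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2_000);  // fail fast if the host won't accept
        conn.setReadTimeout(3_000);     // bound the time spent reading the page
        return conn;
    }
}
```

With this, a url already in the cache skips the network entirely, and any actual fetch is bounded to a few seconds at worst.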

I've already set the parallelism for the fetch bolt fairly high, but are
there any best practices for configuring a topology like this, where at
least one bolt takes much more time to process tuples than the rest?
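To show what I mean by "parallelism set fairly high," here's a rough sketch of the wiring, assuming a pre-Apache Storm release (backtype.storm packages); the bolt/spout class names and all the counts are made up for illustration:

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweets", new TweetSpout(), 1);
builder.setBolt("extract", new UrlExtractBolt(), 4)
       .shuffleGrouping("tweets");
// The slow fetch bolt gets far more executors than the others,
// so many fetches can be in flight at once.
builder.setBolt("fetch", new TitleFetchBolt(), 64)
       .shuffleGrouping("extract");

Config conf = new Config();
// Cap un-acked tuples from the spout (topology.max.spout.pending)
// so the slow bolt's queue doesn't get flooded.
conf.setMaxSpoutPending(1000);
```

Is cranking up the executor count like this the right lever, or is there a better pattern for a topology with one disproportionately slow stage?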
