Hi, I'm currently experimenting with Storm to figure out whether it is the right fit for my project, and I'd like to hear other users' opinions, as the tests I'm running are getting costlier and costlier (I'm now setting up a full-scale cluster and deploying topologies on it to measure deployment time).
My project relies heavily on parallel computation of map/reduce patterns. However, I'm not interested in an infinite lifespan for each of my topologies: in practice I'm working with bounded streams of data, so my topologies are relatively ephemeral. This is the main (and obvious) reason I'm doubting my approach. Still, each topology will handle tons and tons of tuples, and several Storm features would be valuable to me, which is why I'm trying to see whether my cube would somehow fit into this round hole, and how far it is from fitting:

- the clean scalability and worker flexibility (rebalancing)
- reliable message processing (acks) and exactly-once semantics
- the clean API and interfaces for setting up my topology, and how the abstractions fit together
- the apparently low overhead of these features (I haven't tested that thoroughly first hand, but I'm willing to believe the social proof)

Now, a couple of important aspects that make me think there might be hope here:

- In my primary use case, even though I'd be deploying and killing topologies frequently, the *implementation* of the spouts and bolts *wouldn't change*. The Java code essentially stays the same, except for the part where I connect the Legos together (the topology object itself, I guess), and I don't believe that part would necessarily be a big deal to deploy dynamically (is a new jar mandatory for every submission, or can we just serialize the topology and ship it?).
- I'm not interested in *modifying* a topology dynamically. If a given topology doesn't work for what I'm trying to do, I'm willing to pay the cost of deploying a new one. At some point I might look into caching the topologies that match my most frequent use cases (i.e. letting them run longer / forever), but that would just be optimizing.

So at this point I see two problems standing in the way, both resulting from this potential "misuse":

1. The boot time of a topology. Not the initial boot time that includes starting up ZooKeeper and everything, just the deployment and start-up time of each topology, i.e. the *time-to-first-tuple*.
2. High parallelism of the topologies themselves (not just the tasks, but the overhead of managing many of them): could that be a problem, say, for Nimbus? Does it scale well when topologies are constantly being started and killed?

To test the boot time, I'm working on a test where I'll programmatically deploy topologies on a large cluster and measure the time penalty (I'm mostly afraid of the jar part); see the sketch after my signature. Then I'll automate that and see whether the frequent kill/deploy cycles disturb the system.

I'm also wondering if the answer here is simply "just use Hadoop", which I haven't played with yet (in that case: are Hadoop deployment or boot times any better?).

In any case, any input or feedback (preferably based on experience) is greatly appreciated.

Dorian
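
P.S. To make the "reuse the jar, only ship the topology object" question and the timing test a bit more concrete, here is roughly the harness I have in mind. It's only a sketch against the 0.9-era backtype.storm / Thrift Nimbus API, and I haven't run it end to end yet; MySpout, MyMapBolt and MyReduceBolt are placeholders for my actual (unchanging) spout/bolt implementations.

    import java.util.Map;

    import org.json.simple.JSONValue;

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.generated.Nimbus;
    import backtype.storm.generated.StormTopology;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;
    import backtype.storm.utils.NimbusClient;
    import backtype.storm.utils.Utils;

    public class DeployTimingTest {

        // MySpout / MyMapBolt / MyReduceBolt stand in for my actual, unchanging implementations.
        private static StormTopology buildTopology() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("source", new MySpout(), 4);
            builder.setBolt("map", new MyMapBolt(), 8).shuffleGrouping("source");
            builder.setBolt("reduce", new MyReduceBolt(), 4).fieldsGrouping("map", new Fields("key"));
            return builder.createTopology();
        }

        public static void main(String[] args) throws Exception {
            Map clusterConf = Utils.readStormConfig();

            Config topologyConf = new Config();
            topologyConf.setNumWorkers(8);

            // Upload the (unchanging) jar once; "storm.jar" is the system property
            // that the `storm jar` command normally sets to point at the local jar.
            String uploadedJar = StormSubmitter.submitJar(clusterConf, System.getProperty("storm.jar"));

            Nimbus.Client nimbus = NimbusClient.getConfiguredClient(clusterConf).getClient();
            for (int i = 0; i < 20; i++) {
                String name = "ephemeral-" + i;
                long start = System.currentTimeMillis();
                // Only the serialized StormTopology + conf go over the wire here;
                // the previously uploaded jar is reused.
                nimbus.submitTopology(name, uploadedJar,
                        JSONValue.toJSONString(topologyConf), buildTopology());
                System.out.println(name + " accepted by nimbus in "
                        + (System.currentTimeMillis() - start) + " ms");
                // This only measures submission; time-to-first-tuple has to be reported
                // from inside the topology, e.g. by the first bolt logging a timestamp
                // on its first execute() call. The automated kill/redeploy churn would
                // also go in this loop.
            }
        }
    }

My understanding (to be confirmed) is that StormSubmitter.submitJar uploads the jar to Nimbus and returns its location, so subsequent submissions can reuse it and only ship the serialized topology object, which is exactly the part I want to time. If that assumption is wrong, I'd love to know.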
