Where would Akka fit on the Storm/Spark spectrum?

Thanks,
Dan

Date: Mon, 9 Jun 2014 15:48:49 -0700
Subject: Re: Apache Storm vs Apache Spark
From: [email protected]
To: [email protected]
Thanks Taylor. Storm seems more flexible in terms of its framework: it provides key primitives, and the onus is on developers to fine-tune it according to their QoS needs. On the other hand, looking at the Lambda architecture, Storm only fulfills the speed layer, while Spark could cover batch/speed/serving (Spark SQL). Based on the use cases and the compromises one is willing to make on throughput/latency/QoS, I guess you have to pick the right one.

My simple use case is:
a) I have a stream of orders (keyed on customer id, source is a socket).
b) I filter for those orders that are from my high-value customers (I have to make sure I have this list of high-value customers available on all bolt tasks in memory for fast correlation/projection), so the customer id in the stream is correlated to the customer id in the list, keeping customers whose type is platinum or gold.
c) Count the orders/amount for the last 5 minutes, grouped by product and customer type.

On Mon, Jun 9, 2014 at 2:27 PM, P. Taylor Goetz <[email protected]> wrote:

The way I usually describe the difference is that Spark is a batch processing framework that also does micro-batching (Spark Streaming), while Storm is a stream processing framework that also does micro-batching (Trident). So architecturally they are very different, but have some similarity on the functional side.

With micro-batching you can achieve higher throughput at the cost of increased latency. With Spark this is unavoidable. With Storm you can use the core API (spouts and bolts) to do one-at-a-time processing and avoid the inherent latency overhead imposed by micro-batching. With Trident, you get state management out of the box, and sliding windows are supported as well.

In terms of adoption and production deployments, Storm has been around longer and there are a LOT of production deployments. I'm not aware of that many production Spark deployments, but I'd expect that to change over time.

In terms of performance, I can't really point to any valid comparisons.
When I say "valid" I mean open and independently verifiable. There is one study that I'm aware of that claims Spark Streaming is insanely faster than Storm. The problem with that study is that none of the code or configurations used are publicly available (that I'm aware of). So without a way to independently verify those claims, I'd dismiss it as marketing fluff (the same goes for the IBM InfoStreams comparison).

Storm is very tunable when it comes to performance, allowing it to be tuned to the use case at hand. However, it is also easy to cripple performance with the wrong config. I can personally verify that it is possible to process 1.2+ million (relatively small) messages per second with a 10-15 node cluster, and that includes writing to HBase and other components (I don't have the hardware specs handy, but can probably dig them up).

- Taylor

On Jun 9, 2014, at 4:04 PM, Rajiv Onat <[email protected]> wrote:

Thanks. Not sure why you say it is different; from a stream processing use case perspective both seem to accomplish the same thing, while the implementations may take different approaches. If I want to aggregate and do stats in Storm, I would have to micro-batch the tuples at some level. E.g., for a count of orders in the last 1 minute, in Storm I have to write code for sliding windows and state management, while Spark seems to provide operators to accomplish that. Tuple-level operations such as enrichment, filters, etc. seem doable in both.

On Mon, Jun 9, 2014 at 12:24 PM, Ted Dunning <[email protected]> wrote:

They are different. Storm allows right-now processing of tuples. Spark Streaming requires micro-batching (which may be over a really short time). Spark Streaming allows checkpointing of partial results in the stream, supported by the framework. Storm says you should roll your own or use Trident. Applications that fit one like a glove are likely to bind a bit on the other.
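To make the "write code for sliding windows and state management" point concrete: here is a minimal, framework-free Python sketch of a hand-rolled sliding count, the kind of state a core-API Storm bolt carries itself and that Trident or Spark Streaming's window operators provide out of the box. All names here are invented for illustration; this is not Storm or Spark API code.

```python
from collections import deque

class SlidingWindowCounter:
    """Hand-rolled sliding count over the last `window_seconds`.

    Illustrative only: in a core-API Storm bolt, this state lives in
    the bolt instance and the developer is responsible for eviction
    and recovery; Trident/Spark Streaming offer window operators.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, key) pairs, oldest first

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] >= self.window_seconds:
            self.events.popleft()

    def count(self, key):
        return sum(1 for _, k in self.events if k == key)

counter = SlidingWindowCounter(60)   # "count of orders in the last 1 minute"
counter.add(0, "orders")
counter.add(30, "orders")
counter.add(65, "orders")            # the event at t=0 falls out of the window
print(counter.count("orders"))       # 2
```

Note that in a real topology this state would also need checkpointing or replay to survive failures, which is exactly the part Ted says the core Storm framework leaves to you unless you use Trident.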
On Mon, Jun 9, 2014 at 12:16 PM, Rajiv Onat <[email protected]> wrote:

I'm trying to figure out whether these are competitive technologies for stream processing or complementary. From an initial read, in terms of stream processing capabilities both provide a framework for scaling, and Spark has window constructs; Apache Spark has Spark Streaming and promises one platform for batch, interactive, and stream processing. Any comments or thoughts?
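As an aside on the order-filtering use case described further up the thread (steps a-c: filter by an in-memory high-value customer list, then count orders and sum amounts per product and customer type over a window), the per-tuple logic itself is small regardless of which framework hosts it. A minimal Python sketch, with all data and names invented for illustration (not Storm or Spark API code):

```python
from collections import defaultdict

# Illustrative in-memory reference data: the list the use case says must be
# available on every bolt task for fast correlation/projection.
HIGH_VALUE = {"c1": "platinum", "c2": "gold", "c3": "silver"}

def is_high_value(customer_id):
    # Step (b): keep only platinum and gold customers.
    return HIGH_VALUE.get(customer_id) in ("platinum", "gold")

def aggregate(orders):
    # Step (c): count orders and sum amounts over one window's worth of
    # tuples, grouped by (product, customer_type).
    stats = defaultdict(lambda: [0, 0.0])  # key -> [count, amount]
    for customer_id, product, amount in orders:
        if is_high_value(customer_id):
            key = (product, HIGH_VALUE[customer_id])
            stats[key][0] += 1
            stats[key][1] += amount
    return dict(stats)

# One window's worth of (customer_id, product, amount) tuples.
window = [("c1", "laptop", 1200.0), ("c2", "laptop", 900.0),
          ("c3", "phone", 300.0), ("c1", "phone", 600.0)]
print(aggregate(window))
```

In Storm this would map to a filter bolt plus an aggregating bolt holding the window state; in Spark Streaming, to a filter followed by a windowed reduce-by-key over 5-minute windows.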
