Where would Akka fit on the Storm/Spark spectrum?

Thanks,
Dan

Date: Mon, 9 Jun 2014 15:48:49 -0700
Subject: Re: Apache Storm vs Apache Spark
From: [email protected]
To: [email protected]
Thanks Taylor. Storm seems more flexible in terms of its framework: it provides key primitives, and the onus is on developers to fine-tune it according to their QoS needs. On the other hand, looking at the Lambda architecture, Storm only fulfills the speed layer, while Spark could cover batch/speed/serving (Spark SQL). Based on the use cases and the compromises one is willing to make on throughput/latency/QoS, I guess you have to pick the right one.

My simple use case is:
a) I have a stream of orders (keyed on customer id, source is a socket).
b) I filter for those orders that are from my high-value customers (I have to make sure I have this list of high-value customers available on all bolt tasks in memory for fast correlation/projection), so the customer id in the stream is correlated to the customer id in the list, keeping customers whose type is platinum or gold.
c) Count the orders/amount for the last 5 minutes, grouped by product and customer type.

On Mon, Jun 9, 2014 at 2:27 PM, P. Taylor Goetz <[email protected]> wrote:

The way I usually describe the difference is that Spark is a batch processing framework that also does micro-batching (Spark Streaming), while Storm is a stream processing framework that also does micro-batching (Trident). So architecturally they are very different, but have some similarity on the functional side.

With micro-batching you can achieve higher throughput at the cost of increased latency. With Spark this is unavoidable. With Storm you can use the core API (spouts and bolts) to do one-at-a-time processing and avoid the inherent latency overhead imposed by micro-batching. With Trident, you get state management out of the box, and sliding windows are supported as well.

In terms of adoption and production deployments, Storm has been around longer and there are a LOT of production deployments. I'm not aware of that many production Spark deployments, but I'd expect that to change over time.

In terms of performance, I can't really point to any valid comparisons.
When I say "valid" I mean open and independently verifiable. There is one study that I'm aware of that claims Spark Streaming is insanely faster than Storm. The problem with that study is that none of the code or configurations used are publicly available (that I'm aware of). So without a way to independently verify those claims, I'd dismiss it as marketing fluff (the same goes for the IBM InfoStreams comparison).

Storm is very tunable when it comes to performance, allowing it to be tuned to the use case at hand. However, it is also easy to cripple performance with the wrong config. I can personally verify that it is possible to process 1.2+ million (relatively small) messages per second with a 10-15 node cluster, and that includes writing to HBase and other components (I don't have the hardware specs handy, but can probably dig them up).

- Taylor

On Jun 9, 2014, at 4:04 PM, Rajiv Onat <[email protected]> wrote:

Thanks. Not sure why you say it is different; from a stream processing use case perspective both seem to accomplish the same thing, while the implementations may take different approaches. If I want to aggregate and do stats in Storm, I would have to micro-batch the tuples at some level. E.g., for a count of orders in the last 1 minute, in Storm I have to write code for sliding windows and state management, while Spark seems to provide operators to accomplish that. Tuple-level operations such as enrichment, filters, etc. seem doable in both.

On Mon, Jun 9, 2014 at 12:24 PM, Ted Dunning <[email protected]> wrote:

They are different. Storm allows right-now processing of tuples. Spark Streaming requires micro-batching (which may be over a really short time). Spark Streaming allows checkpointing of partial results in the stream, supported by the framework. Storm says you should roll your own or use Trident. Applications that fit one like a glove are likely to bind a bit on the other.
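To make the "write code for sliding windows and state management" point concrete: here is a minimal, framework-free Python sketch of a hand-rolled sliding count, the kind of state a core-API Storm bolt carries itself and that Trident or Spark Streaming's window operators provide out of the box. All names here are invented for illustration; this is not Storm or Spark API code.

```python
from collections import deque

class SlidingWindowCounter:
    """Hand-rolled sliding count over the last `window_seconds`.

    Illustrative only: in a core-API Storm bolt, this state lives in
    the bolt instance and the developer is responsible for eviction
    and recovery; Trident/Spark Streaming offer window operators.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, key) pairs, oldest first

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] >= self.window_seconds:
            self.events.popleft()

    def count(self, key):
        return sum(1 for _, k in self.events if k == key)

counter = SlidingWindowCounter(60)   # "count of orders in the last 1 minute"
counter.add(0, "orders")
counter.add(30, "orders")
counter.add(65, "orders")            # the event at t=0 falls out of the window
print(counter.count("orders"))       # 2
```

Note that in a real topology this state would also need checkpointing or replay to survive failures, which is exactly the part Ted says the core Storm framework leaves to you unless you use Trident.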
On Mon, Jun 9, 2014 at 12:16 PM, Rajiv Onat <[email protected]> wrote:

I'm trying to figure out whether these are competitive technologies for stream processing or complementary. From an initial read, in terms of stream processing capabilities both provide a framework for scaling, and Spark has window constructs; Apache Spark has Spark Streaming and promises one platform for batch, interactive, and stream processing. Any comments or thoughts?
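As an aside on the order-filtering use case described further up the thread (steps a-c: filter by an in-memory high-value customer list, then count orders and sum amounts per product and customer type over a window), the per-tuple logic itself is small regardless of which framework hosts it. A minimal Python sketch, with all data and names invented for illustration (not Storm or Spark API code):

```python
from collections import defaultdict

# Illustrative in-memory reference data: the list the use case says must be
# available on every bolt task for fast correlation/projection.
HIGH_VALUE = {"c1": "platinum", "c2": "gold", "c3": "silver"}

def is_high_value(customer_id):
    # Step (b): keep only platinum and gold customers.
    return HIGH_VALUE.get(customer_id) in ("platinum", "gold")

def aggregate(orders):
    # Step (c): count orders and sum amounts over one window's worth of
    # tuples, grouped by (product, customer_type).
    stats = defaultdict(lambda: [0, 0.0])  # key -> [count, amount]
    for customer_id, product, amount in orders:
        if is_high_value(customer_id):
            key = (product, HIGH_VALUE[customer_id])
            stats[key][0] += 1
            stats[key][1] += amount
    return dict(stats)

# One window's worth of (customer_id, product, amount) tuples.
window = [("c1", "laptop", 1200.0), ("c2", "laptop", 900.0),
          ("c3", "phone", 300.0), ("c1", "phone", 600.0)]
print(aggregate(window))
```

In Storm this would map to a filter bolt plus an aggregating bolt holding the window state; in Spark Streaming, to a filter followed by a windowed reduce-by-key over 5-minute windows.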
