In left field.
On Mon, Jun 9, 2014 at 4:57 PM, Dan <[email protected]> wrote:
> Where would Akka fit on the Storm/Spark spectrum?
>
> Thanks
> Dan
>
> ------------------------------
> Date: Mon, 9 Jun 2014 15:48:49 -0700
> Subject: Re: Apache Storm vs Apache Spark
> From: [email protected]
> To: [email protected]
>
> Thanks Taylor. Storm seems more flexible in its framework: it provides the
> key primitives, and the onus is on developers to fine-tune it to their QoS
> needs. On the other hand, looking at the Lambda architecture, Storm only
> fulfills the speed layer, while Spark could cover batch/speed/serving
> (Spark SQL). Based on the use case and the compromises one is willing to
> make on throughput/latency/QoS, you have to pick the right one.
>
> My simple use case is:
> a) I have a stream of orders (keyed on customer id; the source is a socket).
> b) I filter for orders from my high-value customers (I have to make sure
> the list of high-value customers is available in memory on all bolt tasks
> for fast correlation/projection): the customer id in the stream is matched
> against the customer id in the list, and the customer type must be
> platinum or gold.
> c) Count the orders/amount for the last 5 minutes, grouped by product and
> customer type.
>
> On Mon, Jun 9, 2014 at 2:27 PM, P. Taylor Goetz <[email protected]> wrote:
>
> The way I usually describe the difference is that Spark is a batch
> processing framework that also does micro-batching (Spark Streaming),
> while Storm is a stream processing framework that also does micro-batching
> (Trident). So architecturally they are very different, but have some
> similarity on the functional side.
>
> With micro-batching you can achieve higher throughput at the cost of
> increased latency. With Spark this is unavoidable. With Storm you can use
> the core API (spouts and bolts) to do one-at-a-time processing and avoid
> the inherent latency overhead imposed by micro-batching.
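[Editor's note: the use case quoted above (filter a stream of orders down to platinum/gold customers, then count orders and sum amounts over the last 5 minutes grouped by product and customer type) can be sketched framework-neutrally. This is a minimal illustration only; the names, the in-memory customer table, and the single-process design are assumptions, not code from Storm or Spark.]

```python
import time
from collections import deque, defaultdict

# Illustrative in-memory lookup table of high-value customers; in Storm this
# would be loaded into every bolt task's memory for fast correlation.
HIGH_VALUE = {"c1": "platinum", "c2": "gold"}  # customer_id -> customer_type

WINDOW_SECONDS = 5 * 60  # "last 5 minutes"

events = deque()  # (timestamp, product, customer_type, amount)

def on_order(customer_id, product, amount, now=None):
    """Filter step: keep only orders from platinum/gold customers."""
    ctype = HIGH_VALUE.get(customer_id)
    if ctype is None:
        return  # not a high-value customer; drop the tuple
    now = time.time() if now is None else now
    events.append((now, product, ctype, amount))

def window_stats(now=None):
    """Aggregate step: order count and amount total over the last 5
    minutes, grouped by (product, customer_type)."""
    now = time.time() if now is None else now
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()  # evict tuples that fell out of the window
    stats = defaultdict(lambda: [0, 0.0])  # key -> [count, total amount]
    for _, product, ctype, amount in events:
        entry = stats[(product, ctype)]
        entry[0] += 1
        entry[1] += amount
    return dict(stats)
```

In core Storm the filter and the aggregation would live in separate bolts, and the eviction loop plus the `stats` state are exactly the hand-rolled window/state code discussed later in this thread; Spark Streaming's windowed operators would express step (c) directly.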
> With Trident, you get state management out of the box, and sliding
> windows are supported as well.
>
> In terms of adoption and production deployments, Storm has been around
> longer and there are a LOT of production deployments. I'm not aware of
> that many production Spark deployments, but I'd expect that to change
> over time.
>
> In terms of performance, I can't really point to any valid comparisons.
> When I say "valid" I mean open and independently verifiable. There is one
> study that I'm aware of that claims Spark Streaming is insanely faster
> than Storm. The problem with that study is that none of the code or
> configurations used are publicly available (that I'm aware of). Without a
> way to independently verify those claims, I'd dismiss it as marketing
> fluff (the same goes for the IBM InfoStreams comparison). Storm is very
> tunable when it comes to performance, allowing it to be adapted to the
> use case at hand. However, it is also easy to cripple performance with
> the wrong config.
>
> I can personally verify that it is possible to process 1.2+ million
> (relatively small) messages per second with a 10-15 node cluster, and
> that includes writing to HBase and other components (I don't have the
> hardware specs handy, but can probably dig them up).
>
> - Taylor
>
> On Jun 9, 2014, at 4:04 PM, Rajiv Onat <[email protected]> wrote:
>
> Thanks. Not sure why you say they are different; from a stream-processing
> use-case perspective both seem to accomplish the same thing, even if the
> implementations take different approaches. If I want to aggregate and do
> stats in Storm, I would have to micro-batch the tuples at some level.
> E.g., for a count of orders in the last 1 minute, in Storm I have to
> write code for sliding windows and state management, while Spark seems to
> provide operators to accomplish that. Tuple-level operations such as
> enrichment, filters, etc. also seem doable in both.
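[Editor's note: the throughput-for-latency trade that micro-batching makes, mentioned in the quoted discussion, can be seen in a toy batcher: tuples are buffered and emitted together when either a size bound or a time bound is hit. This is an illustration of the idea only, not Spark or Storm code; all names and bounds are invented for the example.]

```python
class MicroBatcher:
    """Toy micro-batcher: buffer tuples and flush them as one batch when
    either max_size tuples have arrived or max_wait seconds have passed
    since the oldest buffered tuple. Per-tuple latency rises (by up to
    max_wait) in exchange for amortizing per-batch overhead over many
    tuples; one-at-a-time processing is the max_size == 1 degenerate case.
    Timestamps are passed in explicitly to keep the sketch deterministic."""

    def __init__(self, process_batch, max_size=100, max_wait=0.5):
        self.process_batch = process_batch
        self.max_size = max_size
        self.max_wait = max_wait
        self.buffer = []
        self.oldest = None  # arrival time of the oldest buffered tuple

    def submit(self, tup, now):
        if not self.buffer:
            self.oldest = now
        self.buffer.append(tup)
        if len(self.buffer) >= self.max_size or now - self.oldest >= self.max_wait:
            self.flush()

    def flush(self):
        if self.buffer:
            self.process_batch(self.buffer)
            self.buffer = []
```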
> On Mon, Jun 9, 2014 at 12:24 PM, Ted Dunning <[email protected]> wrote:
>
> They are different.
>
> Storm allows right-now processing of tuples. Spark Streaming requires
> micro-batching (which may cover a really short time). Spark Streaming
> allows checkpointing of partial results in the stream, supported by the
> framework; Storm says you should roll your own or use Trident.
>
> Applications that fit one like a glove are likely to bind a bit on the
> other.
>
> On Mon, Jun 9, 2014 at 12:16 PM, Rajiv Onat <[email protected]> wrote:
>
> I'm trying to figure out whether these are competing technologies for
> stream processing or complementary ones. From an initial read, both
> provide a framework for scaling stream processing; Spark has window
> constructs, and Apache Spark's Spark Streaming promises one platform for
> batch, interactive, and stream processing.
>
> Any comments or thoughts?
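[Editor's note: the point above about framework-supported checkpointing of partial results versus rolling your own can be made concrete with a toy running aggregate that periodically snapshots its state, so that after a crash it resumes from the last snapshot instead of from scratch. This is a sketch of the concept only; neither framework's API is used, and all names are illustrative.]

```python
import copy

class CheckpointedCounter:
    """Toy running aggregate that snapshots its partial results every
    `interval` tuples, mimicking framework-supported checkpointing of
    in-stream state. Tuples processed after the last snapshot are lost on
    failure and must be replayed by the source."""

    def __init__(self, interval=100):
        self.interval = interval
        self.counts = {}
        self.seen = 0
        self.checkpoint = ({}, 0)  # (snapshot of counts, tuples seen)

    def update(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.seen += 1
        if self.seen % self.interval == 0:
            # Deep-copy so later updates don't mutate the snapshot.
            self.checkpoint = (copy.deepcopy(self.counts), self.seen)

    def recover(self):
        """Restore the last snapshot after a failure; returns how many
        tuples the snapshot covers, so the source knows where to replay."""
        self.counts = copy.deepcopy(self.checkpoint[0])
        self.seen = self.checkpoint[1]
        return self.seen
```

Spark Streaming provides this kind of state checkpointing in the framework; in core Storm the application (or Trident) carries this responsibility, which is the trade-off Ted describes.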
