Hey Howard,

Great to hear that you're looking at Spark Streaming!

> We have some in-house real-time streaming jobs written for Storm and want to 
> see the possibility of migrating to Spark Streaming in the future, as our team 
> all think Spark is a very promising technology (one platform to execute 
> both realtime & interactive jobs) with excellent documentation.
> 
> 1. If we focus on the streaming capabilities, what are the main pros/cons at 
> the current moment, is Spark streaming suitable for production use now?

Spark Streaming provides all the facilities for continuous operation, but 
master fault tolerance currently takes a bit more manual setup (in particular, 
you have to manually restart your app from a checkpoint if the master crashes). 
We plan to improve this later. As far as I know, several groups are already 
using it in production, so it's worth a try as long as you understand this 
limitation.
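To make the manual-restart point concrete, here is a rough sketch of 
checkpoint-based recovery. The checkpoint path and app name are hypothetical, 
and the exact API depends on your Spark version:

```scala
// Sketch of checkpoint-based recovery in Spark Streaming.
// On a clean start this builds a fresh context; after a driver/master
// crash, relaunching with the same checkpoint directory rebuilds the
// context (and its DStream state) from the checkpoint.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/app/checkpoints"  // hypothetical path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext("local[2]", "MyApp", Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... set up your DStreams here ...
  ssc
}

// Recovers from the checkpoint if one exists; otherwise creates anew.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

The key point is that recovery is driven by you re-running the driver; Spark 
does not (yet) restart it for you the way Storm's Nimbus failover does.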

> 2. In term of message reliability and transaction support, I assume both need 
> to rely on zookeeper, right?

As far as I know neither of them uses ZooKeeper for message reliability -- they 
implement this within the application, and maybe use ZooKeeper for leader 
election. Spark Streaming is designed to compute reliably (everything has 
"exactly-once" semantics by default) and does this using a mechanism called 
"discretized streams" 
(http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf). Storm 
only provides "at-least-once" delivery by default and does not provide fault 
tolerance for state, though Trident (built on top) gives you that and 
exactly-once semantics at the cost of using a database to do transactions. 
Storm also uses ZooKeeper for automatic leader election while Spark requires 
the manual setup of master restart as described above.

> 3. In Storm, we are using Topology/Spout/Bolt as the data model, how to 
> translate them to Spark streaming if we want to rewrite our system? Are there 
> any migration guide?

This is probably the biggest difference -- Spark Streaming only works in terms 
of high-level operations, such as map() and groupBy(), and doesn't expose a 
lower-level "nodes exchanging messages" model. You can probably take the code 
you use within a bolt and reuse it inside a map function, or whatever Spark 
operation best matches the way you want to combine data.
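For example, the per-tuple logic of a typical word-count topology (a splitter 
bolt feeding a counting bolt) collapses into a few operators on a DStream. A 
minimal sketch, with the socket source and port as stand-ins:

```scala
// The logic that lived in bolts becomes ordinary functional
// operators chained on a DStream.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999) // stand-in source
val counts = lines.flatMap(_.split(" "))            // was: splitter bolt
                  .map(word => (word, 1))           // was: emit(word, 1)
                  .reduceByKey(_ + _)               // was: counting bolt
counts.print()

ssc.start()
ssc.awaitTermination()
```

There is no official migration guide that I know of, but the mapping is 
usually: spouts become input DStreams, and each bolt's per-tuple code becomes 
the body of a map/flatMap/reduceByKey call.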

> 4. Can Spark do distributed RPC like Storm?

Since Spark Streaming is just running on top of Spark, you can actually grab 
the state of the computation at any time as an RDD and then run Spark queries 
on it. So while it's not exactly the same as the thing called distributed RPC 
in Storm, it is a way to do arbitrary parallel computations on your data, and 
it comes with a high-level API.
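A minimal sketch of what that looks like -- the source, port, and top-10 query 
here are all made-up examples:

```scala
// Each micro-batch of a DStream surfaces as an ordinary RDD, so any
// Spark operation can be run against the live state of the stream.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "AdHocQueries", Seconds(1))
val counts = ssc.socketTextStream("localhost", 9999) // stand-in source
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.foreachRDD { rdd =>
  // Hypothetical ad-hoc query: the ten highest counts in this batch
  rdd.top(10)(Ordering.by[(String, Int), Int](_._2)).foreach(println)
}

ssc.start()
ssc.awaitTermination()
```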

Matei
