Hi,

Cloud Dataflow also has worker nodes and a master - though the master is part of the Cloud Dataflow service and runs on Google's internal servers. I believe all distributed data processing tools use a similar architecture.
Some major differences I can point out quickly:

- As you said, Spark uses in-memory caching of datasets. Dataflow doesn't do that, because its programming model is different (see below).

- Dataflow separates the pipeline construction stage from execution: you construct the whole pipeline and hand it to Dataflow, which optimizes the whole thing and runs it. Because of this, PCollections are merely logical nodes in the execution plan. In Spark this is a lot more blurred - collections (RDDs) can be used directly, which lets the user program interactively request their contents, making interactive and iterative computing possible. (Dataflow's support for the former is currently very young, and for the latter non-existent.) However, this comes at the cost of making whole-program optimization, monitoring, and analysis harder. In-memory caching plays a critical role in enabling this aspect of Spark, but would be much less useful in Dataflow. (A small sketch contrasting the two styles appears at the end of this message, after the quoted question.)

- Dataflow's sharding model is different. Both Spark and Dataflow split datasets into shards for parallel execution, but in Spark the set of shards is predetermined at the beginning of an operation, and its execution model critically relies on this fact. In Dataflow, the execution model only relies on the fact that "once all shards of a stage complete, the stage is done", so Dataflow can do liquid sharding (dynamic splitting of running shards). Liquid sharding, in turn, makes autoscaling possible: Dataflow can start running a stage with only a few shards and gradually subsplit them into thousands of shards running on hundreds of workers as it realizes the stage is very large. (A toy illustration of this also appears below the quoted question.)

- Dataflow's streaming engine is very different from Spark's, though I believe Spark has gotten closer. I'm not familiar enough with either to comment more; maybe someone else can.

- Dataflow's shuffle engine is also very different from Spark's, and is encapsulated as a service (the Shuffle Service). This further improves the ability to autoscale (much easier to do when no data is stored on the workers) and is faster because it can use fancy internal-only hardware and software not available on cloud workers.

Disclaimer: I've worked on Dataflow but I didn't work on Spark, so the above is biased in favor of Dataflow. People with more knowledge of Spark should be able to balance this out with more of Spark's major capabilities.

On Fri, Sep 27, 2019 at 11:42 PM dev wearebold <[email protected]> wrote:

> Hello folks!
>
> I’m trying to get a deeper understanding of how Cloud Dataflow runs our
> Beam programs.
>
> I worked with Spark a few months and I understood that you have some kind
> of cluster topology with a driver program which creates the SparkContext,
> some worker nodes and a cluster manager. Also, I know that Spark is very
> fast via it’s in-memory computing.
>
> Is it the same case for Cloud Dataflow? What are the big differences
> between them?
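P.S. Here is the small sketch I mentioned above, contrasting deferred pipeline construction in Beam/Dataflow with interactive RDD use in Spark. It's a minimal Python illustration, assuming the apache_beam and pyspark packages are installed; the data and transform labels are made up:

# Beam/Dataflow: build the whole pipeline first, then hand it to a runner.
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | 'Read' >> beam.Create(['a', 'b', 'a'])
    counts = words | 'Count' >> beam.combiners.Count.PerElement()
    # 'counts' is a PCollection: just a logical node in the plan.
    # Nothing has executed yet - the runner receives the whole graph,
    # can optimize it as a unit, and runs it when the pipeline is submitted.

# Spark: RDDs are handles the driver program can use directly.
from pyspark import SparkContext

sc = SparkContext('local', 'demo')
words = sc.parallelize(['a', 'b', 'a'])
words.cache()                  # keep the dataset in executor memory
print(words.countByValue())    # this action runs right now and returns a
                               # result, so you can inspect it and iterate

This is why in-memory caching pays off in Spark: the same cached RDD can be queried over and over from the driver, whereas a PCollection is never queried directly.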
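And the toy illustration of liquid sharding. This is not the real Dataflow or Beam API - just a sketch of the idea that a running shard can give up the unprocessed part of its range so an idle worker can pick it up:

# Toy model of static vs. liquid (dynamic) sharding over 1000 work items.

class Shard:
    """A half-open range [start, end) of work items."""
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.position = start  # next item to process

    def try_split(self):
        """Split off the unprocessed second half, if enough remains."""
        remaining = self.end - self.position
        if remaining < 2:
            return None
        mid = self.position + remaining // 2
        rest = Shard(mid, self.end)
        self.end = mid  # this shard keeps only the first half
        return rest

# Static sharding: the shard count is fixed before execution starts.
static_shards = [Shard(i * 250, (i + 1) * 250) for i in range(4)]

# Liquid sharding: start with one shard and split it while it runs,
# e.g. whenever the scheduler notices an idle worker.
running = Shard(0, 1000)
running.position = 100                  # pretend 100 items are done
new_shard = running.try_split()         # [550, 1000) goes elsewhere
print(running.start, running.end)       # 0 550
print(new_shard.start, new_shard.end)   # 550 1000

The point is that the stage's completion condition ("all shards done") doesn't depend on the original shard boundaries, so the scheduler is free to keep splitting while work is in flight - which is exactly what autoscaling needs.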
