Hello, thanks for this explanation =) J
On 2019/09/30 16:59:54, Eugene Kirpichov <[email protected]> wrote:
> Hi,
>
> Cloud Dataflow also has worker nodes and a master - though the master is
> part of the Cloud Dataflow service and runs on Google's internal servers.
> I believe all distributed data processing tools use a similar architecture.
>
> Some major differences I can point out quickly:
> - As you said, Spark uses in-memory caching of datasets. Dataflow doesn't
> do that, because its programming model is different (see below).
> - Dataflow separates the pipeline construction stage from execution - you
> construct the whole pipeline and give it to Dataflow; Dataflow optimizes
> the whole thing and runs it. Because of this, PCollections are merely
> logical nodes in the execution plan. In Spark this is a lot more blurred -
> collections (RDDs) can be used directly. This allows the user program to
> interactively request their contents, making interactive and iterative
> computing possible (Dataflow currently has very young support for the
> former and non-existent support for the latter) - however, it comes at
> the expense of difficulty doing whole-program optimization / monitoring /
> analysis. In-memory caching plays a critical role in enabling this aspect
> of Spark, but would not be so useful in Dataflow.
> - Dataflow's sharding model is different. Both Spark and Dataflow split
> datasets into shards for parallel execution, but in the case of Spark the
> set of shards is predetermined at the beginning of an operation and its
> execution model critically relies on this fact, whereas in Dataflow the
> execution model only relies on the fact "once all shards of a stage
> complete, the stage is done", so Dataflow can do liquid sharding (dynamic
> splitting of running shards). Liquid sharding, in turn, makes autoscaling
> possible, e.g. Dataflow can start running a stage with only a few shards
> and gradually subsplit them into thousands of shards running on hundreds
> of workers as it realizes that the stage is very large.
> - Dataflow's streaming engine is very different from Spark's, though I
> believe Spark has gotten closer. I'm not familiar enough with either to
> comment more; maybe someone else can.
> - Dataflow's shuffle engine is also very different from Spark's, and is
> encapsulated as a service (the Shuffle Service), which further improves
> the ability to do autoscaling (much easier when no data is stored on the
> workers) and is faster because it can use fancy internal-only hardware
> and software not available on cloud workers.
>
> Disclaimer: I've worked on Dataflow but I didn't work on Spark, so the
> above is biased in favor of Dataflow. People with more knowledge of Spark
> should be able to balance this out with more of Spark's major capabilities.
>
> On Fri, Sep 27, 2019 at 11:42 PM dev wearebold <[email protected]>
> wrote:
>
> > Hello folks!
> >
> > I’m trying to get a deeper understanding of how Cloud Dataflow runs our
> > Beam programs.
> >
> > I worked with Spark for a few months and I understood that you have some
> > kind of cluster topology with a driver program which creates the
> > SparkContext, some worker nodes and a cluster manager. Also, I know that
> > Spark is very fast via its in-memory computing.
> >
> > Is it the same case for Cloud Dataflow? What are the big differences
> > between them?
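
To make the construction-vs-execution point concrete for other readers, here is a minimal Apache Beam sketch using the Python SDK (the transform labels and sample data are just illustrative). Building the pipeline only assembles the execution plan; nothing is computed until the runner receives the whole graph:

    import apache_beam as beam

    # Construction phase: each step below only adds a logical node to the plan.
    with beam.Pipeline() as pipeline:  # run() happens implicitly when the block exits
        (
            pipeline
            | "Create" >> beam.Create(["hello", "dataflow"])  # a PCollection: a plan node
            | "Length" >> beam.Map(len)                       # still nothing computed
            | "Print" >> beam.Map(print)                      # executes only at run time
        )

Only at run() does a runner such as DataflowRunner see the complete plan, which is what lets it optimize (e.g. fuse) the whole thing before any data moves.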

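For contrast, a PySpark sketch of the interactive style described above (assuming a local master; names are illustrative). The RDD is used directly, and cache() keeps it in memory so repeated actions reuse the materialized data:

    from pyspark import SparkContext

    sc = SparkContext("local", "cache-demo")  # the driver program creates the SparkContext
    words = sc.parallelize(["hello", "spark"]).cache()  # cached in memory on first action

    print(words.count())             # action: computes the RDD and populates the cache
    print(words.map(len).collect())  # second action reuses the cached partitions
    sc.stop()

This direct, eager access is what makes interactive and iterative use natural in Spark, and it is exactly where the in-memory caching pays off.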