Thanks for the excellent explanation of all that, much appreciated! - Gabriel

On Sun, Jan 25, 2015 at 12:51 PM, Josh Wills <[email protected]> wrote:
> You all may have seen the blog post I wrote about the Spark-based
> implementation of Google's Cloud Dataflow (CDF) API:
>
> http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/
>
> There are two very natural questions I'll try to answer in the rest of
> this email:
>
> 1) Why did I do this?
> 2) What does it mean for the Crunch project?
>
> On the first question, I personally enjoy reading and thinking about
> APIs for data processing pipelines. It's a weird hobby, but it's mine,
> and when Google came to me and said that I could get an early look at
> the successor to FlumeJava, I was more than happy to help them out and
> see how hard it would be to run what they had on top of Spark.
>
> There are two ideas in CDF that I really like: the PTransform
> abstraction and the unification of arbitrary batch and stream
> processing into a single data model and API. I'll explain them in
> terms of the Spark/Crunch/Scalding (SCS) programming models, which are
> all essentially the same: an object that represents a distributed
> collection that can be transformed into other distributed collections
> by function application.
>
> 1) PTransforms: This is a pattern I see in SCS projects all of the
> time (writing this in Scala to save on keystrokes):
>
> def count[S](input: PCollection[S]): PTable[S, Long] = {
>   input.map(s => (s, 1L)).groupByKey().sum()
> }
>
> org.apache.crunch.lib is essentially a collection of functions that
> look like this: they perform a complex transformation via a series of
> primitive transforms (pDo, GBK, union, and combineValues). This makes
> it easy for developers to write pipelines using higher-level concepts,
> but the SCS planners/executors only represent pipelines in terms of
> the primitive transforms, so we lose this higher-level information
> during pipeline execution, which can make it difficult to figure out
> the mapping between the logical pipeline and the actual MR/Spark jobs
> that are executing the code when it comes time to diagnose errors
> and/or performance bottlenecks.
>
> In CDF, you can capture the higher-level structure of your pipelines
> via PTransforms, which look like this:
>
> https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/PTransform
>
> So if I have a PTransform that is made up of other PTransforms (which
> is how the entire CDF transform library is structured), the CDF
> planner/executor knows about that hierarchical structure of the
> pipeline, and you can examine the runtime performance of your pipeline
> in terms of those higher-level concepts -- it's sort of like using a
> JVM profiler like YourKit, where you can navigate through the function
> call hierarchy to find bottlenecks. It's a very cool idea and I am
> insanely jealous I didn't think of it first. :)
>
> 2) The unification of batch and streaming: this is more interesting,
> b/c CDF is closer to Jay Kreps' kappa architecture than Nathan Marz's
> lambda architecture, in the sense that CDF streaming is capable of
> doing everything CDF batch can do and offers the same fault-tolerance
> guarantees (including only-once element processing). However, the
> batch engine is still used for pure batch pipelines b/c it is faster
> and/or less resource-intensive than the streaming engine.
>
> CDF streaming is built on top of Millwheel, which doesn't have an
> exact open-source analogue at this point, and is capable of doing some
> things that I don't think can be done in Spark Streaming, Storm, or
> Samza, b/c Millwheel assumes you have access to a KV store like
> Bigtable/HBase for maintaining per-key state and can make a
> distinction between when an event happened and when the event arrived
> for processing by the streaming engine. In particular, I believe it's
> possible for CDF to create real-time sessions where the sessionization
> criteria involve a pre-defined gap between events for the same key
> (i.e., end one session and start another if more than N seconds go by
> without any new events for a given key). The distinction between
> event time and arrival time also has some implications for handling
> late-arriving events that require CDF to have some semantics around
> windowing that don't appear to exist in the other stream processing
> engines. I'm not sure of the best strategy here, so I was thinking of
> talking to folks who work on the various streaming engines to get
> their take on how to support this functionality.
>
> In terms of what CDF means for Crunch: in the short term, I don't
> think it means anything yet. There are still lots of MapReduce-based
> data pipelines in the world that should be written in Crunch instead,
> esp. as people start moving to the next generation of execution
> engines like Spark. Spark isn't quite to the point where it can
> reliably handle multi-terabyte data pipelines the way MapReduce can,
> but it's clearly getting better quickly, and the volume and complexity
> it can handle are going up. I think that Crunch is the best way to
> help developers make the transition from MR to Spark as smoothly as
> possible.
>
> At some point, we could consider making the current version of Crunch
> 1.0 and working on a unified batch/streaming execution engine along
> the lines of the CDF API as Crunch 2.0. I believe that the unified
> batch/streaming pipeline model is the next really interesting
> challenge in pipelines, and the Crunch community would be my favorite
> place to work on it.
>
> Thanks for putting up with the long email,
> Josh
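
To make point 1 above a bit more concrete: the following is a minimal,
self-contained Scala sketch of the "named composite transform" idea, not
the actual CDF (or Crunch) API. PCollection, PTable, and the PTransform
trait here are toy in-memory stand-ins invented for illustration; the only
point is that the composite's identity ("Count") remains visible for a
planner or profiler to report against, instead of dissolving into anonymous
map/groupByKey/sum primitives.

    // Sketch only: toy stand-ins, not the real CDF or Crunch types.
    object PTransformSketch {

      // Toy in-memory versions of a distributed collection and keyed table.
      final case class PCollection[T](elems: Seq[T])
      final case class PTable[K, V](elems: Seq[(K, V)])

      // Hypothetical composite-transform abstraction: every transform has a
      // name, so a planner/profiler could attribute work to "Count" rather
      // than to an unnamed chain of primitive transforms.
      trait PTransform[In, Out] {
        def name: String
        def expand(input: In): Out
      }

      // The count() function from the email, wrapped as a named transform.
      def count[S]: PTransform[PCollection[S], PTable[S, Long]] =
        new PTransform[PCollection[S], PTable[S, Long]] {
          val name = "Count"
          def expand(input: PCollection[S]): PTable[S, Long] =
            PTable(input.elems.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }.toSeq)
        }

      def main(args: Array[String]): Unit = {
        val countWords = count[String]
        val words = PCollection(Seq("a", "b", "a", "c", "a"))
        val counts = countWords.expand(words)
        println(s"${countWords.name}: ${counts.elems.sortBy(_._1)}")
      }
    }

In the SCS model described above, count() is just a helper function and its
name is gone by planning time; attaching the name to the transform object
itself is what lets the hierarchical structure survive into execution.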

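And a rough, non-distributed sketch of the gap-based sessionization
described in point 2: sessions are defined on event time (when the event
happened), not arrival time, and a session for a key ends once more than a
pre-defined gap passes with no new events for that key. All names below are
made up for illustration; this is not CDF's windowing API.

    // Sketch only: in-memory sessionization by event-time gap per key.
    object SessionSketch {

      final case class Event(key: String, eventTimeMillis: Long)
      final case class Session(key: String, startMillis: Long, endMillis: Long, count: Int)

      // Group each key's events into sessions separated by more than gapMillis.
      def sessionize(events: Seq[Event], gapMillis: Long): Seq[Session] =
        events
          .groupBy(_.key)
          .toSeq
          .flatMap { case (key, evs) =>
            val times = evs.map(_.eventTimeMillis).sorted
            // Fold over the sorted event times, extending the open session or
            // starting a new one whenever the gap to the previous event is too large.
            times.foldLeft(List.empty[Session]) {
              case (Nil, t) =>
                List(Session(key, t, t, 1))
              case (current :: done, t) if t - current.endMillis <= gapMillis =>
                current.copy(endMillis = t, count = current.count + 1) :: done
              case (done, t) =>
                Session(key, t, t, 1) :: done
            }.reverse
          }

      def main(args: Array[String]): Unit = {
        val events = Seq(
          Event("u1", 1000L), Event("u1", 4000L),  // 3s apart: same session
          Event("u1", 60000L),                     // long gap: new session
          Event("u2", 2000L)
        )
        sessionize(events, gapMillis = 30000L).foreach(println)
      }
    }

Doing this in a real streaming engine is exactly where the MillWheel-style
requirements come in: the engine has to hold the open session as per-key
state and reason about event time rather than arrival time, including
deciding what to do when a late event arrives for a session it has already
closed.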