Thanks for the excellent explanation of all that, much appreciated! - Gabriel

On Sun, Jan 25, 2015 at 12:51 PM, Josh Wills <[email protected]> wrote:
> You all may have seen the blog post I wrote about the Spark-based
> implementation of Google's Cloud Dataflow (CDF) API:
>
> http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/
>
> There are two very natural questions I'll try to answer in the rest of
> this email:
>
> 1) Why did I do this?
> 2) What does it mean for the Crunch project?
>
> On the first question, I personally enjoy reading and thinking about
> APIs for data processing pipelines. It's a weird hobby, but it's mine,
> and when Google came to me and said that I could get an early look at
> the successor to FlumeJava, I was more than happy to help them out and
> see how hard it would be to run what they had on top of Spark.
>
> There are two ideas in CDF that I really like: the PTransform
> abstraction and the unification of arbitrary batch and stream
> processing into a single data model and API. I'll explain them in
> terms of the Spark/Crunch/Scalding (SCS) programming models, which are
> all essentially the same: an object that represents a distributed
> collection that can be transformed into other distributed collections
> by function application.
>
> 1) PTransforms: This is a pattern I see in SCS projects all of the
> time (writing this in Scala to save on keystrokes):
>
> def count[S](input: PCollection[S]): PTable[S, Long] = {
>   input.map(s => (s, 1L)).groupByKey().sum()
> }
>
> org.apache.crunch.lib is essentially a collection of functions that
> look like this: they perform a complex transformation via a series of
> primitive transforms (pDo, GBK, union, and combineValues). This makes
> it easy for developers to write pipelines using higher-level concepts,
> but the SCS planners/executors only represent pipelines in terms of
> the primitive transforms, so we lose this higher-level information
> during pipeline execution, which can make it difficult to figure out
> the mapping between the logical pipeline and the actual MR/Spark jobs
> that are executing the code when it comes time to diagnose errors
> and/or performance bottlenecks.
>
> In CDF, you can capture the higher-level structure of your pipelines
> via PTransforms, which look like this:
>
> https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/PTransform
>
> So if I have a PTransform that is made up of other PTransforms (which
> is how the entire CDF transform library is structured), the CDF
> planner/executor knows about that hierarchical structure of the
> pipeline, and you can examine the runtime performance of your pipeline
> in terms of those higher-level concepts -- it's sort of like using a
> JVM profiler like YourKit, where you can navigate through the function
> call hierarchy to find bottlenecks. It's a very cool idea and I am
> insanely jealous I didn't think of it first. :)
>
> 2) The unification of batch and streaming: this is more interesting,
> b/c CDF is closer to Jay Kreps' kappa architecture than Nathan Marz's
> lambda architecture, in the sense that CDF streaming is capable of
> doing everything CDF batch can do and offers the same fault-tolerance
> guarantees (including only-once element processing). However, the
> batch engine is still used for pure batch pipelines b/c it is faster
> and/or less resource-intensive than the streaming engine.
>
> CDF streaming is built on top of Millwheel, which doesn't have an
> exact open-source analogue at this point, and is capable of doing some
> things that I don't think can be done in Spark Streaming, Storm, or
> Samza, b/c Millwheel assumes you have access to a KV store like
> Bigtable/HBase for maintaining per-key state and can make a
> distinction between when an event happened and when the event arrived
> for processing by the streaming engine. In particular, I believe it's
> possible for CDF to create real-time sessions where the sessionization
> criteria involve a pre-defined gap between events for the same key
> (i.e., end one session and start another if more than N seconds go by
> without any new events for a given key). The distinction between
> event time and arrival time also has some implications for handling
> late-arriving events that require CDF to have some semantics around
> windowing that don't appear to exist in the other stream processing
> engines. I'm not sure of the best strategy here, so I was thinking of
> talking to folks who work on the various streaming engines to get
> their take on how to support this functionality.
>
> In terms of what CDF means for Crunch: in the short term, I don't
> think it means anything yet. There are still lots of MapReduce-based
> data pipelines in the world that should be written in Crunch instead,
> esp. as people start moving to the next generation of execution
> engines like Spark. Spark isn't quite to the point where it can
> reliably handle multi-terabyte data pipelines the way MapReduce can,
> but it's clearly getting better quickly, and the volume and complexity
> it can handle are going up. I think that Crunch is the best way to
> help developers make the transition from MR to Spark as smoothly as
> possible.
>
> At some point, we could consider making the current version of Crunch
> 1.0 and working on a unified batch/streaming execution engine along
> the lines of the CDF API as Crunch 2.0. I believe that the unified
> batch/streaming pipeline model is the next really interesting
> challenge in pipelines, and the Crunch community would be my favorite
> place to work on it.
>
> Thanks for putting up with the long email,
> Josh
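
To make point 1 above a bit more concrete: the following is a minimal,
self-contained Scala sketch of the "named composite transform" idea, not
the actual CDF (or Crunch) API. PCollection, PTable, and the PTransform
trait here are toy in-memory stand-ins invented for illustration; the only
point is that the composite's identity ("Count") remains visible for a
planner or profiler to report against, instead of dissolving into anonymous
map/groupByKey/sum primitives.

    // Sketch only: toy stand-ins, not the real CDF or Crunch types.
    object PTransformSketch {

      // Toy in-memory versions of a distributed collection and keyed table.
      final case class PCollection[T](elems: Seq[T])
      final case class PTable[K, V](elems: Seq[(K, V)])

      // Hypothetical composite-transform abstraction: every transform has a
      // name, so a planner/profiler could attribute work to "Count" rather
      // than to an unnamed chain of primitive transforms.
      trait PTransform[In, Out] {
        def name: String
        def expand(input: In): Out
      }

      // The count() function from the email, wrapped as a named transform.
      def count[S]: PTransform[PCollection[S], PTable[S, Long]] =
        new PTransform[PCollection[S], PTable[S, Long]] {
          val name = "Count"
          def expand(input: PCollection[S]): PTable[S, Long] =
            PTable(input.elems.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }.toSeq)
        }

      def main(args: Array[String]): Unit = {
        val countWords = count[String]
        val words = PCollection(Seq("a", "b", "a", "c", "a"))
        val counts = countWords.expand(words)
        println(s"${countWords.name}: ${counts.elems.sortBy(_._1)}")
      }
    }

In the SCS model described above, count() is just a helper function and its
name is gone by planning time; attaching the name to the transform object
itself is what lets the hierarchical structure survive into execution.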

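And a rough, non-distributed sketch of the gap-based sessionization
described in point 2: sessions are defined on event time (when the event
happened), not arrival time, and a session for a key ends once more than a
pre-defined gap passes with no new events for that key. All names below are
made up for illustration; this is not CDF's windowing API.

    // Sketch only: in-memory sessionization by event-time gap per key.
    object SessionSketch {

      final case class Event(key: String, eventTimeMillis: Long)
      final case class Session(key: String, startMillis: Long, endMillis: Long, count: Int)

      // Group each key's events into sessions separated by more than gapMillis.
      def sessionize(events: Seq[Event], gapMillis: Long): Seq[Session] =
        events
          .groupBy(_.key)
          .toSeq
          .flatMap { case (key, evs) =>
            val times = evs.map(_.eventTimeMillis).sorted
            // Fold over the sorted event times, extending the open session or
            // starting a new one whenever the gap to the previous event is too large.
            times.foldLeft(List.empty[Session]) {
              case (Nil, t) =>
                List(Session(key, t, t, 1))
              case (current :: done, t) if t - current.endMillis <= gapMillis =>
                current.copy(endMillis = t, count = current.count + 1) :: done
              case (done, t) =>
                Session(key, t, t, 1) :: done
            }.reverse
          }

      def main(args: Array[String]): Unit = {
        val events = Seq(
          Event("u1", 1000L), Event("u1", 4000L),  // 3s apart: same session
          Event("u1", 60000L),                     // long gap: new session
          Event("u2", 2000L)
        )
        sessionize(events, gapMillis = 30000L).foreach(println)
      }
    }

Doing this in a real streaming engine is exactly where the MillWheel-style
requirements come in: the engine has to hold the open session as per-key
state and reason about event time rather than arrival time, including
deciding what to do when a late event arrives for a session it has already
closed.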