Hi,
I'm fairly new to Crunch; my motivation for investigating it is to have a portable, higher-level Java API for constructing data pipelines. Reading the docs, I found this:

"Minimal abstractions. Crunch pipelines provide a thin veneer on top of MapReduce. Developers have access to low-level MapReduce APIs whenever they need them. This minimalism also means that Crunch is extremely fast, only slightly slower than a hand-tuned pipeline developed with the MapReduce APIs, and the community is working on making it faster all the time. That said, one of the goals of the project is portability, and the abstractions that Crunch provides are designed to ease the transition from Hadoop 1.0 to Hadoop 2.0 and to provide transparent support for future data processing frameworks that run on Hadoop, including Apache Spark<http://spark.incubator.apache.org/> and Apache Tez<http://tez.incubator.apache.org/>."

This is exactly what I'm looking for. However, I'm also curious to know whether it's possible to use the Spark APIs directly to optimize performance. Does anyone have this use case and can share some experience with mixing the Crunch and Spark APIs?

Thanks,
Shiv
