Hi Paco,
Thank you for the various links and thoughts. Yes - "workflow
abstraction layer" is a better term for what I meant. I have two
questions for you:
1) When you say "Cascading is relatively agnostic about the distributed
topology underneath it," I take that as a hedge suggesting that while
it may be possible to run Spark underneath Cascading, this is not
commonly done and would not necessarily be straightforward. Is this an
unfair reading between the lines - or is Cascading-on-top-of-Spark an
established technology stack that people are actually using?
2) Can you give an example of how Cascading is at a higher level of
abstraction than Spark? When I look at the landing pages for Scalding
(which runs on top of Cascading) and JCascalog (which claims to be yet
another level of abstraction above Cascading), I see getting-started
code snippets that look exactly like the sort of thing you do with
Spark. I can understand why this is a useful approach for a
getting-started page, but it doesn't shed light on how these two
technologies differentiate themselves from Spark with respect to the
abstraction layer they target. Any thoughts on this (or examples!)
would be helpful to me.
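To make concrete what I mean by snippets that "look exactly alike," here is the shape of the word-count example that the getting-started pages on both sides tend to show, sketched with plain Python stand-ins rather than the actual Scalding, JCascalog, or Spark APIs (the function and names below are illustrative only, not real API calls):

```python
from collections import Counter

def word_count(lines):
    # The same pipeline shape the getting-started pages show:
    #   read lines -> split into words -> group by word -> count
    # In Scalding or Spark this would be a chain like
    #   flatMap(split) -> groupBy(word) -> size, but the dataflow
    # being expressed is identical.
    words = [w for line in lines for w in line.split()]
    return dict(Counter(words))

counts = word_count(["to be or", "not to be"])
```

At this level both stacks appear to express the same dataflow, which is why the snippets don't help me see what Cascading adds above it.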
Thanks,
Philip
On 10/28/2013 1:00 PM, Paco Nathan wrote:
Hi Philip,
Cascading is relatively agnostic about the distributed topology
underneath it, especially as of the 2.0 release over a year ago.
There's been some discussion about writing a flow planner for Spark -
i.e., one that would replace the Hadoop flow planner. Not sure if
there's active work on that yet.
There are a few commercial workflow abstraction layers (probably what
was meant by "application layer"?): the Cascading family (incl.
Cascalog, Scalding), Actian's integration of Hadoop/KNIME/etc., and
also the work by Continuum, ODG, and others in the PyData stack.
Spark would not be at the same level of abstraction as Cascading
(business logic, effectively); however, something like MLbase is
ostensibly intended for that: http://www.mlbase.org/
With respect to Spark, two other things to watch... One would
definitely be the PyData stack and its ability to integrate with
PySpark, which is turning out to be a very powerful abstraction -
quite close to a large segment of industry needs. The other project to
watch, on the Scala side, is Summingbird and its evolution at Twitter:
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
Paco
http://amazon.com/dp/1449358721/
On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren
<[email protected] <mailto:[email protected]>> wrote:
My team is investigating a number of technologies in the Big Data
space. A team member recently got turned on to Cascading
<http://www.cascading.org/about-cascading/> as an application
layer for orchestrating complex workflows/scenarios. He asked me
if Spark had an "application layer." My initial reaction was "no" -
that Spark does not have a separate orchestration/application
layer. Instead, the core Spark API (along with Streaming) would
compete directly with Cascading for this kind of functionality, and
the two would not likely be all that complementary. I realize
that I am exposing my ignorance here and could be way
off. Is there anyone who knows a bit about both of these
technologies who could speak to this in broad strokes?
Thanks!
Philip