> 1) when you say "Cascading is relatively agnostic about the distributed
> topology underneath it" I take that as a hedge that suggests that while it
> could be possible to run Spark underneath Cascading this is not something
> commonly done or would necessarily be straightforward.  Is this an unfair
> reading between the lines - or is Cascading-on-top-of-Spark an established
> technology stack that people are actually using?

Not yet established technology AFAIK, but I have heard Oscar mention the
possibility of Scalding in the future being able to shift gears, as it
were -- handling flows against very large datasets using Hadoop MR, but then
transparently shifting to using Spark on the backend once a relevant subset
of the data has been reduced/extracted that is small enough to fit into the
aggregate memory of an available Spark cluster.

I think that Paco's point isn't that such things are easily being done
right now so much as it is that the underlying architecture of
Cascading/Scalding is generic or abstract enough that such things are quite
conceivable.


On Mon, Oct 28, 2013 at 2:20 PM, Philip Ogren <[email protected]> wrote:

> Hi Paco,
>
> Thank you for the various links and thoughts.  Yes - "workflow abstraction
> layer" is a better term for what I meant.  I have two questions for you:
>
> 1) when you say "Cascading is relatively agnostic about the distributed
> topology underneath it" I take that as a hedge that suggests that while it
> could be possible to run Spark underneath Cascading this is not something
> commonly done or would necessarily be straightforward.  Is this an unfair
> reading between the lines - or is Cascading-on-top-of-Spark an established
> technology stack that people are actually using?
>
> 2) Can you give an example of how Cascading is at a higher level of
> abstraction than Spark?  When I look at the landing page for Scalding
> (which runs on top of Cascading) and JCascalog (which claims to be yet another
> level of abstraction above Cascading) I see getting started code snippets
> that look exactly like the sort of thing you do with Spark.  I can
> understand why this is a useful approach for a getting started page but it
> doesn't shed light on how these two technologies might differentiate from
> Spark with respect to the abstraction layer they target.  Any thoughts on
> this (or examples!) would be helpful to me.
>
> Thanks,
> Philip
>
>
> On 10/28/2013 1:00 PM, Paco Nathan wrote:
>
> Hi Philip,
>
> Cascading is relatively agnostic about the distributed topology
> underneath it, especially as of the 2.0 release over a year ago. There's
> been some discussion about writing a flow planner for Spark -- e.g., one that
> would replace the Hadoop flow planner. Not sure if there's active work on
> that yet.
>
> There are a few commercial workflow abstraction layers (probably what
> was meant by "application layer"?), in terms of the Cascading family
> (incl. Cascalog, Scalding), and also Actian's integration of
> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the
> Py data stack.
>
> Spark would not be at the same level of abstraction as Cascading
> (business logic, effectively); however, something like MLbase is ostensibly
> intended for that: http://www.mlbase.org/
>
> With respect to Spark, two other things to watch... One would definitely
> be the Py data stack and ability to integrate with PySpark, which is
> turning out to be a very powerful abstraction -- quite close to a large segment
> of industry needs.  The other project to watch, on the Scala side, is
> Summingbird and its evolution at Twitter:
> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>
> Paco
> http://amazon.com/dp/1449358721/
>
>
> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <[email protected]> wrote:
>
>>
>> My team is investigating a number of technologies in the Big Data space.
>> A team member recently got turned on to
>> Cascading <http://www.cascading.org/about-cascading/> as an application layer
>> for orchestrating complex workflows/scenarios.  He
>> asked me if Spark had an "application layer"?  My initial reaction is "no" --
>> that Spark would not have a separate orchestration/application layer.
>> Instead, the core Spark API (along with Streaming) would compete directly
>> with Cascading for this kind of functionality and that the two would not
>> likely be all that complementary.  I realize that I am exposing my
>> ignorance here and could be way off.  Is there anyone who knows a bit about
>> both of these technologies who could speak to this in broad strokes?
>>
>> Thanks!
>> Philip
>>
