And I didn't mean to skip over you, Koert. I'm just more familiar with what Oscar said on the subject than with your opinion.
On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <[email protected]> wrote:

>> Hmmm... I was unaware of this concept that Spark is for medium to large
>> datasets but not for very large datasets.
>
> It is the opinion of some at Twitter. That doesn't make it true or a
> universally held opinion.
>
> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <[email protected]> wrote:
>
>> Hmmm... I was unaware of this concept that Spark is for medium to large
>> datasets but not for very large datasets. What size is very large?
>>
>> Can someone please elaborate on why this would be the case and what stops
>> Spark, as it is today, from being successfully run on very large datasets?
>> I'll appreciate it.
>>
>> I would think that Spark should be able to pull off Hadoop-level
>> throughput in the worst case with DISK_ONLY caching.
>>
>> Thanks
>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <[email protected]> wrote:
>>
>>> i would say scalding (cascading + DSL for scala) offers similar
>>> functionality to spark, and a similar syntax.
>>> the main difference between spark and scalding is target jobs:
>>> scalding is for long running jobs on very large data. the data is read
>>> from and written to disk between steps. jobs run from minutes to days.
>>> spark is for faster jobs on medium to large data. the data is primarily
>>> held in memory. jobs run from a few seconds to a few hours. although spark
>>> can work with data on disks it still makes assumptions that data needs to
>>> fit in memory for certain steps (although less and less with every
>>> release). spark also makes iterative designs much easier.
>>>
>>> i have found them both great to program in and complementary. we use
>>> scalding for overnight batch processes and spark for more realtime
>>> processes. at this point i would trust scalding a lot more due to the
>>> robustness of the stack, but spark is getting better every day.
>>>
>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <[email protected]> wrote:
>>>
>>>> Hi Philip,
>>>>
>>>> Cascading is relatively agnostic about the distributed topology
>>>> underneath it, especially as of the 2.0 release over a year ago. There's
>>>> been some discussion about writing a flow planner for Spark -- e.g., one
>>>> which would replace the Hadoop flow planner. Not sure if there's active
>>>> work on that yet.
>>>>
>>>> There are a few commercial workflow abstraction layers (probably what
>>>> was meant by "application layer"?), in terms of the Cascading family
>>>> (incl. Cascalog, Scalding), and also Actian's integration of
>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the
>>>> Py data stack.
>>>>
>>>> Spark would not be at the same level of abstraction as Cascading
>>>> (business logic, effectively); however, something like MLbase is
>>>> ostensibly intended for that: http://www.mlbase.org/
>>>>
>>>> With respect to Spark, two other things to watch... One would
>>>> definitely be the Py data stack and the ability to integrate with PySpark,
>>>> which is turning out to be a very powerful abstraction -- quite close to a
>>>> large segment of industry needs. The other project to watch, on the Scala
>>>> side, is Summingbird and its evolution at Twitter:
>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>
>>>> Paco
>>>> http://amazon.com/dp/1449358721/
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <[email protected]> wrote:
>>>>
>>>>> My team is investigating a number of technologies in the Big Data
>>>>> space. A team member recently got turned on to Cascading
>>>>> <http://www.cascading.org/about-cascading/> as an application
>>>>> layer for orchestrating complex workflows/scenarios. He asked me
>>>>> if Spark had an "application layer"? My initial reaction is "no":
>>>>> that Spark would not have a separate orchestration/application layer.
>>>>> Instead, the core Spark API (along with Streaming) would compete directly
>>>>> with Cascading for this kind of functionality and that the two would not
>>>>> likely be all that complementary. I realize that I am exposing my
>>>>> ignorance here and could be way off. Is there anyone who knows a bit
>>>>> about both of these technologies who could speak to this in broad
>>>>> strokes?
>>>>>
>>>>> Thanks!
>>>>> Philip
