> i am actually not familiar with what oscar has said on this. can you share
> or point me to the conversation thread?
One of the places was this panel discussion
<http://www.meetup.com/hadoopsf/events/141368262/>, but it doesn't look like
there is a recording of it available, so I guess that's not too helpful....

On Mon, Oct 28, 2013 at 5:32 PM, Koert Kuipers <[email protected]> wrote:

> no problem :) i am actually not familiar with what oscar has said on this.
> can you share or point me to the conversation thread?
>
> it is my opinion based on the little experimenting i have done. but i am
> willing to be convinced otherwise.
> one of the very first things i did when we started using spark is run jobs
> with DISK_ONLY, and see if it could do some of the jobs that map-reduce
> does for us. however i ran into OOMs, presumably because spark makes
> assumptions that some things should fit in memory. i have to admit i
> didn't try too hard after the first OOMs.
>
> if spark were able to scale from the quick in-memory query to the
> overnight disk-only giant batch query, i would love it! spark has a much
> nicer api than map-red, and one could use a single set of algos for
> everything from quick/realtime queries to giant batch jobs. as far as i am
> concerned map-red would be done. our clusters of the future would be
> hdfs + spark.
>
> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <[email protected]> wrote:
>
>> And I didn't mean to skip over you, Koert. I'm just more familiar with
>> what Oscar said on the subject than with your opinion.
>>
>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <[email protected]> wrote:
>>
>>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>>> datasets but not for very large datasets.
>>>
>>> It is in the opinion of some at Twitter. That doesn't make it true or a
>>> universally held opinion.
>>>
>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <[email protected]> wrote:
>>>
>>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>>> datasets but not for very large datasets. What size is very large?
>>>>
>>>> Can someone please elaborate on why this would be the case and what
>>>> stops Spark, as it is today, from being run successfully on very large
>>>> datasets? I'll appreciate it.
>>>>
>>>> I would think that Spark should be able to pull off Hadoop level
>>>> throughput in worst case with DISK_ONLY caching.
>>>>
>>>> Thanks
>>>>
>>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <[email protected]> wrote:
>>>>
>>>>> i would say scalding (cascading + DSL for scala) offers similar
>>>>> functionality to spark, and a similar syntax.
>>>>> the main difference between spark and scalding is target jobs:
>>>>> scalding is for long running jobs on very large data. the data is read
>>>>> from and written to disk between steps. jobs run from minutes to days.
>>>>> spark is for faster jobs on medium to large data. the data is
>>>>> primarily held in memory. jobs run from a few seconds to a few hours.
>>>>> although spark can work with data on disks it still makes assumptions
>>>>> that data needs to fit in memory for certain steps (although less and
>>>>> less with every release). spark also makes iterative designs much
>>>>> easier.
>>>>>
>>>>> i have found them both great to program in and complementary. we use
>>>>> scalding for overnight batch processes and spark for more realtime
>>>>> processes. at this point i would trust scalding a lot more due to the
>>>>> robustness of the stack, but spark is getting better every day.
>>>>>
>>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <[email protected]> wrote:
>>>>>
>>>>>> Hi Philip,
>>>>>>
>>>>>> Cascading is relatively agnostic about the distributed topology
>>>>>> underneath it, especially as of the 2.0 release over a year ago.
>>>>>> There's been some discussion about writing a flow planner for
>>>>>> Spark -- e.g., which would replace the Hadoop flow planner. Not sure
>>>>>> if there's active work on that yet.
>>>>>>
>>>>>> There are a few commercial workflow abstraction layers (probably what
>>>>>> was meant by "application layer" ?), in terms of the Cascading family
>>>>>> (incl. Cascalog, Scalding), and also Actian's integration of
>>>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in
>>>>>> the Py data stack.
>>>>>>
>>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>>> (business logic, effectively); however, something like MLbase is
>>>>>> ostensibly intended for that http://www.mlbase.org/
>>>>>>
>>>>>> With respect to Spark, two other things to watch... One would
>>>>>> definitely be the Py data stack and ability to integrate with PySpark,
>>>>>> which is turning out to be a very powerful abstraction -- quite close
>>>>>> to a large segment of industry needs. The other project to watch, on
>>>>>> the Scala side, is Summingbird and its evolution at Twitter:
>>>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>>>
>>>>>> Paco
>>>>>> http://amazon.com/dp/1449358721/
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <[email protected]> wrote:
>>>>>>
>>>>>>> My team is investigating a number of technologies in the Big Data
>>>>>>> space. A team member recently got turned on to Cascading
>>>>>>> <http://www.cascading.org/about-cascading/> as an application
>>>>>>> layer for orchestrating complex workflows/scenarios. He asked me if
>>>>>>> Spark had an "application layer"? My initial reaction is "no": Spark
>>>>>>> would not have a separate orchestration/application layer. Instead,
>>>>>>> the core Spark API (along with Streaming) would compete directly
>>>>>>> with Cascading for this kind of functionality, and the two would not
>>>>>>> likely be all that complementary. I realize that I am exposing my
>>>>>>> ignorance here and could be way off. Is there anyone who knows a bit
>>>>>>> about both of these technologies who could speak to this in broad
>>>>>>> strokes?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Philip
