Hey Koert,

Can you give me steps to reproduce this?
On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers <[email protected]> wrote:

> Matei,
> We have some jobs where even the input for a single key in a groupBy would
> not fit in a task's memory. We rely on mapred to stream from disk to disk
> as it reduces. I think spark should be able to handle that situation to
> truly be able to claim it can replace map-red (or not?).
> Best, Koert
>
> On Mon, Oct 28, 2013 at 8:51 PM, Matei Zaharia <[email protected]> wrote:
>
>> FWIW, the only thing that Spark expects to fit in memory if you use
>> DISK_ONLY caching is the input to each reduce task. Those currently don't
>> spill to disk. The solution if datasets are large is to add more reduce
>> tasks, whereas Hadoop would run along with a small number of tasks that
>> do lots of disk IO. But this is something we will likely change soon.
>> Other than that, everything runs in a streaming fashion and there's no
>> need for the data to fit in memory. Our goal is certainly to work on
>> datasets of any size, and some of our current users are explicitly using
>> Spark to replace things like Hadoop Streaming in batch jobs (see e.g.
>> Yahoo!'s presentation from http://ampcamp.berkeley.edu/3/). If you run
>> into trouble with these, let us know, since it is an explicit goal of the
>> project to support them.
>>
>> Matei
>>
>> On Oct 28, 2013, at 5:32 PM, Koert Kuipers <[email protected]> wrote:
>>
>> no problem :) i am actually not familiar with what oscar has said on
>> this. can you share or point me to the conversation thread?
>>
>> it is my opinion based on the little experimenting i have done, but i am
>> willing to be convinced otherwise. one of the very first things i did
>> when we started using spark was to run jobs with DISK_ONLY and see if it
>> could do some of the jobs that map-reduce does for us. however, i ran
>> into OOMs, presumably because spark makes assumptions that some things
>> should fit in memory. i have to admit i didn't try too hard after the
>> first OOMs.
>> if spark were able to scale from the quick in-memory query to the
>> overnight disk-only giant batch query, i would love it! spark has a much
>> nicer api than map-red, and one could use a single set of algos for
>> everything from quick/realtime queries to giant batch jobs. as far as i
>> am concerned map-red would be done. our clusters of the future would be
>> hdfs + spark.
>>
>> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <[email protected]> wrote:
>>
>>> And I didn't mean to skip over you, Koert. I'm just more familiar with
>>> what Oscar said on the subject than with your opinion.
>>>
>>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <[email protected]> wrote:
>>>
>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>> large datasets but not for very large datasets.
>>>>
>>>> It is the opinion of some at Twitter. That doesn't make it true or a
>>>> universally held opinion.
>>>>
>>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <[email protected]> wrote:
>>>>
>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>> large datasets but not for very large datasets. What size is very
>>>>> large?
>>>>>
>>>>> Can someone please elaborate on why this would be the case and what
>>>>> stops Spark, as it is today, from being successfully run on very large
>>>>> datasets? I'd appreciate it.
>>>>>
>>>>> I would think that Spark should be able to pull off Hadoop-level
>>>>> throughput in the worst case with DISK_ONLY caching.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <[email protected]> wrote:
>>>>>
>>>>>> i would say scalding (cascading + DSL for scala) offers similar
>>>>>> functionality to spark, and a similar syntax.
>>>>>>
>>>>>> the main difference between spark and scalding is target jobs:
>>>>>> scalding is for long-running jobs on very large data. the data is
>>>>>> read from and written to disk between steps. jobs run from minutes
>>>>>> to days.
>>>>>> spark is for faster jobs on medium to large data. the data is
>>>>>> primarily held in memory. jobs run from a few seconds to a few
>>>>>> hours. although spark can work with data on disk, it still makes
>>>>>> assumptions that data needs to fit in memory for certain steps
>>>>>> (although less and less with every release). spark also makes
>>>>>> iterative designs much easier.
>>>>>>
>>>>>> i have found them both great to program in and complementary. we use
>>>>>> scalding for overnight batch processes and spark for more realtime
>>>>>> processes. at this point i would trust scalding a lot more due to
>>>>>> the robustness of the stack, but spark is getting better every day.
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Philip,
>>>>>>>
>>>>>>> Cascading is relatively agnostic about the distributed topology
>>>>>>> underneath it, especially as of the 2.0 release over a year ago.
>>>>>>> There's been some discussion about writing a flow planner for
>>>>>>> Spark, e.g. one which would replace the Hadoop flow planner. Not
>>>>>>> sure if there's active work on that yet.
>>>>>>>
>>>>>>> There are a few commercial workflow abstraction layers (probably
>>>>>>> what was meant by "application layer"?): the Cascading family
>>>>>>> (incl. Cascalog, Scalding), Actian's integration of
>>>>>>> Hadoop/Knime/etc., and the work by Continuum, ODG, and others in
>>>>>>> the Py data stack.
>>>>>>>
>>>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>>>> (business logic, effectively); however, something like MLbase is
>>>>>>> ostensibly intended for that: http://www.mlbase.org/
>>>>>>>
>>>>>>> With respect to Spark, two other things to watch...
>>>>>>> One would definitely be the Py data stack and the ability to
>>>>>>> integrate with PySpark, which is turning out to be a very powerful
>>>>>>> abstraction, quite close to a large segment of industry needs. The
>>>>>>> other project to watch, on the Scala side, is Summingbird and its
>>>>>>> evolution at Twitter:
>>>>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>>>>
>>>>>>> Paco
>>>>>>> http://amazon.com/dp/1449358721/
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> My team is investigating a number of technologies in the Big Data
>>>>>>>> space. A team member recently got turned on to Cascading
>>>>>>>> <http://www.cascading.org/about-cascading/> as an application
>>>>>>>> layer for orchestrating complex workflows/scenarios. He asked me
>>>>>>>> if Spark had an "application layer". My initial reaction is "no":
>>>>>>>> Spark would not have a separate orchestration/application layer.
>>>>>>>> Instead, the core Spark API (along with Streaming) would compete
>>>>>>>> directly with Cascading for this kind of functionality, and the
>>>>>>>> two would not likely be all that complementary. I realize that I
>>>>>>>> am exposing my ignorance here and could be way off. Is there
>>>>>>>> anyone who knows a bit about both of these technologies who could
>>>>>>>> speak to this in broad strokes?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Philip
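The disk-to-disk streaming reduce that Koert relies on in map-reduce can be illustrated outside Hadoop. Below is a minimal, hypothetical Python sketch (not Hadoop or Spark code; the names and the sum-per-key aggregation are made up for illustration): records are sorted by key on disk, and the reducer then streams over them one key at a time, so all the values for a single key never have to sit in memory together.

```python
import itertools
import tempfile

def write_sorted_run(records, path):
    # A real framework would do an external sort across spill files;
    # here the run fits in memory only to keep the sketch short.
    with open(path, "w") as f:
        for key, value in sorted(records):
            f.write(f"{key}\t{value}\n")

def streaming_reduce(path):
    # Stream key-grouped records from disk, aggregating one key at a time.
    # itertools.groupby only needs the input sorted by key, which is
    # exactly what the shuffle/sort phase guarantees a reducer.
    with open(path) as f:
        rows = (line.rstrip("\n").split("\t") for line in f)
        for key, group in itertools.groupby(rows, key=lambda kv: kv[0]):
            yield key, sum(int(v) for _, v in group)

records = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
with tempfile.NamedTemporaryFile(mode="w", suffix=".tsv", delete=False) as tmp:
    path = tmp.name
write_sorted_run(records, path)
print(dict(streaming_reduce(path)))  # {'a': 5, 'b': 5}
```

The point of the sketch is that memory use is bounded by one record plus one running aggregate, not by the number of values per key, which is why map-reduce can reduce arbitrarily large key groups.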
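Matei's suggestion of adding more reduce tasks so that each task's input fits in memory comes down to how a shuffle hash-partitions keys; in Spark this corresponds to passing a larger number of partitions to operations like groupByKey. A small self-contained sketch (illustrative only; `partition_sizes` is a made-up helper, not a Spark API):

```python
from collections import defaultdict

def partition_sizes(pairs, num_partitions):
    # Hash-partition key/value pairs the way a shuffle would, and count
    # how many records land in each reduce partition.
    sizes = defaultdict(int)
    for key, _ in pairs:
        sizes[hash(key) % num_partitions] += 1
    return sizes

pairs = [(k, None) for k in range(100_000)]
few = partition_sizes(pairs, 4)    # few reduce tasks: big per-task inputs
many = partition_sizes(pairs, 64)  # more reduce tasks: smaller inputs

# With 16x more partitions, the largest input any single reduce task must
# hold in memory is roughly 16x smaller.
print(max(few.values()), max(many.values()))  # 25000 1563
```

This only helps when keys hash out evenly; it does not address Koert's case of a single key whose values alone exceed a task's memory, which is why spilling within a reduce task was still needed at the time.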
