Hey Prashant,

I assume you mean steps to reproduce the OOM. I do not have any currently. I just ran into the OOMs when porting some jobs from map-red. I never turned it into a reproducible test, and I do not rule out that my own poor programming caused it. However, it happened with a bunch of jobs, and when I asked on the message boards about the OOM, people pointed me to the assumption that reducer input has to fit in memory. At that point I felt that was too much of a limitation for the jobs I was trying to port, and I gave up.
On Tue, Oct 29, 2013 at 1:12 AM, Prashant Sharma <scrapco...@gmail.com> wrote:

> Hey Koert,
>
> Can you give me steps to reproduce this?
>
> On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Matei,
>> We have some jobs where even the input for a single key in a groupBy
>> would not fit in the task's memory. We rely on mapred to stream from
>> disk to disk as it reduces.
>> I think spark should be able to handle that situation to truly be able to
>> claim it can replace map-red (or not?).
>> Best, Koert
>>
>> On Mon, Oct 28, 2013 at 8:51 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> FWIW, the only thing that Spark expects to fit in memory if you use
>>> DISK_ONLY caching is the input to each reduce task. Those currently don't
>>> spill to disk. The solution if datasets are large is to add more reduce
>>> tasks, whereas Hadoop would run along with a small number of tasks that do
>>> lots of disk IO. But this is something we will likely change soon. Other
>>> than that, everything runs in a streaming fashion and there's no need for
>>> the data to fit in memory. Our goal is certainly to work on any size
>>> datasets, and some of our current users are explicitly using Spark to
>>> replace things like Hadoop Streaming in just batch jobs (see e.g. Yahoo!'s
>>> presentation from http://ampcamp.berkeley.edu/3/). If you run into
>>> trouble with these, let us know, since it is an explicit goal of the
>>> project to support it.
>>>
>>> Matei
>>>
>>> On Oct 28, 2013, at 5:32 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> no problem :) i am actually not familiar with what oscar has said on
>>> this. can you share or point me to the conversation thread?
>>>
>>> it is my opinion based on the little experimenting i have done. but i am
>>> willing to be convinced otherwise.
>>>
>>> one of the very first things i did when we started using spark is run
>>> jobs with DISK_ONLY, and see if it could do some of the jobs that
>>> map-reduce does for us. however i ran into OOMs, presumably because spark
>>> makes assumptions that some things should fit in memory. i have to admit
>>> i didn't try too hard after the first OOMs.
>>>
>>> if spark were able to scale from the quick in-memory query to the
>>> overnight disk-only giant batch query, i would love it! spark has a much
>>> nicer api than map-red, and one could use a single set of algos for
>>> everything from quick/realtime queries to giant batch jobs. as far as i am
>>> concerned map-red would be done. our clusters of the future would be
>>> hdfs + spark.
>>>
>>> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> And I didn't mean to skip over you, Koert. I'm just more familiar with
>>>> what Oscar said on the subject than with your opinion.
>>>>
>>>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <m...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>>> large datasets but not for very large datasets.
>>>>>
>>>>> It is, in the opinion of some at Twitter. That doesn't make it true or
>>>>> a universally held opinion.
>>>>>
>>>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <arang...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>>> large datasets but not for very large datasets. What size is very large?
>>>>>>
>>>>>> Can someone please elaborate on why this would be the case and what
>>>>>> stops Spark, as it is today, from being successfully run on very large
>>>>>> datasets?
>>>>>> I'll appreciate it.
>>>>>>
>>>>>> I would think that Spark should be able to pull off Hadoop level
>>>>>> throughput in worst case with DISK_ONLY caching.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>
>>>>>>> i would say scalding (cascading + DSL for scala) offers similar
>>>>>>> functionality to spark, and a similar syntax.
>>>>>>> the main difference between spark and scalding is target jobs:
>>>>>>> scalding is for long running jobs on very large data. the data is
>>>>>>> read from and written to disk between steps. jobs run from minutes to
>>>>>>> days.
>>>>>>> spark is for faster jobs on medium to large data. the data is
>>>>>>> primarily held in memory. jobs run from a few seconds to a few hours.
>>>>>>> although spark can work with data on disks it still makes assumptions
>>>>>>> that data needs to fit in memory for certain steps (although less and
>>>>>>> less with every release). spark also makes iterative designs much
>>>>>>> easier.
>>>>>>>
>>>>>>> i have found them both great to program in and complementary. we use
>>>>>>> scalding for overnight batch processes and spark for more realtime
>>>>>>> processes. at this point i would trust scalding a lot more due to the
>>>>>>> robustness of the stack, but spark is getting better every day.
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <cet...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Philip,
>>>>>>>>
>>>>>>>> Cascading is relatively agnostic about the distributed topology
>>>>>>>> underneath it, especially as of the 2.0 release over a year ago.
>>>>>>>> There's been some discussion about writing a flow planner for Spark
>>>>>>>> -- e.g., which would replace the Hadoop flow planner. Not sure if
>>>>>>>> there's active work on that yet.
>>>>>>>>
>>>>>>>> There are a few commercial workflow abstraction layers (probably
>>>>>>>> what was meant by "application layer"?), in terms of the Cascading
>>>>>>>> family (incl.
>>>>>>>> Cascalog, Scalding), and also Actian's integration of
>>>>>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in
>>>>>>>> the Py data stack.
>>>>>>>>
>>>>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>>>>> (business logic, effectively); however, something like MLbase is
>>>>>>>> ostensibly intended for that: http://www.mlbase.org/
>>>>>>>>
>>>>>>>> With respect to Spark, two other things to watch... One would
>>>>>>>> definitely be the Py data stack and ability to integrate with PySpark,
>>>>>>>> which is turning out to be a very powerful abstraction -- quite close
>>>>>>>> to a large segment of industry needs. The other project to watch, on
>>>>>>>> the Scala side, is Summingbird and its evolution at Twitter:
>>>>>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>>>>>
>>>>>>>> Paco
>>>>>>>> http://amazon.com/dp/1449358721/
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren
>>>>>>>> <philip.og...@oracle.com> wrote:
>>>>>>>>
>>>>>>>>> My team is investigating a number of technologies in the Big Data
>>>>>>>>> space. A team member recently got turned on to Cascading
>>>>>>>>> <http://www.cascading.org/about-cascading/> as an application
>>>>>>>>> layer for orchestrating complex workflows/scenarios. He asked me
>>>>>>>>> if Spark had an "application layer". My initial reaction is "no":
>>>>>>>>> Spark would not have a separate orchestration/application layer.
>>>>>>>>> Instead, the core Spark API (along with Streaming) would compete
>>>>>>>>> directly with Cascading for this kind of functionality, and the two
>>>>>>>>> would not likely be all that complementary. I realize that I am
>>>>>>>>> exposing my ignorance here and could be way off. Is there anyone who
>>>>>>>>> knows a bit about both of these technologies who could speak to this
>>>>>>>>> in broad strokes?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Philip
>
> --
> s