Matei,
We have some jobs where even the input for a single key in a groupBy would
not fit in the task's memory. We rely on mapred to stream from disk to
disk as it reduces.
I think spark should be able to handle that situation to truly be able to
claim it can replace map-red (or not?).
Best, Koert


On Mon, Oct 28, 2013 at 8:51 PM, Matei Zaharia <[email protected]> wrote:

> FWIW, the only thing that Spark expects to fit in memory if you use
> DISK_ONLY caching is the input to each reduce task. Those currently don't
> spill to disk. The solution if datasets are large is to add more reduce
> tasks, whereas Hadoop would run along with a small number of tasks that do
> lots of disk IO. But this is something we will likely change soon. Other
> than that, everything runs in a streaming fashion and there's no need for
> the data to fit in memory. Our goal is certainly to work on any size
> datasets, and some of our current users are explicitly using Spark to
> replace things like Hadoop Streaming in just batch jobs (see e.g. Yahoo!'s
> presentation from http://ampcamp.berkeley.edu/3/). If you run into
> trouble with these, let us know, since it is an explicit goal of the
> project to support it.
>
> Matei
>
> On Oct 28, 2013, at 5:32 PM, Koert Kuipers <[email protected]> wrote:
>
> no problem :) i am actually not familiar with what oscar has said on this.
> can you share or point me to the conversation thread?
>
> it is my opinion based on the little experimenting i have done. but i am
> willing to be convinced otherwise.
> one of the very first things i did when we started using spark is run jobs
> with DISK_ONLY, and see if it could do some of the jobs that map-reduce does
> for us. however i ran into OOMs, presumably because spark makes assumptions
> that some things should fit in memory. i have to admit i didn't try too
> hard after the first OOMs.
>
> if spark were able to scale from the quick in-memory query to the
> overnight disk-only giant batch query, i would love it! spark has a much
> nicer api than map-red, and one could use a single set of algos for
> everything from quick/realtime queries to giant batch jobs. as far as i am
> concerned map-red would be done. our clusters of the future would be hdfs +
> spark.
>
>
> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <[email protected]> wrote:
>
>> And I didn't mean to skip over you, Koert.  I'm just more familiar with
>> what Oscar said on the subject than with your opinion.
>>
>>
>>
>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <[email protected]> wrote:
>>
>>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>>> datasets but not for very large datasets.
>>>
>>>
>>> It is in the opinion of some at Twitter.  That doesn't make it true or a
>>> universally held opinion.
>>>
>>>
>>>
>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <[email protected]> wrote:
>>>
>>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>>> datasets but not for very large datasets. What size is very large?
>>>>
>>>> Can someone please elaborate on why this would be the case and what
>>>> stops Spark, as it is today, from being run successfully on very large datasets?
>>>> I'll appreciate it.
>>>>
>>>> I would think that Spark should be able to pull off Hadoop level
>>>> throughput in worst case with DISK_ONLY caching.
>>>>
>>>> Thanks
>>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <[email protected]> wrote:
>>>>
>>>>> i would say scalding (cascading + DSL for scala) offers similar
>>>>> functionality to spark, and a similar syntax.
>>>>> the main difference between spark and scalding is target jobs:
>>>>> scalding is for long running jobs on very large data. the data is read
>>>>> from and written to disk between steps. jobs run from minutes to days.
>>>>> spark is for faster jobs on medium to large data. the data is
>>>>> primarily held in memory. jobs run from a few seconds to a few hours.
>>>>> although spark can work with data on disks it still makes assumptions that
>>>>> data needs to fit in memory for certain steps (although less and less with
>>>>> every release). spark also makes iterative designs much easier.
>>>>>
>>>>> i have found them both great to program in and complementary. we use
>>>>> scalding for overnight batch processes and spark for more realtime
>>>>> processes. at this point i would trust scalding a lot more due to the
>>>>> robustness of the stack, but spark is getting better every day.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <[email protected]> wrote:
>>>>>
>>>>>> Hi Philip,
>>>>>>
>>>>>> Cascading is relatively agnostic about the distributed topology
>>>>>> underneath it, especially as of the 2.0 release over a year ago. There's
>>>>>> been some discussion about writing a flow planner for Spark -- e.g., one
>>>>>> which would replace the Hadoop flow planner. Not sure if there's active
>>>>>> work on that yet.
>>>>>>
>>>>>> There are a few commercial workflow abstraction layers (probably what
>>>>>> was meant by "application layer" ?), in terms of the Cascading family
>>>>>> (incl. Cascalog, Scalding), and also Actian's integration of
>>>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the
>>>>>> Py data stack.
>>>>>>
>>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>>> (business logic, effectively); however, something like MLbase is
>>>>>> ostensibly intended for that http://www.mlbase.org/
>>>>>>
>>>>>> With respect to Spark, two other things to watch... One would
>>>>>> definitely be the Py data stack and ability to integrate with PySpark,
>>>>>> which is turning out to be a very powerful abstraction -- quite close to a
>>>>>> large segment of industry needs.  The other project to watch, on the Scala
>>>>>> side, is Summingbird and its evolution at Twitter:
>>>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>>>
>>>>>> Paco
>>>>>> http://amazon.com/dp/1449358721/
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> My team is investigating a number of technologies in the Big Data
>>>>>>> space.  A team member recently got turned on to Cascading
>>>>>>> <http://www.cascading.org/about-cascading/> as an application layer for
>>>>>>> orchestrating complex workflows/scenarios.  He asked me if Spark had an
>>>>>>> "application layer"?  My initial reaction is "no", that Spark would not
>>>>>>> have a separate orchestration/application layer.  Instead, the core Spark
>>>>>>> API (along with Streaming) would compete directly with Cascading for this
>>>>>>> kind of functionality and that the two would not likely be all that
>>>>>>> complementary.  I realize that I am exposing my ignorance here and could
>>>>>>> be way off.  Is there anyone who knows a bit about both of these
>>>>>>> technologies who could speak to this in broad strokes?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Philip
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>
>
