Re: compare/contrast Spark with Cascading

Koert Kuipers Tue, 29 Oct 2013 07:42:27 -0700

Hey Prashant,
I assume you mean steps to reproduce the OOM. I do not have any currently. I just
ran into them when porting some jobs from map-red. I never turned it into a
reproducible test, and i do not exclude that it was my poor programming
that caused it. However it happened with a bunch of jobs, and then i asked
on the message boards about the OOM, and people pointed me to the
assumption about reducer input having to fit in memory. At that point i
felt like that was too much of a limitation for the jobs i was trying to
port and i gave up.
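
To make that assumption concrete, here is a minimal sketch in plain Python. It is not Spark code: the function name max_partition_size and the toy data are made up for illustration, and key % num_partitions only stands in for a hash partitioner. It shows that adding reduce tasks shrinks the largest reduce input when keys are evenly spread, but that the largest input can never drop below the record count of the single biggest key.

```python
def max_partition_size(keys, num_partitions):
    """Return the record count of the fullest reduce partition.

    Assumption: keys are non-negative ints and `key % num_partitions`
    stands in for a hash partitioner.
    """
    sizes = [0] * num_partitions
    for key in keys:
        sizes[key % num_partitions] += 1
    return max(sizes)


# 1,000 distinct keys with 10 records each (10,000 records in total).
uniform = [k for k in range(1000) for _ in range(10)]

# Same data plus one "hot" key (key 0) with 5,000 additional records.
skewed = uniform + [0] * 5000

uniform_4 = max_partition_size(uniform, 4)       # few reduce tasks
uniform_100 = max_partition_size(uniform, 100)   # many reduce tasks
skewed_100 = max_partition_size(skewed, 100)     # hot key dominates
skewed_1000 = max_partition_size(skewed, 1000)   # still dominated

print(uniform_4, uniform_100, skewed_100, skewed_1000)
```

With evenly spread keys the largest reduce input falls from 2500 records to 100 as the partition count goes from 4 to 100. With the hot key it stays above 5,000 however many partitions are used, which is the single-key groupBy case.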



On Tue, Oct 29, 2013 at 1:12 AM, Prashant Sharma <scrapco...@gmail.com> wrote:

> Hey Koert,
>
> Can you give me steps to reproduce this ?
>
>
> On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Matei,
>> We have some jobs where even the input for a single key in a groupBy
>> would not fit in the task's memory. We rely on mapred to stream from
>> disk to disk as it reduces.
>> I think spark should be able to handle that situation to truly be able to
>> claim it can replace map-red (or not?).
>> Best, Koert
>>
>>
>> On Mon, Oct 28, 2013 at 8:51 PM, Matei Zaharia 
>> <matei.zaha...@gmail.com> wrote:
>>
>>> FWIW, the only thing that Spark expects to fit in memory if you use
>>> DISK_ONLY caching is the input to each reduce task. Those currently don't
>>> spill to disk. The solution if datasets are large is to add more reduce
>>> tasks, whereas Hadoop would run along with a small number of tasks that do
>>> lots of disk IO. But this is something we will likely change soon. Other
>>> than that, everything runs in a streaming fashion and there's no need for
>>> the data to fit in memory. Our goal is certainly to work on any size
>>> datasets, and some of our current users are explicitly using Spark to
>>> replace things like Hadoop Streaming in just batch jobs (see e.g. Yahoo!'s
>>> presentation from http://ampcamp.berkeley.edu/3/). If you run into
>>> trouble with these, let us know, since it is an explicit goal of the
>>> project to support it.
>>>
>>> Matei
>>>
>>> On Oct 28, 2013, at 5:32 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> no problem :) i am actually not familiar with what oscar has said on
>>> this. can you share or point me to the conversation thread?
>>>
>>> it is my opinion based on the little experimenting i have done. but i am
>>> willing to be convinced otherwise.
>>> one of the very first things i did when we started using spark is run jobs
>>> with DISK_ONLY, and see if it could do some of the jobs that map-reduce does
>>> for us. however i ran into OOMs, presumably because spark makes assumptions
>>> that some things should fit in memory. i have to admit i didn't try too
>>> hard after the first OOMs.
>>>
>>> if spark were able to scale from the quick in-memory query to the
>>> overnight disk-only giant batch query, i would love it! spark has a much
>>> nicer api than map-red, and one could use a single set of algos for
>>> everything from quick/realtime queries to giant batch jobs. as far as i am
>>> concerned map-red would be done. our clusters of the future would be hdfs +
>>> spark.
>>>
>>>
>>> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra 
>>> <m...@clearstorydata.com> wrote:
>>>
>>>> And I didn't mean to skip over you, Koert.  I'm just more familiar with
>>>> what Oscar said on the subject than with your opinion.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra 
>>>> <m...@clearstorydata.com> wrote:
>>>>
>>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>>> large datasets but not for very large datasets.
>>>>>
>>>>>
>>>>> It is in the opinion of some at Twitter.  That doesn't make it true or
>>>>> a universally held opinion.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <arang...@gmail.com> wrote:
>>>>>
>>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>>> large datasets but not for very large datasets. What size is very large?
>>>>>>
>>>>>> Can someone please elaborate on why this would be the case and what
>>>>>> stops Spark, as it is today, from being run successfully on very large
>>>>>> datasets? I'd appreciate it.
>>>>>>
>>>>>> I would think that Spark should be able to pull off Hadoop level
>>>>>> throughput in worst case with DISK_ONLY caching.
>>>>>>
>>>>>> Thanks
>>>>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>
>>>>>>> i would say scalding (cascading + DSL for scala) offers similar
>>>>>>> functionality to spark, and a similar syntax.
>>>>>>> the main difference between spark and scalding is target jobs:
>>>>>>> scalding is for long running jobs on very large data. the data is
>>>>>>> read from and written to disk between steps. jobs run from minutes
>>>>>>> to days.
>>>>>>> spark is for faster jobs on medium to large data. the data is
>>>>>>> primarily held in memory. jobs run from a few seconds to a few hours.
>>>>>>> although spark can work with data on disks it still makes assumptions
>>>>>>> that data needs to fit in memory for certain steps (although less and
>>>>>>> less with every release). spark also makes iterative designs much
>>>>>>> easier.
>>>>>>>
>>>>>>> i have found them both great to program in and complementary. we use
>>>>>>> scalding for overnight batch processes and spark for more realtime
>>>>>>> processes. at this point i would trust scalding a lot more due to the
>>>>>>> robustness of the stack, but spark is getting better every day.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <cet...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Philip,
>>>>>>>>
>>>>>>>> Cascading is relatively agnostic about the distributed topology
>>>>>>>> underneath it, especially as of the 2.0 release over a year ago.
>>>>>>>> There's been some discussion about writing a flow planner for
>>>>>>>> Spark -- e.g., which would replace the Hadoop flow planner. Not
>>>>>>>> sure if there's active work on that yet.
>>>>>>>>
>>>>>>>> There are a few commercial workflow abstraction layers (probably
>>>>>>>> what was meant by "application layer" ?), in terms of the Cascading
>>>>>>>> family (incl. Cascalog, Scalding), and also Actian's integration of
>>>>>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others
>>>>>>>> in the Py data stack.
>>>>>>>>
>>>>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>>>>> (business logic, effectively); however, something like MLbase is
>>>>>>>> ostensibly intended for that http://www.mlbase.org/
>>>>>>>>
>>>>>>>> With respect to Spark, two other things to watch... One would
>>>>>>>> definitely be the Py data stack and ability to integrate with PySpark,
>>>>>>>> which is turning out to be a very powerful abstraction -- quite close
>>>>>>>> to a large segment of industry needs.  The other project to watch, on
>>>>>>>> the Scala side, is Summingbird and its evolution at Twitter:
>>>>>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>>>>>
>>>>>>>> Paco
>>>>>>>> http://amazon.com/dp/1449358721/
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <
>>>>>>>> philip.og...@oracle.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> My team is investigating a number of technologies in the Big Data
>>>>>>>>> space.  A team member recently got turned on to Cascading
>>>>>>>>> <http://www.cascading.org/about-cascading/> as an application layer
>>>>>>>>> for orchestrating complex workflows/scenarios.  He asked me if Spark
>>>>>>>>> had an "application layer"?  My initial reaction is "no": Spark would
>>>>>>>>> not have a separate orchestration/application layer.  Instead, the
>>>>>>>>> core Spark API (along with Streaming) would compete directly with
>>>>>>>>> Cascading for this kind of functionality, and the two would not
>>>>>>>>> likely be all that complementary.  I realize that I am exposing my
>>>>>>>>> ignorance here and could be way off.  Is there anyone who knows a bit
>>>>>>>>> about both of these technologies who could speak to this in broad
>>>>>>>>> strokes?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Philip
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
> --
> s
>
