By the way, the reason we have this goal is simple -- nobody wants to be 
managing different compute engines for the same computation. For established 
MapReduce users, it may be easy to keep writing the same code on MR, but we have 
lots of users who've never installed MR and don't want to manage it. So of course we 
develop features and optimizations as we see demand for them, but if there's a 
lot of demand for this, we can do it.

Matei

On Oct 28, 2013, at 5:51 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY 
> caching is the input to each reduce task. Those currently don't spill to 
> disk. The solution if datasets are large is to add more reduce tasks, whereas 
> Hadoop would run along with a small number of tasks that do lots of disk IO. 
> But this is something we will likely change soon. Other than that, everything 
> runs in a streaming fashion and there's no need for the data to fit in 
> memory. Our goal is certainly to work on datasets of any size, and some of our 
> current users are explicitly using Spark to replace things like Hadoop 
> Streaming in purely batch jobs (see e.g. Yahoo!'s presentation from 
> http://ampcamp.berkeley.edu/3/). If you run into trouble with these workloads, 
> let us know, since it is an explicit goal of the project to support them.
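> 
> Concretely, using DISK_ONLY and adding more reduce tasks looks roughly like 
> this in the RDD API (just a sketch; the paths and the partition count here 
> are made up for illustration):
> 
>     import org.apache.spark.SparkContext
>     import org.apache.spark.SparkContext._
>     import org.apache.spark.storage.StorageLevel
> 
>     val sc = new SparkContext("spark://master:7077", "disk-only-example")
> 
>     // keep the cached data on disk instead of in memory
>     val records = sc.textFile("hdfs:///data/big-input")
>       .map(line => (line.split("\t")(0), 1L))
>       .persist(StorageLevel.DISK_ONLY)
> 
>     // reduce-task inputs are built in memory, so use many reduce tasks
>     // (the second argument) to keep each task's share small
>     val counts = records.reduceByKey(_ + _, 2000)
>     counts.saveAsTextFile("hdfs:///data/output")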
> 
> Matei
> 
> On Oct 28, 2013, at 5:32 PM, Koert Kuipers <ko...@tresata.com> wrote:
> 
>> no problem :) i am actually not familiar with what oscar has said on this. 
>> can you share or point me to the conversation thread?
>> 
>> it is my opinion based on the little experimenting i have done. but i am 
>> willing to be convinced otherwise.
>> one of the very first things i did when we started using spark was to run 
>> jobs with DISK_ONLY and see if it could do some of the jobs that map-reduce 
>> does for us. however i ran into OOMs, presumably because spark makes 
>> assumptions that some things should fit in memory. i have to admit i didn't 
>> try too hard after the first OOMs.
>> 
>> if spark were able to scale from the quick in-memory query to the overnight 
>> disk-only giant batch query, i would love it! spark has a much nicer api 
>> than map-red, and one could use a single set of algos for everything from 
>> quick/realtime queries to giant batch jobs. as far as i am concerned map-red 
>> would be done. our clusters of the future would be hdfs + spark.
>> 
>> 
>> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <m...@clearstorydata.com> 
>> wrote:
>> And I didn't mean to skip over you, Koert.  I'm just more familiar with what 
>> Oscar said on the subject than with your opinion.
>> 
>> 
>> 
>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <m...@clearstorydata.com> 
>> wrote:
>> Hmmm... I was unaware of this concept that Spark is for medium to large 
>> datasets but not for very large datasets.
>>  
>> It is the opinion of some at Twitter.  That doesn't make it true or a 
>> universally held opinion.
>> 
>> 
>> 
>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <arang...@gmail.com> wrote:
>> Hmmm... I was unaware of this concept that Spark is for medium to large 
>> datasets but not for very large datasets. What size is very large?
>> 
>> Can someone please elaborate on why this would be the case and what stops 
>> Spark, as it is today, from being successfully run on very large datasets? 
>> I'd appreciate it.
>> 
>> I would think that Spark should be able to pull off Hadoop-level throughput 
>> in the worst case with DISK_ONLY caching.
>> 
>> Thanks
>> 
>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <ko...@tresata.com> wrote:
>> i would say scalding (a scala DSL on top of cascading) offers similar 
>> functionality to spark, and a similar syntax. 
>> the main difference between spark and scalding is target jobs: 
>> scalding is for long-running jobs on very large data. the data is read from 
>> and written to disk between steps. jobs run from minutes to days. 
>> spark is for faster jobs on medium to large data. the data is primarily held 
>> in memory. jobs run from a few seconds to a few hours. although spark can 
>> work with data on disk, it still assumes that data needs to fit in 
>> memory for certain steps (although less and less with every release). spark 
>> also makes iterative designs much easier.
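>> 
>> to give a sense of the similar syntax, here is roughly the same word count 
>> in both (just an untested sketch; the class name and paths are made up, and 
>> sc is the SparkContext you get in the spark shell):
>> 
>>     // scalding (fields API): a cascading job, reading and writing disk between steps
>>     import com.twitter.scalding._
>> 
>>     class WordCountJob(args: Args) extends Job(args) {
>>       TextLine(args("input"))
>>         .flatMap('line -> 'word) { line: String => line.split("\\s+") }
>>         .groupBy('word) { _.size }
>>         .write(Tsv(args("output")))
>>     }
>> 
>>     // spark: the same computation on an RDD, which could also be cached
>>     val counts = sc.textFile("hdfs:///data/in")
>>       .flatMap(_.split("\\s+"))
>>       .map(word => (word, 1))
>>       .reduceByKey(_ + _)
>>     counts.saveAsTextFile("hdfs:///data/out")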
>> 
>> i have found them both great to program in and complementary. we use 
>> scalding for overnight batch processes and spark for more realtime 
>> processes. at this point i would trust scalding a lot more due to the 
>> robustness of the stack, but spark is getting better every day.
>> 
>> 
>> 
>> 
>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <cet...@gmail.com> wrote:
>> Hi Philip,
>> 
>> Cascading is relatively agnostic about the distributed topology underneath 
>> it, especially as of the 2.0 release over a year ago. There's been some 
>> discussion about writing a flow planner for Spark -- i.e., one that would 
>> replace the Hadoop flow planner. Not sure if there's active work on that yet.
>> 
>> There are a few commercial workflow abstraction layers (probably what was 
>> meant by "application layer" ?), in terms of the Cascading family (incl. 
>> Cascalog, Scalding), and also Actian's integration of Hadoop/Knime/etc., and 
>> also the work by Continuum, ODG, and others in the PyData stack.
>> 
>> Spark would not be at the same level of abstraction as Cascading (business 
>> logic, effectively); however, something like MLbase is ostensibly intended 
>> for that: http://www.mlbase.org/
>> 
>> With respect to Spark, two other things to watch... One would definitely be 
>> the PyData stack and the ability to integrate with PySpark, which is turning 
>> out to be a very powerful abstraction -- quite close to a large segment of 
>> industry needs.  The other project to watch, on the Scala side, is 
>> Summingbird and its evolution at Twitter: 
>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>> 
>> Paco
>> http://amazon.com/dp/1449358721/
>> 
>> 
>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <philip.og...@oracle.com> 
>> wrote:
>> 
>> My team is investigating a number of technologies in the Big Data space.  A 
>> team member recently got turned on to Cascading as an application layer for 
>> orchestrating complex workflows/scenarios.  He asked me if Spark had an 
>> "application layer."  My initial reaction is "no": Spark would not have 
>> a separate orchestration/application layer.  Instead, the core Spark API 
>> (along with Streaming) would compete directly with Cascading for this kind 
>> of functionality, and the two would not likely be all that 
>> complementary.  I realize that I am exposing my ignorance here and could be 
>> way off.  Is there anyone who knows a bit about both of these technologies 
>> who could speak to this in broad strokes?  
>> 
>> Thanks!
>> Philip
>> 
>> 
>> 
>> 
>> 
>> 
> 
