Re: compare/contrast Spark with Cascading

2013-10-29 Thread Koert Kuipers
Hey Prashant, I assume you mean steps to reproduce the OOM. I do not currently have any. I just ran into the errors when porting some jobs from map-red. I never turned it into a reproducible test, and I can't rule out that my own poor programming caused it. However, it happened with a bunch of jobs, and

compare/contrast Spark with Cascading

2013-10-28 Thread Philip Ogren
My team is investigating a number of technologies in the Big Data space. A team member recently got turned on to Cascading http://www.cascading.org/about-cascading/ as an application layer for orchestrating complex workflows/scenarios. He asked me whether Spark has an application layer. My

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Paco Nathan
Hi Philip, Cascading is relatively agnostic about the distributed topology underneath it, especially as of the 2.0 release over a year ago. There's been some discussion about writing a flow planner for Spark, which would replace the Hadoop flow planner. Not sure if there's active work on

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
1) When you say Cascading is relatively agnostic about the distributed topology underneath it, I take that as a hedge suggesting that, while it could be possible to run Spark underneath Cascading, this is not something commonly done, nor would it necessarily be straightforward. Is this an unfair

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
And I didn't mean to skip over you, Koert. I'm just more familiar with what Oscar said on the subject than with your opinion. On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra m...@clearstorydata.com wrote: Hmmm... I was unaware of this concept that Spark is for medium to large datasets but not

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY caching is the input to each reduce task. Those currently don't spill to disk. The solution if datasets are large is to add more reduce tasks, whereas Hadoop would run along with a small number of tasks that do lots
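Matei's suggestion — add more reduce tasks rather than let a few tasks accumulate huge in-memory inputs — can be illustrated with a small hash-partitioning sketch. This is plain Python, not Spark itself; the function name and data are my own, chosen only to show why more partitions shrink the largest single-task input.

```python
# Sketch of shuffle partitioning: keys are hash-partitioned across reduce
# tasks, so raising the task count lowers how many records any one task
# must hold in memory.

def partition_sizes(records, num_reduce_tasks):
    """Count how many records land in each reduce task's input."""
    sizes = [0] * num_reduce_tasks
    for key, _value in records:
        sizes[hash(key) % num_reduce_tasks] += 1
    return sizes

records = [("key%d" % i, i) for i in range(100_000)]
few = partition_sizes(records, 4)     # few reduce tasks: big inputs
many = partition_sizes(records, 64)   # many reduce tasks: small inputs

# The largest per-task input shrinks as the task count grows.
print(max(few) > max(many))
```

The same intuition applies to Spark's shuffle as described above: each reduce task's input must fit in memory, so the fix for a large dataset is more, smaller tasks rather than fewer, larger ones.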

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
By the way, the reason we have this goal is simple -- nobody wants to be managing different compute engines for the same computation. For established MapReduce users, it may be easy to write the same code on MR, but we have lots of users who've never installed MR and don't want to manage it. So

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
I am actually not familiar with what Oscar has said on this. Can you share or point me to the conversation thread? One of the places was this panel discussion http://www.meetup.com/hadoopsf/events/141368262/, but it doesn't look like there is a recording of it available, so I guess that's

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
Matei, We have some jobs where even the input for a single key in a groupBy would not fit in the task's memory. We rely on mapred to stream from disk to disk as it reduces. I think Spark should be able to handle that situation to truly claim it can replace map-red (or not?). Best,
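Koert's failure mode comes from materializing every value for a key at once, which is what a groupBy does, versus folding values into a running accumulator the way an MR reducer can stream them. A minimal sketch in plain Python (not Spark or MapReduce; both function names are hypothetical) contrasts the two:

```python
def sum_by_key_materialized(pairs):
    """Collect every value per key before reducing.

    Peak memory grows with the number of values under the largest key --
    this is the shape that can OOM when one key is huge.
    """
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)  # all values held at once
    return {k: sum(vs) for k, vs in groups.items()}

def sum_by_key_streaming(pairs):
    """Fold each value into a running total as it arrives.

    Peak memory grows only with the number of distinct keys; no key's
    values are ever held together, mirroring a streaming reducer.
    """
    totals = {}
    for k, v in pairs:
        totals[k] = totals.get(k, 0) + v  # one value in flight at a time
    return totals

pairs = [("a", i) for i in range(10)] + [("b", i) for i in range(5)]
print(sum_by_key_materialized(pairs) == sum_by_key_streaming(pairs))  # True
```

Both produce the same totals; only the streaming version's memory footprint is independent of how many values share a key, which is why an aggregation that can be expressed as a fold avoids the groupBy OOM described here.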

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Prashant Sharma
Hey Koert, Can you give me steps to reproduce this? On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers ko...@tresata.com wrote: Matei, We have some jobs where even the input for a single key in a groupBy would not fit in the task's memory. We rely on mapred to stream from disk to disk as