Hey Prashant,
I assume you mean steps to reproduce the OOM. I do not have any currently.
I just ran into the errors when porting some jobs from map-red. I never
turned it into a reproducible test, and I do not rule out that my own poor
programming caused it. However, it happened with a bunch of jobs, and
My team is investigating a number of technologies in the Big Data
space. A team member recently got turned on to Cascading
(http://www.cascading.org/about-cascading/) as an application layer for
orchestrating complex workflows/scenarios. He asked me whether Spark has an
application layer. My
Hi Philip,
Cascading is relatively agnostic about the distributed topology underneath
it, especially as of the 2.0 release over a year ago. There's been some
discussion about writing a flow planner for Spark, one that would replace
the Hadoop flow planner. Not sure if there's active work on
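(For anyone following along: in Cascading the pipe assembly is defined
independently of the execution engine, and the planner only comes into play
when the FlowDef is handed to a FlowConnector. A rough sketch of that split,
in Scala against the Cascading 2.x Java API -- the paths are made up, and a
Spark connector is hypothetical, nothing like it ships today:

  import java.util.Properties
  import cascading.flow.FlowDef
  import cascading.flow.hadoop.HadoopFlowConnector
  import cascading.pipe.Pipe
  import cascading.scheme.hadoop.TextLine
  import cascading.tap.SinkMode
  import cascading.tap.hadoop.Hfs

  // Engine-agnostic part: taps and the pipe assembly.
  val source  = new Hfs(new TextLine(), "hdfs:///tmp/in")                   // made-up path
  val sink    = new Hfs(new TextLine(), "hdfs:///tmp/out", SinkMode.REPLACE)
  val pipe    = new Pipe("copy")
  val flowDef = FlowDef.flowDef()
    .addSource(pipe, source)
    .addTailSink(pipe, sink)

  // Engine-specific part: the connector owns the flow planner.
  // HadoopFlowConnector plans the assembly onto MapReduce; a Spark flow
  // planner would be another connector here, with the FlowDef unchanged.
  new HadoopFlowConnector(new Properties()).connect(flowDef).complete()

So "agnostic" means the user-facing flow definition would not change; what is
missing is a planner that targets Spark.)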
1) When you say Cascading is "relatively agnostic about the distributed
topology underneath it," I take that as a hedge suggesting that while it
may be possible to run Spark underneath Cascading, this is not commonly
done, nor would it necessarily be straightforward. Is this an unfair
And I didn't mean to skip over you, Koert. I'm just more familiar with
what Oscar said on the subject than with your opinion.
On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra m...@clearstorydata.com wrote:
Hmmm... I was unaware of this concept that Spark is for medium to large
datasets but not
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY
caching is the input to each reduce task. Those currently don't spill to disk.
The solution if datasets are large is to add more reduce tasks, whereas Hadoop
would run along with a small number of tasks that do lots
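To make that concrete, here is a minimal Scala sketch of the approach (the
input path, the tab-separated key parsing, and the partition count of 2000
are all made up for illustration): DISK_ONLY keeps cached blocks on disk,
and passing a larger partition count to groupByKey spreads the shuffle over
more reduce tasks, so each task's input is smaller.

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._   // implicit pair-RDD operations
  import org.apache.spark.storage.StorageLevel

  val sc = new SparkContext("local[4]", "more-reduce-tasks")

  val pairs = sc.textFile("hdfs:///data/events")            // made-up input
    .map(line => (line.split("\t")(0), line))               // (key, record)
    .persist(StorageLevel.DISK_ONLY)                        // cached blocks live on disk

  // Each reduce task must hold its share of the shuffled data in memory, so
  // asking for more partitions shrinks the per-task input instead of spilling.
  val grouped = pairs.groupByKey(2000)
  grouped.saveAsTextFile("hdfs:///data/events-grouped")     // made-up output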
By the way, the reason we have this goal is simple -- nobody wants to be
managing different compute engines for the same computation. For established
MapReduce users, it may be easy to write the same code on MR, but we have lots
of users who've never installed MR and don't want to manage it. So
I am actually not familiar with what Oscar has said on this. Can you share
it, or point me to the conversation thread?
One of the places was this panel discussion:
http://www.meetup.com/hadoopsf/events/141368262/,
but it doesn't look like there is a recording of it available, so I guess
that's
Matei,
We have some jobs where even the input for a single key in a groupBy would
not fit in the task's memory. We rely on map-red to stream from disk to
disk as it reduces.
I think Spark should be able to handle that situation to truly claim it can
replace map-red (or not?).
Best,
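(For context on why the single-key case hurts: groupByKey materializes all
of a key's values in one task, while a map-red reducer just streams over
them. When the per-key work happens to be an aggregation, one way around it
is to combine incrementally with reduceByKey instead of grouping -- a sketch
under that assumption, with made-up paths and field layout:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._

  val sc = new SparkContext("local[4]", "avoid-groupByKey")

  val events = sc.textFile("hdfs:///data/events")           // made-up input
    .map { line =>
      val fields = line.split("\t")
      (fields(0), fields(1).toLong)                         // (key, value), made-up layout
    }

  // groupByKey would need every value for a key in one task's memory:
  //   val allValues = events.groupByKey()

  // An associative reduction can instead be combined map-side, so all of a
  // key's values never have to sit in memory at once.
  val totals = events.reduceByKey(_ + _)
  totals.saveAsTextFile("hdfs:///data/totals")              // made-up output

This does not cover the general case Koert describes, where the reduce-side
logic genuinely needs to see every value for a key.)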
Hey Koert,
Can you give me steps to reproduce this?
On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers ko...@tresata.com wrote:
Matei,
We have some jobs where even the input for a single key in a groupBy would
not fit in the task's memory. We rely on map-red to stream from disk to
disk as