Re: Number of Clustering MR-Jobs

Dan Filimon Thu, 28 Mar 2013 08:43:06 -0700

From what I've seen, even if the mapper does throw an out of memory
exception, Hadoop will restart it, increasing the memory.


There are ways to configure the mapper/reducer JVMs to use more memory by
default through the Configuration, although I don't recall the exact
options. They should be covered in your Hadoop distribution's
documentation.
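
As a rough sketch of the kind of options meant here (not checked against
any particular cluster): the property names below are an assumption about
your Hadoop version. mapred.child.java.opts is the Hadoop 1.x name, while
the mapreduce.*.java.opts and mapreduce.*.memory.mb names apply to Hadoop
2.x. The helper heap_opts and its 25% container headroom are made up for
illustration. The -D flags only take effect if the driver parses generic
options (for example via ToolRunner).

```python
def heap_opts(heap_mb, hadoop2=False):
    """Build -D generic options asking for heap_mb of JVM heap per task.

    Hypothetical helper for illustration; the property names are assumed
    to match the Hadoop version in use.
    """
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    xmx = "-Xmx%dm" % heap_mb
    if not hadoop2:
        # Hadoop 1.x: a single property covers map and reduce child JVMs.
        return ["-Dmapred.child.java.opts=" + xmx]
    # Hadoop 2.x: the container has to be larger than the heap itself.
    # The 25% headroom is an arbitrary illustrative choice.
    container_mb = heap_mb * 5 // 4
    return [
        "-Dmapreduce.map.memory.mb=%d" % container_mb,
        "-Dmapreduce.map.java.opts=" + xmx,
        "-Dmapreduce.reduce.memory.mb=%d" % container_mb,
        "-Dmapreduce.reduce.java.opts=" + xmx,
    ]


if __name__ == "__main__":
    # Paste the printed flags right after the driver name on the command line.
    print(" ".join(heap_opts(2048)))
```

The same properties can also be set programmatically on the job's
Configuration object before the job is submitted.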


On Thu, Mar 28, 2013 at 2:52 PM, Sebastian Briesemeister <
[email protected]> wrote:

> In my case, each map process requires a lot of memory and I would like
> to distribute this consumption across multiple nodes.
>
> However, I still get out of memory exceptions even if I split the input
> file into several very small input files. I thought the mapper would
> consider only one file at a time and would, hence, have no problems with
> heap space?
>
>
>
> On 28.03.2013 10:20, Ted Dunning wrote:
> > This is a longstanding Hadoop issue.
> >
> > Your suggestion is interesting, but only a few cases would benefit.  The
> > problem is that splitting involves reading from a very small number of
> > nodes and thus is not much better than just running the program with few
> > mappers.  If the data is large enough to make splitting fast, then Hadoop
> > will just do it.
> >
> > The only win for splitting is when the cost per chunk is very high.  I
> > think that only random forest might fit into that category.
> >
> > On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
> > [email protected]> wrote:
> >
> >> Splitting the files leads to multiple MR-tasks!
> >>
> >> Only changing the MR settings of Hadoop did not help. In the future it
> >> would be nice if the drivers would scale themselves and split the
> >> data according to the dataset size and the number of available MR-slots.
> >>
>
>
