Re: Number of Clustering MR-Jobs

Sebastian Briesemeister Thu, 28 Mar 2013 05:53:09 -0700

In my case, each map process requires a lot of memory, and I would like
to distribute this consumption across multiple nodes.


However, I still get out-of-memory exceptions even if I split the input
file into several very small input files. I thought each mapper would
consider only one file at a time and would, hence, have no problems with
heap space?
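
To make the relationship between input files and map tasks concrete, here is
a rough sketch in Python. It is not Hadoop's actual code: the function name
estimate_map_tasks and the sizes are made up for illustration, and the model
is simplified (it ignores the slack Hadoop allows on the last split and
assumes every file is splittable). It only shows that the task count follows
from the file sizes and the split size.

```python
import math

def estimate_map_tasks(file_sizes_bytes, split_size_bytes):
    """Rough estimate of map tasks for file-based input.

    Simplified assumption: every file yields at least one split, and a
    file larger than the split size yields ceil(size / split_size) splits.
    """
    if split_size_bytes <= 0:
        raise ValueError("split_size_bytes must be positive")
    return sum(
        max(1, math.ceil(size / split_size_bytes))
        for size in file_sizes_bytes
    )

MB = 1024 * 1024

# Hypothetical inputs: one 200 MB file versus twenty 10 MB files.
one_big_file = [200 * MB]
many_small_files = [10 * MB] * 20

print(estimate_map_tasks(one_big_file, 64 * MB))      # 4
print(estimate_map_tasks(many_small_files, 64 * MB))  # 20
```

Note that the heap available to each map task is a separate JVM setting
(mapred.child.java.opts in Hadoop 1.x), so more tasks does not by itself
mean more heap per task.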



On 28.03.2013 10:20, Ted Dunning wrote:
> This is a longstanding Hadoop issue.
>
> Your suggestion is interesting, but only a few cases would benefit.  The
> problem is that splitting involves reading from a very small number of
> nodes and thus is not much better than just running the program with few
> mappers.  If the data is large enough to make splitting fast, then Hadoop
> will just do it.
>
> The only win for splitting is when the cost per chunk is very high.  I
> think that only random forest might fit into that category.
>
> On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
> [email protected]> wrote:
>
>> Splitting the files leads to multiple MR-tasks!
>>
>> Only changing the MR settings of Hadoop did not help. In the future it
>> would be nice if the drivers would scale themselves and would split the
>> data according to the dataset size and the number of available MR-slots.
>>
