This is a longstanding Hadoop issue.

Your suggestion is interesting, but only a few cases would benefit.  The
problem is that the split data is still read from a very small number of
nodes, so splitting is not much better than just running the program with
a few mappers.  If the data is large enough that splitting would be fast,
Hadoop will already do it on its own.
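
A rough sketch of the split arithmetic behind that last point, assuming
the formula used by Hadoop's FileInputFormat.computeSplitSize, which is
max(minSize, min(maxSize, blockSize)).  The file sizes, the 64 MB block
size, and the function names are illustrative assumptions, not values from
this thread, and the sketch ignores the slop factor Hadoop applies to the
final split.

```python
import math

MB = 1024 * 1024

def split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Same formula as FileInputFormat.computeSplitSize:
    # max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))

def estimated_mappers(file_size, block_size, min_size=1, max_size=2**63 - 1):
    # Approximate number of input splits (one mapper per split).
    # Hypothetical helper; ignores the slop on the last split.
    return math.ceil(file_size / split_size(block_size, min_size, max_size))

# Illustrative sizes (assumptions):
small = estimated_mappers(20 * MB, 64 * MB)                     # fits in one block
large = estimated_mappers(10 * 1024 * MB, 64 * MB)              # spans many blocks
forced = estimated_mappers(20 * MB, 64 * MB, max_size=2 * MB)   # capped split size

print(small, large, forced)
```

Capping the split size does give the small file more mappers, but they
all read the same block from the few nodes that hold it, which is the
bottleneck described above.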

Splitting only wins when the processing cost per chunk is very high.  I
think random forest may be the only algorithm that fits that category.

On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
[email protected]> wrote:

> Splitting the files leads to multiple MR-tasks!
>
> Only changing the MR settings of Hadoop did not help. In the future it
> would be nice if the drivers would scale themselves and would split the
> data according to the dataset size and the number of available MR-slots.
>
