In my case, each map process requires a lot of memory, and I would like to distribute this consumption across multiple nodes.
However, I still get out-of-memory exceptions even if I split the input file into several very small input files. I thought the mapper would consider only one file at a time and would hence have no problems with heap space?

On 28.03.2013 10:20, Ted Dunning wrote:
> This is a longstanding Hadoop issue.
>
> Your suggestion is interesting, but only a few cases would benefit. The
> problem is that splitting involves reading from a very small number of
> nodes and thus is not much better than just running the program with few
> mappers. If the data is large enough to make splitting fast, then Hadoop
> will just do it.
>
> The only win for splitting is when the cost per chunk is very high. I
> think that only random forest might fit into that category.
>
> On Thu, Mar 28, 2013 at 10:04 AM, Sebastian Briesemeister <
> [email protected]> wrote:
>
>> Splitting the files leads to multiple MR-tasks!
>>
>> Only changing the MR settings of hadoop did not help. In the future it
>> would be nice if the drivers would scale themselves and would split the
>> data according to the dataset size and the number of available MR-slots.
>>
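
For reference, here is a rough sketch of how the number of map tasks follows from file and split sizes. It assumes a job that uses a FileInputFormat-style input with splittable files, where the split size is max(minSize, min(maxSize, blockSize)). Whether a given driver actually reads its input this way is an assumption. The function names are hypothetical, and the estimate ignores Hadoop's slack factor for the last split of a file.

```python
import math

MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Split-size rule assumed here: max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))

def estimate_map_tasks(file_sizes, block_size, min_size=1, max_size=2**63 - 1):
    # Hypothetical helper: each file is split on its own, so every file
    # contributes at least one split (and therefore one map task).
    split_size = compute_split_size(block_size, min_size, max_size)
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)

# One 640 MB file with 64 MB blocks.
single_file = estimate_map_tasks([640 * MB], block_size=64 * MB)

# The same data pre-split into 20 files of 32 MB each.
many_files = estimate_map_tasks([32 * MB] * 20, block_size=64 * MB)

# The single file again, with the maximum split size capped at 16 MB.
capped = estimate_map_tasks([640 * MB], block_size=64 * MB, max_size=16 * MB)

print(single_file, many_files, capped)
```

Note that this only estimates how many map tasks are created. The heap available to each task is a separate setting (in Hadoop 1.x, the child JVM options in `mapred.child.java.opts`), so more tasks do not by themselves raise the per-task memory limit.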
