mapred.tasktracker.map.tasks.maximum is read when the TaskTracker starts, so I don't think you can configure it per job. More granular resource control is a job for Mesos or YARN; plain MapReduce doesn't support this sort of thing.
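For reference, the job-side wiring for the multithreaded-mapper approach discussed below would look roughly like this. This is only a sketch against the Hadoop 1.x "new" mapreduce API; CVBTopicMapper is a placeholder name (sketched at the bottom of this mail), not the actual Mahout class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedCVBDriverSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JVM reuse is scoped to a single job; -1 means "no limit on tasks per JVM"
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = new Job(conf, "cvb0-multithreaded-sketch");
        job.setJarByClass(MultithreadedCVBDriverSketch.class);

        // Run one multithreaded map task with N worker threads instead of
        // N single-threaded map tasks, so the term-topic table is held once.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, CVBTopicMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        // input/output formats, paths and key/value classes omitted here
        job.waitForCompletion(true);
      }
    }

Note that the thread-pool size is per map task; how many map tasks run concurrently on a node is still capped by the TaskTracker's mapred.tasktracker.map.tasks.maximum, which is why that setting matters at all.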
I think for our deployment we'll carve out a chunk of the cluster, sized and configured exclusively to do ML 24/7. If we don't need that much capacity, it might be better to spin up an Elastic Map Reduce cluster for a few hours every day.

On Thu, Jun 13, 2013 at 4:20 PM, Sebastian Schelter <[email protected]> wrote:

> On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
> > On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <[email protected]> wrote:
> >
> >> This table is readonly, right? We could try to apply the trick from our
> >> ALS code: Instead of running one mapper per core (and thus having one
> >> copy of the table per core), run a multithreaded mapper and share the
> >> table between its threads. Works very well for ALS.
> >
> > Just out of my ignorance, how will you tell MR that your mapper is using
> > more than 1 core and that it doesn't have to run more than 1 mapepr of
> > that time per box?
>
> You need to use a MultithreadedMapper for which you can set the size of
> the thread pool via MultithreadedMapper.setNumberOfThreads(...)
>
> You can configure the maximum number of mappers to run per task tracker
> with -Dmapred.tasktracker.map.tasks.maximum=x
>
> >> We can also cache
> >> the table in a static variable and make Hadoop reuse JVMs, which
> >> increases performance if the number of blocks to process is larger than
> >> the number of map slots.
> >
> > This usually (or might be) something the admin doesn't let us override.
> > Also i am not sure if jvm reuse in hadoop is isolated between different
> > jobs (so other tasks may inherit that stuff they probably don't want)
>
> The jvm is only reused during a single job.
>
> >> -sebastian
> >>
> >> On 13.06.2013 21:56, Ted Dunning wrote:
> >>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
> >>>
> >>>> Andy, note that he said he's running with a 1.6M-term dictionary. That's
> >>>> going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices.
> >>>> Still not hitting 8GB, but getting closer.
> >>>
> >>> It will likely be even worse unless this table is shared between mappers.
> >>> With 8 mappers per node, this goes to 41GB. The OP didn't mention machine
> >>> configuration, but this could easily cause swapping.

--
Alan Gardner
Solutions Architect - CTO Office
[email protected] | LinkedIn: http://www.linkedin.com/profile/view?id=65508699 | @alanctgardner <https://twitter.com/alanctgardner>
Tel: +1 613 565 8696 x1218
Mobile: +1 613 897 5655
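For completeness, the mapper side of the shared-table trick Sebastian describes would look roughly like this. Again only a sketch: CVBTopicMapper and loadTermTopicTable are made-up stand-ins, not the actual Mahout CVB0 code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.VectorWritable;

    public class CVBTopicMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      // One copy per JVM: shared by all threads of a MultithreadedMapper task,
      // and by later tasks of the same job when JVM reuse is enabled.
      private static volatile double[][] termTopicCounts;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        synchronized (CVBTopicMapper.class) {
          if (termTopicCounts == null) {
            termTopicCounts = loadTermTopicTable(context.getConfiguration());
          }
        }
      }

      private static double[][] loadTermTopicTable(Configuration conf) {
        // Placeholder: in reality the term-topic matrix would be read from
        // HDFS or the distributed cache.
        return new double[0][0];
      }

      @Override
      protected void map(IntWritable docId, VectorWritable doc, Context context)
          throws IOException, InterruptedException {
        // Per-document inference against the shared termTopicCounts would go here.
      }
    }

MultithreadedMapper instantiates one mapper object per thread and calls setup() on each, so the lazy load has to be synchronized; the table itself is read-only, so the threads can then use it without locking.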
