I'd have to recheck with my colleague, but I'm pretty sure it worked when we tried it. What would prevent the jobtracker from scheduling only one map task per machine/tasktracker for a specific job?
-sebastian

On 13.06.2013 22:31, Alan Gardner wrote:
> mapred.tasktracker.map.tasks.maximum is loaded at tasktracker load time, I
> don't think you can configure it per job. More granular resource control is
> a job for Mesos or YARN, MR doesn't support this sort of thing.
>
> I think for our deployment we'll carve out a chunk of the cluster, sized
> and configured exclusively to do ML 24/7. If we don't need that much
> capacity, it might be better to spin up an Elastic MapReduce cluster for a
> few hours every day.
>
> On Thu, Jun 13, 2013 at 4:20 PM, Sebastian Schelter <[email protected]> wrote:
>
>> On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
>>> On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> This table is read-only, right? We could try to apply the trick from our
>>>> ALS code: instead of running one mapper per core (and thus having one
>>>> copy of the table per core), run a multithreaded mapper and share the
>>>> table between its threads. Works very well for ALS.
>>>
>>> Just out of my ignorance, how will you tell MR that your mapper is using
>>> more than one core and that it doesn't have to run more than one mapper
>>> at a time per box?
>>
>> You need to use a MultithreadedMapper, for which you can set the size of
>> the thread pool via MultithreadedMapper.setNumberOfThreads(...)
>>
>> You can configure the maximum number of mappers to run per tasktracker
>> with -Dmapred.tasktracker.map.tasks.maximum=x
>>
>>>> We can also cache the table in a static variable and make Hadoop reuse
>>>> JVMs, which increases performance if the number of blocks to process is
>>>> larger than the number of map slots.
>>>
>>> This is usually (or might be) something the admin doesn't let us
>>> override. Also I am not sure if JVM reuse in Hadoop is isolated between
>>> different jobs (so other tasks may inherit cached state they probably
>>> don't want).
>>
>> The JVM is only reused during a single job.
>>
>>>> -sebastian
>>>>
>>>> On 13.06.2013 21:56, Ted Dunning wrote:
>>>>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
>>>>>
>>>>>> Andy, note that he said he's running with a 1.6M-term dictionary.
>>>>>> That's going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic
>>>>>> matrices. Still not hitting 8GB, but getting closer.
>>>>>
>>>>> It will likely be even worse unless this table is shared between
>>>>> mappers. With 8 mappers per node, this goes to 41GB. The OP didn't
>>>>> mention machine configuration, but this could easily cause swapping.
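
For reference, a minimal sketch of the MultithreadedMapper setup discussed in the thread (Hadoop 1.x mapreduce API; the job and mapper names are hypothetical, and the table-loading logic is left out):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedTableJob {

  // Hypothetical mapper. Note that MultithreadedMapper creates one mapper
  // instance per thread, so a table to be shared between threads should
  // live in a static field (see the caching sketch below) and map() must
  // be thread-safe.
  public static class TableMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    // setup() loads (or looks up) the shared table, map() reads it.
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "shared-table-job");
    job.setJarByClass(SharedTableJob.class);

    // The job runs MultithreadedMapper, which fans each input split out
    // to a pool of threads that each run the actual mapper.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, TableMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);

    // To pair this with one fat mapper per machine, the tasktrackers
    // would need mapred.tasktracker.map.tasks.maximum=1, which (as Alan
    // notes) is read at tasktracker start time, not per job.

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}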

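And a sketch of the static-variable caching trick, assuming JVM reuse is enabled for the job via -Dmapred.job.reuse.jvm.num.tasks=-1 (the loadTable() helper is hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachingMapper extends Mapper<LongWritable, Text, Text, Text> {

  // With JVM reuse enabled, consecutive tasks of this job on the same
  // tasktracker run in one JVM, so this static field is populated once
  // per JVM rather than once per task. As noted above, reuse never
  // crosses job boundaries, so other jobs cannot inherit it.
  private static volatile double[][] termTopicTable;

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    synchronized (CachingMapper.class) {
      if (termTopicTable == null) {
        termTopicTable = loadTable(context);
      }
    }
  }

  // Hypothetical loader, e.g. reading the table from the distributed cache.
  private static double[][] loadTable(Context context) throws IOException {
    return new double[0][0]; // placeholder
  }
}

The synchronized block also covers the MultithreadedMapper case above, where several per-thread mapper instances may enter setup() at the same time.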