Yes, but would the tasktracker actually check it? I am quite dubious about that. The only resource manager I know of that does these tricks is Mesos, and then the framework (such as Spark) has to support it too. With Spark you can indeed do these things on a per-session basis (so-called "coarse-grained" vs. "fine-grained" scheduling). And then this probably needs to be integrated properly with the broadcasting mechanism.
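For concreteness, here is a minimal sketch of how the coarse-grained mode is toggled per session, assuming the 0.7-era Spark Java API and property names (master URL, paths and jar name are made up for illustration):

    import spark.api.java.JavaSparkContext;

    public class CoarseGrainedSketch {
      public static void main(String[] args) {
        // Coarse-grained: the session grabs a fixed set of cores up front
        // and holds them for its lifetime, instead of launching one
        // short-lived Mesos task per Spark task (fine-grained, the
        // default on Mesos).
        System.setProperty("spark.mesos.coarse", "true");
        System.setProperty("spark.cores.max", "16"); // cap what the session holds
        JavaSparkContext sc = new JavaSparkContext(
            "mesos://master:5050", "lda", "/opt/spark",
            new String[] {"job.jar"});
        // ... then sc.broadcast(...) would ship the term-topic table
        // once per node rather than once per task.
      }
    }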
On Thu, Jun 13, 2013 at 1:41 PM, Suneel Marthi <[email protected]> wrote:

> You should be able to programmatically override the setting for
> mapred.tasktracker.map.tasks.maximum if it's not marked as 'final' in
> your Hadoop setup. Check your mapred-site.xml to verify that.
>
> In my env it's marked as final, so I don't have the luxury of
> overriding it.
>
> ________________________________
> From: Sebastian Schelter <[email protected]>
> To: [email protected]
> Sent: Thursday, June 13, 2013 4:36 PM
> Subject: Re: LDA/CVB Performance
>
> I'd have to recheck with my colleague, but I'm pretty sure it worked
> when we tried it. What should prevent the jobtracker from scheduling
> only one map task per machine/tasktracker for a specific job?
>
> -sebastian
>
> On 13.06.2013 22:31, Alan Gardner wrote:
> > mapred.tasktracker.map.tasks.maximum is loaded at tasktracker load
> > time; I don't think you can configure it per job. More granular
> > resource control is a job for Mesos or YARN; MR doesn't support this
> > sort of thing.
> >
> > I think for our deployment we'll carve out a chunk of the cluster,
> > sized and configured exclusively to do ML 24/7. If we don't need that
> > much capacity, it might be better to spin up an Elastic MapReduce
> > cluster for a few hours every day.
> >
> > On Thu, Jun 13, 2013 at 4:20 PM, Sebastian Schelter <[email protected]> wrote:
> >
> >> On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
> >>> On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <[email protected]> wrote:
> >>>
> >>>> This table is read-only, right? We could try to apply the trick
> >>>> from our ALS code: instead of running one mapper per core (and thus
> >>>> having one copy of the table per core), run a multithreaded mapper
> >>>> and share the table between its threads. Works very well for ALS.
> >>>
> >>> Just out of my ignorance, how will you tell MR that your mapper is
> >>> using more than one core and that it doesn't have to run more than
> >>> one mapper of that type per box?
> >>
> >> You need to use a MultithreadedMapper, for which you can set the size
> >> of the thread pool via MultithreadedMapper.setNumberOfThreads(...).
> >>
> >> You can configure the maximum number of mappers to run per
> >> tasktracker with -Dmapred.tasktracker.map.tasks.maximum=x
> >>
> >>>> We can also cache the table in a static variable and make Hadoop
> >>>> reuse JVMs, which increases performance if the number of blocks to
> >>>> process is larger than the number of map slots.
> >>>
> >>> This is usually (or might be) something the admin doesn't let us
> >>> override. Also, I am not sure JVM reuse in Hadoop is isolated
> >>> between different jobs (so other tasks may inherit stuff they
> >>> probably don't want).
> >>
> >> The JVM is only reused during a single job.
> >>
> >>>> -sebastian
> >>>>
> >>>> On 13.06.2013 21:56, Ted Dunning wrote:
> >>>>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
> >>>>>
> >>>>>> Andy, note that he said he's running with a 1.6M-term dictionary.
> >>>>>> That's going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the
> >>>>>> term-topic matrices. Still not hitting 8GB, but getting closer.
> >>>>>
> >>>>> It will likely be even worse unless this table is shared between
> >>>>> mappers. With 8 mappers per node, this goes to 41GB. The OP didn't
> >>>>> mention machine configuration, but this could easily cause
> >>>>> swapping.
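For anyone wanting to try the MultithreadedMapper route Sebastian describes above, a minimal driver sketch follows; the job name and CVB0DocMapper are placeholders, not the actual Mahout classes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class CVBDriverSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for a single map slot per tasktracker; whether this is
        // honored per job, or pinned as 'final' in mapred-site.xml, is
        // exactly what is being debated above.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);

        Job job = new Job(conf, "cvb-iteration");
        job.setJarByClass(CVBDriverSketch.class);
        // The real work happens in CVB0DocMapper; MultithreadedMapper
        // runs it in a thread pool so all threads on a node can share
        // one copy of the term-topic table.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, CVB0DocMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // roughly one per core
        // ... input/output paths, formats and key/value classes as usual ...
        job.waitForCompletion(true);
      }
    }

Note that the inner mapper has to be thread-safe, since one instance runs per thread against shared state.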
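And a sketch of the static-variable caching Sebastian mentions, which pays off when combined with JVM reuse within a job (-Dmapred.job.reuse.jvm.num.tasks=-1); loadTable() and the key/value types here are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.VectorWritable;

    public class CVB0DocMapper
        extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {

      // Loaded once per JVM. With JVM reuse enabled, later tasks of the
      // same job find the table already in place; threads of a
      // MultithreadedMapper share it as well.
      private static double[][] topicTermCounts;

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        synchronized (CVB0DocMapper.class) {
          if (topicTermCounts == null) {
            topicTermCounts = loadTable(context.getConfiguration());
          }
        }
      }

      private static double[][] loadTable(Configuration conf)
          throws IOException {
        // placeholder: read the current model from the distributed
        // cache / HDFS here
        return new double[0][0];
      }
    }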
