I'd have to recheck with my colleague, but I'm pretty sure it worked
when we tried it. What would prevent the jobtracker from scheduling
only one map task per machine/tasktracker for a specific job?


-sebastian

On 13.06.2013 22:31, Alan Gardner wrote:
> mapred.tasktracker.map.tasks.maximum is read at tasktracker startup; I
> don't think you can configure it per job. More granular resource control
> is a job for Mesos or YARN; MR doesn't support this sort of thing.
> 
> I think for our deployment we'll carve out a chunk of the cluster, sized
> and configured exclusively to do ML 24/7. If we don't need that much
> capacity, it might be better to spin up an Elastic MapReduce cluster for a
> few hours every day.
> 
> 
> On Thu, Jun 13, 2013 at 4:20 PM, Sebastian Schelter <[email protected]> wrote:
> 
>> On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
>>> On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <[email protected]>
>> wrote:
>>>
>>>> This table is read-only, right? We could try to apply the trick from our
>>>> ALS code: Instead of running one mapper per core (and thus having one
>>>> copy of the table per core), run a multithreaded mapper and share the
>>>> table between its threads. Works very well for ALS.
>>>
>>>
>>> Just out of my ignorance: how will you tell MR that your mapper is using
>>> more than one core, and that it doesn't have to run more than one mapper
>>> at a time per box?
>>
>> You need to use a MultithreadedMapper for which you can set the size of
>> the thread pool via MultithreadedMapper.setNumberOfThreads(...)
>>
>> You can configure the maximum number of mappers to run per task tracker
>> with -Dmapred.tasktracker.map.tasks.maximum=x
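
A minimal driver-side sketch of that setup. VectorMapper is a
hypothetical mapper class standing in for the real one, and the
input/output configuration is omitted. Note that MultithreadedMapper
creates one instance of the inner mapper per thread, so the shared
table still has to live somewhere all threads can reach it, e.g. in a
static field (see the caching sketch further down):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "one-table-per-node");

        // the outer mapper is MultithreadedMapper; it runs the real
        // mapper (hypothetical VectorMapper) in a pool of threads, so
        // one map task (and one copy of the table) serves a whole node
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, VectorMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // ~ one per core

        // input/output paths, formats and key/value classes omitted
        job.waitForCompletion(true);
      }
    }

Combined with mapred.tasktracker.map.tasks.maximum=1, this gives one
multithreaded map task, and thus one copy of the table, per tasktracker.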
>>
>>>
>>>
>>>> We can also cache
>>>> the table in a static variable and make Hadoop reuse JVMs, which
>>>> increases performance if the number of blocks to process is larger than
>>>> the number of map slots.
>>>>
>>>
>>> This is usually (or might be) something the admin doesn't let us
>>> override. Also, I am not sure whether JVM reuse in Hadoop is isolated
>>> between different jobs (so other tasks may inherit state they probably
>>> don't want).
>>
>> The JVM is only reused within a single job.
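
A minimal sketch of the static-cache pattern under that assumption.
loadTable and the double[][] layout are placeholders; on Hadoop 1.x,
JVM reuse within a job is enabled with -Dmapred.job.reuse.jvm.num.tasks=-1:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CachingMapper extends Mapper<LongWritable, Text, Text, Text> {

      // one copy per JVM, shared by all tasks that reuse this JVM
      // (and by all threads when run inside a MultithreadedMapper)
      private static double[][] table;

      @Override
      protected void setup(Context ctx) throws IOException {
        synchronized (CachingMapper.class) {
          if (table == null) {
            table = loadTable(ctx.getConfiguration());
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // ... look up rows of 'table' while processing the input split
      }

      // placeholder: read the read-only table, e.g. from the
      // DistributedCache; an empty table keeps this sketch compilable
      private static double[][] loadTable(Configuration conf)
          throws IOException {
        return new double[0][];
      }
    }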
>>
>>>>
>>>> -sebastian
>>>>
>>>> On 13.06.2013 21:56, Ted Dunning wrote:
>>>>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]>
>>>> wrote:
>>>>>
>>>>>> Andy, note that he said he's running with a 1.6M-term dictionary.
>>>>>> That's going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the
>>>>>> term-topic matrices. Still not hitting 8GB, but getting closer.
>>>>>>
>>>>>
>>>>> It will likely be even worse unless this table is shared between
>>>>> mappers. With 8 mappers per node, this goes to 41GB. The OP didn't
>>>>> mention machine configuration, but this could easily cause swapping.
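
Spelling out Ted's estimate (the per-mapper figures are from the
thread; 8 mappers per node is his assumption):

    2 matrices * 200 topics * 1.6M terms * 8 bytes/double
      = 5,120,000,000 bytes ~ 5.1 GB per mapper
    5.1 GB * 8 mappers/node ~ 41 GB per node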
>>>>>
>>>>
>>>>
>>>
>>
>>
> 
> 
