On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
> On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <[email protected]> wrote:
> 
>> This table is read-only, right? We could try to apply the trick from our
>> ALS code: Instead of running one mapper per core (and thus having one
>> copy of the table per core), run a multithreaded mapper and share the
>> table between its threads. Works very well for ALS.
> 
> 
> Just out of my ignorance, how will you tell MR that your mapper is using
> more than one core and that it doesn't have to run more than one mapper
> at a time per box?

You need to use a MultithreadedMapper for which you can set the size of
the thread pool via MultithreadedMapper.setNumberOfThreads(...)

You can configure the maximum number of mappers to run per task tracker
with -Dmapred.tasktracker.map.tasks.maximum=x
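
A minimal sketch of the driver setup (untested, new mapreduce API;
YourMapper stands in for whatever your actual mapper class is):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

  Configuration conf = new Configuration();
  Job job = new Job(conf, "shared-table job");
  // the job's mapper is the multithreaded wrapper ...
  job.setMapperClass(MultithreadedMapper.class);
  // ... which runs the real mapper in a pool of threads
  MultithreadedMapper.setMapperClass(job, YourMapper.class);
  MultithreadedMapper.setNumberOfThreads(job, 8); // e.g. one per core

Each thread gets its own YourMapper instance, and input/output access is
synchronized internally, so the only state the threads actually share is
whatever you put in a static field (the table, in our case).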

> 
> 
>> We can also cache
>> the table in a static variable and make Hadoop reuse JVMs, which
>> increases performance if the number of blocks to process is larger than
>> the number of map slots.
>>
> 
> This is usually (or at least might be) something the admin doesn't let
> us override. Also, I am not sure whether JVM reuse in Hadoop is isolated
> between different jobs (other tasks may inherit state they probably
> don't want).

The JVM is only reused within a single job.
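
To illustrate the static-variable trick (just a sketch; Matrix is
Mahout's, and loadTable() is a made-up placeholder for however the job
deserializes the table):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.mahout.math.Matrix;

  public class YourMapper extends Mapper<LongWritable, Text, Text, Text> {

    // loaded once per JVM; survives across tasks of the same job
    // when reuse is enabled via -Dmapred.job.reuse.jvm.num.tasks=-1
    private static Matrix termTopicTable;

    @Override
    protected void setup(Context ctx) throws IOException {
      synchronized (YourMapper.class) {
        if (termTopicTable == null) {
          termTopicTable = loadTable(ctx.getConfiguration());
        }
      }
    }

    private static Matrix loadTable(Configuration conf) throws IOException {
      // placeholder: read the serialized term-topic matrix from HDFS
      // or the distributed cache, depending on the job
      throw new UnsupportedOperationException("table loading goes here");
    }

    // map(...) reads termTopicTable but never writes to it
  }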

>>
>> -sebastian
>>
>> On 13.06.2013 21:56, Ted Dunning wrote:
>>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
>>>
>>>> Andy, note that he said he's running with a 1.6M-term dictionary.
>>>> That's going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the
>>>> term-topic matrices. Still not hitting 8GB, but getting closer.
>>>>
>>>
>>> It will likely be even worse unless this table is shared between
>>> mappers. With 8 mappers per node, this goes to 41GB. The OP didn't
>>> mention machine configuration, but this could easily cause swapping.
>>>
>>
>>
> 
