This table is read-only, right? We could try to apply the trick from our
ALS code: instead of running one mapper per core (and thus having one
copy of the table per core), run a multithreaded mapper and share the
table between its threads. That works very well for ALS. We can also cache
the table in a static variable and make Hadoop reuse JVMs, which improves
performance whenever the number of blocks to process is larger than the
number of map slots.
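
Roughly what I have in mind, as a sketch only (the key/value types and the
loadTermTopicMatrix() helper are placeholders, not actual Mahout code): keep
the model in a static field so that every map task running in the same JVM,
or every thread of a MultithreadedMapper, reuses the single copy.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SharedTableMapper extends Mapper<IntWritable, Text, IntWritable, Text> {

  // One copy per JVM, shared by every task (with JVM reuse) or every
  // MultithreadedMapper thread that runs in this JVM.
  private static volatile double[][] termTopicMatrix;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Double-checked locking: only the first task/thread pays the loading cost.
    if (termTopicMatrix == null) {
      synchronized (SharedTableMapper.class) {
        if (termTopicMatrix == null) {
          termTopicMatrix = loadTermTopicMatrix(context.getConfiguration());
        }
      }
    }
  }

  @Override
  protected void map(IntWritable docId, Text doc, Context context)
      throws IOException, InterruptedException {
    // ... read from termTopicMatrix here, never write to it ...
  }

  // Placeholder for however the model actually gets loaded (e.g. from the
  // distributed cache); returns an empty matrix so the sketch compiles.
  private static double[][] loadTermTopicMatrix(Configuration conf) throws IOException {
    return new double[0][0];
  }
}

The JVM-reuse variant would be combined with something like
mapred.job.reuse.jvm.num.tasks=-1; the threaded variant gets wired up via
MultithreadedMapper.setMapperClass() and MultithreadedMapper.setNumberOfThreads().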

-sebastian

On 13.06.2013 21:56, Ted Dunning wrote:
> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
> 
>> Andy, note that he said he's running with a 1.6M-term dictionary.  That's
>> going
>> to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices. Still
>> not hitting
>> 8GB, but getting closer.
>>
> 
> It will likely be even worse unless this table is shared between mappers.
>  With 8 mappers per node, this goes to 41GB.  The OP didn't mention machine
> configuration, but this could easily cause swapping.
> 
