This table is read-only, right? We could try to apply the trick from our ALS code: instead of running one mapper per core (and thus having one copy of the table per core), run a single multithreaded mapper and share the table between its threads. That works very well for ALS. We could also cache the table in a static variable and make Hadoop reuse JVMs, which improves performance when the number of blocks to process is larger than the number of map slots.
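Roughly, a minimal sketch of what I mean (not the actual CVB code; the mapper name and the loadTermTopicMatrix() helper are made up here, only the MultithreadedMapper API and the JVM-reuse property are real Hadoop):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class SharedTableExample {

  public static class TopicTableMapper
      extends Mapper<IntWritable, Text, IntWritable, Text> {

    // one copy per JVM, shared by all mapper threads and (with JVM reuse)
    // by all map tasks that subsequently run in this JVM
    private static volatile double[][] termTopicMatrix;

    @Override
    protected void setup(Context context) throws IOException {
      if (termTopicMatrix == null) {
        synchronized (TopicTableMapper.class) {
          if (termTopicMatrix == null) {
            // hypothetical loader; in practice this would read the model
            // from the distributed cache or from HDFS
            termTopicMatrix = loadTermTopicMatrix(context.getConfiguration());
          }
        }
      }
    }

    @Override
    protected void map(IntWritable key, Text value, Context context) {
      // read-only access to termTopicMatrix is safe from multiple threads
    }

    private static double[][] loadTermTopicMatrix(Configuration conf) {
      return new double[0][0]; // placeholder
    }
  }

  public static Job createJob(Configuration conf) throws IOException {
    // reuse JVMs so the static table survives across tasks (MRv1 property)
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    Job job = new Job(conf, "shared term-topic table");
    // one multithreaded mapper per slot instead of one mapper per core,
    // so the threads share a single copy of the table
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, TopicTableMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);
    return job;
  }
}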
-sebastian

On 13.06.2013 21:56, Ted Dunning wrote:
> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
>
>> Andy, note that he said he's running with a 1.6M-term dictionary. That's
>> going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices.
>> Still not hitting 8GB, but getting closer.
>>
>
> It will likely be even worse unless this table is shared between mappers.
> With 8 mappers per node, this goes to 41GB. The OP didn't mention machine
> configuration, but this could easily cause swapping.
>
