On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
> On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <[email protected]> wrote:
>
>> This table is read-only, right? We could try to apply the trick from our
>> ALS code: Instead of running one mapper per core (and thus having one
>> copy of the table per core), run a multithreaded mapper and share the
>> table between its threads. Works very well for ALS.
>
> Just out of my ignorance, how will you tell MR that your mapper is using
> more than 1 core and that it doesn't have to run more than 1 mapper at a
> time per box?
You need to use a MultithreadedMapper, for which you can set the size of
the thread pool via MultithreadedMapper.setNumberOfThreads(...). You can
configure the maximum number of mappers to run per tasktracker with
-Dmapred.tasktracker.map.tasks.maximum=x. A minimal sketch is at the end
of this message.

>> We can also cache
>> the table in a static variable and make Hadoop reuse JVMs, which
>> increases performance if the number of blocks to process is larger than
>> the number of map slots.
>
> This usually is (or might be) something the admin doesn't let us override.
> Also I am not sure if JVM reuse in Hadoop is isolated between different
> jobs (so other tasks may inherit that stuff they probably don't want).

The JVM is only reused during a single job.

>>
>> -sebastian
>>
>> On 13.06.2013 21:56, Ted Dunning wrote:
>>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
>>>
>>>> Andy, note that he said he's running with a 1.6M-term dictionary.
>>>> That's going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic
>>>> matrices. Still not hitting 8GB, but getting closer.
>>>
>>> It will likely be even worse unless this table is shared between mappers.
>>> With 8 mappers per node, this goes to 41GB. The OP didn't mention machine
>>> configuration, but this could easily cause swapping.
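For concreteness, here is a minimal sketch of the setup described above,
combining MultithreadedMapper with the static-variable trick. The class
names, the loadTable() helper and the key/value types are made up for
illustration; this is not code from Mahout:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SharedTableJob {

      public static class SharedTableMapper
          extends Mapper<LongWritable, Text, Text, Text> {

        // One copy per JVM: MultithreadedMapper runs one mapper instance
        // per thread, but all threads see this single static table.
        private static volatile float[][] table;

        @Override
        protected void setup(Context ctx) throws IOException {
          // Double-checked locking so only the first thread loads the table.
          if (table == null) {
            synchronized (SharedTableMapper.class) {
              if (table == null) {
                table = loadTable(ctx.getConfiguration());
              }
            }
          }
        }

        // Hypothetical loader, e.g. reading the term-topic matrix from
        // the distributed cache.
        private static float[][] loadTable(Configuration conf)
            throws IOException {
          return new float[0][];
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          // ... read-only lookups into 'table' go here ...
          ctx.write(new Text("lookup-result"), value);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "shared-table-example");
        job.setJarByClass(SharedTableJob.class);

        // The job's mapper is the multithreaded wrapper; the real work
        // happens in SharedTableMapper, run by a pool of threads.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, SharedTableMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // e.g. one per core

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Since MultithreadedMapper creates one instance of the wrapped mapper per
thread, setup() runs once per thread, which is why the double-checked
locking is needed to load the table only once per JVM.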
