On Thu, Jun 13, 2013 at 9:43 AM, Andy Schlaikjer <[email protected]> wrote:
> Hi Alan,
>
> On Thu, Jun 13, 2013 at 8:54 AM, Alan Gardner <[email protected]> wrote:
>
> > The weirdest behaviour I'm seeing is that the multithreaded training Map
> > task only utilizes one core on an eight core node. I'm not sure if this is
> > configurable in the JVM parameters or the job config. In the meantime I've
> > set the input split very small, so that I can run 8 parallel 1-thread
> > training mappers per node. Should I be configuring this differently?
>
> At my office it's generally frowned upon to run MR tasks which attempt to
> make use of lots of cores on a multicore system, due to cluster
> configuration which forces number of map / reduce slots to sum to num
> cores. If multiple multi-threaded task attempts run on the same node, CPU
> load may spike and negatively affect performance of all task attempts on
> the node.
>
> > I also wanted to check in and verify that the performance I'm seeing is
> > typical:
> >
> > - on a six-node cluster (48 map slots, 8 cores per node) running full
> > tilt, each iteration takes about 7 hours. I assume the problem is just
> > that our cluster is far too small, and that the performance will scale
> > if I make the splits even smaller and distribute the job across more
> > nodes.
>
> How many input splits are generated for your input doc-term matrix? In
> each task attempt, how many rows are processed? Make sure input is
> balanced across all map tasks.
>
> > - with an 8GB heap size I can't exceed about 200 topics before running
> > out of heap space. I tried making the Map input smaller, but that didn't
> > seem to help. Can someone describe how memory usage scales per mapper in
> > terms of topics, documents and terms?
>
> The tasks need memory proportional to num topics x num terms. Do you have
> a full 8 GB heap for each task slot?

Andy, note that he said he's running with a 1.6M-term dictionary. That's
going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices.
Still not hitting 8GB, but getting closer.

Do you really need 1.6M terms? With only 500k documents, you're probably
using a lot of terms which only occur 1-3 times throughout the corpus. If
you take terms which occur at least 5 times, you'll probably drop your dict
size by an order of magnitude, without much loss of usefulness.

> Cheers,
> Andy
>
> Twitter, Inc.

--
  -jake
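A minimal sketch of the dictionary pruning suggested above (keep only terms
that occur at least 5 times across the corpus), plus the heap arithmetic for
the term-topic matrices; the helper names below are illustrative only and are
not part of Mahout's API:

from collections import Counter

def prune_dictionary(corpus, min_count=5):
    """Keep only terms seen at least min_count times across the corpus.

    corpus: iterable of tokenized documents (lists of term strings).
    Returns a {term: dense_id} mapping for the surviving terms.
    Hypothetical helper for illustration, not a Mahout API.
    """
    counts = Counter()
    for doc in corpus:
        counts.update(doc)
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    return {term: term_id for term_id, term in enumerate(kept)}

def term_topic_bytes(num_terms, num_topics):
    # Two copies of a num_terms x num_topics array of 8-byte doubles,
    # i.e. the 2 * num_topics * num_terms * 8B estimate above.
    return 2 * num_terms * num_topics * 8

# term_topic_bytes(1600000, 200) is about 5.1e9 bytes, i.e. roughly 5.1 GB,
# matching the figure quoted in the thread.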
