I'm doing a POC of LDA in Mahout on a dataset of about 500,000 documents with 1.6 million unique terms (document length is highly variable, up to a few thousand unique terms per document).
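
For a rough sense of scale with a vocabulary this size, here is a back-of-envelope sketch. It assumes the model is held in memory as a dense topics x terms matrix of 8-byte doubles; whether Mahout actually stores it that way, and how many copies each mapper keeps, are assumptions on my part and not something I have confirmed. The function name and the topic counts are purely illustrative.

```python
# Back-of-envelope estimate only. Assumes a dense (topics x terms) matrix of
# 8-byte doubles; the real in-memory layout and copy count are not confirmed.

NUM_TERMS = 1_600_000  # unique terms in the dataset described above


def dense_model_bytes(num_topics, num_terms, bytes_per_value=8, copies=1):
    """Bytes needed for `copies` dense topic-term matrices (hypothetical helper)."""
    return num_topics * num_terms * bytes_per_value * copies


if __name__ == "__main__":
    # Illustrative topic counts, not measured configurations.
    for num_topics in (50, 100, 200, 400):
        gib = dense_model_bytes(num_topics, NUM_TERMS) / 2**30
        print(f"{num_topics} topics: {gib:.2f} GiB per dense copy")
```

Under that assumption the estimate grows linearly in topics times terms and does not depend on the number of documents, so shrinking the input split would not be expected to reduce it.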

The weirdest behaviour I'm seeing is that the multithreaded training Map task only utilizes one core on an eight-core node. I'm not sure whether this is configurable in the JVM parameters or the job config. In the meantime I've set the input split very small, so that I can run 8 parallel 1-thread training mappers per node. Should I be configuring this differently?

I also wanted to check in and verify that the performance I'm seeing is typical:

- On a six-node cluster (48 map slots, 8 cores per node) running full tilt, each iteration takes about 7 hours. I assume the problem is just that our cluster is far too small, and that the performance will scale if I make the splits even smaller and distribute the job across more nodes.
- With an 8GB heap size I can't exceed about 200 topics before running out of heap space. I tried making the Map input smaller, but that didn't seem to help. Can someone describe how memory usage scales per mapper in terms of topics, documents and terms?

Thanks

--
Alan Gardner
Solutions Architect - CTO Office
LinkedIn: http://www.linkedin.com/profile/view?id=65508699 | @alanctgardner<https://twitter.com/alanctgardner>
Tel: +1 613 565 8696 x1218
Mobile: +1 613 897 5655
