About the last question: it probably has something to do with maxIter and maxItersPerDoc being set to the same value ... What is the "number of iterations per document" really doing?
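
(As far as I can tell from the parameter names, maxIter counts the global passes over the corpus - one MapReduce job each - while maxItersPerDoc bounds the inner inference loop a mapper runs on each individual document against the current model, but I would appreciate confirmation.)

And on the memory question from the earlier mail below, here is the back-of-the-envelope estimate I have been working from. The assumptions (a dense numTerms x numTopics matrix of doubles, held once for the current model and once for the accumulated update) are mine and not verified against the Mahout source, but they at least suggest the footprint grows linearly with the number of topics, not exponentially:

    // Rough, unverified estimate of the per-mapper topic-model footprint:
    // one dense numTerms x numTopics double matrix for the current model,
    // plus one accumulator of the same shape for the in-progress update.
    public class LdaMemoryEstimate {
        public static void main(String[] args) {
            int numTerms = 60000;        // roughly my vocabulary size
            long bytesPerDouble = 8;
            for (int numTopics : new int[] {20, 100}) {
                long oneCopy = (long) numTerms * numTopics * bytesPerDouble;
                System.out.printf("topics=%3d: one copy ~%d MB, two copies ~%d MB%n",
                        numTopics, oneCopy >> 20, (2 * oneCopy) >> 20);
            }
            // topics= 20: one copy ~9 MB,  two copies ~18 MB
            // topics=100: one copy ~45 MB, two copies ~91 MB
        }
    }

That alone is nowhere near the 1 GB container limit, so there must be more per-mapper state than just these two matrices, but it does show why 100 topics behaves so differently from 20 on the same corpus.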
--David

On Thu, Mar 10, 2016 at 5:39 PM, David Starina <david.star...@gmail.com> wrote:

> There is one more weird thing I cannot understand ...
>
> When running only one iteration of LDA, the iteration took 88 seconds.
> When running 20 iterations with exactly the same code, on the same
> documents and the same parameters, it took 8683 seconds - 434 seconds
> per iteration. Is there something I don't understand about this algorithm?
> Why would a single iteration take that much longer just because more
> iterations are run?
>
> --David
>
> On Thu, Mar 10, 2016 at 2:24 PM, David Starina <david.star...@gmail.com> wrote:
>
>> How does the memory requirement grow with the number of topics? A little
>> experimentation shows me that the number of documents doesn't matter as
>> much as the number of topics ... Does the memory requirement grow
>> exponentially with the number of topics?
>>
>> --David
>>
>> On Thu, Mar 10, 2016 at 11:43 AM, David Starina <david.star...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I realize MapReduce algorithms are not the "hot new stuff" anymore, but
>>> I am playing around with LDA. I am having some memory problems - can you
>>> suggest how to set the parameters to make this work?
>>>
>>> I am running on a virtual cluster on my laptop - two nodes with 3 GB of
>>> memory each - just to prepare before I try this on a physical cluster
>>> with a much larger data set. I am using a data set of 500 documents,
>>> averaging around 120 kB each, with roughly 60,000 terms. Running this
>>> with 20 topics works fine, but with 100 topics I run out of memory (on
>>> the mappers). Can you suggest how to set the parameters so that it runs
>>> more mappers, each consuming less memory?
>>>
>>> The error I get:
>>>
>>> Task Id : attempt_1457214584155_0074_m_000000_1, Status : FAILED
>>> Container [pid=26283,containerID=container_1457214584155_0074_01_000003]
>>> is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB
>>> physical memory used; 1.7 GB of 2.1 GB virtual memory used. Killing
>>> container.
>>>
>>> These are the parameters I set for CVB0Driver:
>>>
>>> static int numTopics = 100;
>>> static double doc_topic_smoothening = 0.5;
>>> static double term_topic_smoothening = 0.5;
>>>
>>> static int maxIter = 3;
>>> static int iteration_block_size = 10;
>>> static double convergenceDelta = 0;
>>> static float testFraction = 0.0f;
>>> static int numTrainThreads = 4;
>>> static int numUpdateThreads = 1;
>>> static int maxItersPerDoc = 3;
>>> static int numReduceTasks = 10;
>>> static boolean backfillPerplexity = false;
>>>
>>> Any suggestions? Should I enlarge the container size on Hadoop, or can I
>>> fix this with LDA parameters?
>>>
>>> Cheers,
>>> David
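
P.S. Regarding the last question in the original mail (enlarging the container size): what I plan to try first is simply raising the mapper container and heap sizes for this job. The property names below are the standard Hadoop 2.x / YARN settings; the values are only guesses for my 3 GB nodes, not recommendations:

    import org.apache.hadoop.conf.Configuration;

    // ... in the driver, before launching the CVB0 job:
    Configuration conf = new Configuration();
    conf.set("mapreduce.map.memory.mb", "2048");       // container size requested from YARN
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");  // mapper JVM heap, kept below the container limit

If the model really needs a few hundred MB per mapper at 100 topics, the default 1 GB container is probably just too small, no matter how the LDA parameters are tuned.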