How does the memory requirement grow with the number of topics? A little experimentation shows me that the number of documents doesn't matter nearly as much as the number of topics does ... Does the memory requirement grow exponentially with the number of topics?
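If I reason about it naively: a back-of-the-envelope sketch of how the model size might scale, assuming each mapper holds a dense numTerms x numTopics matrix of doubles, possibly twice (a read copy and a write copy). Both assumptions are mine, not confirmed from the Mahout source:

// Rough estimate of the topic model a CVB0 mapper might hold in memory,
// ASSUMING a dense numTerms x numTopics matrix of doubles, kept twice
// (read model + write model). These are guesses about Mahout internals.
public class LdaMemoryEstimate {
    public static void main(String[] args) {
        int numTerms = 60_000;          // vocabulary size from my corpus
        int[] topicCounts = {20, 100};  // the two runs I tried
        for (int numTopics : topicCounts) {
            // 8 bytes per double; x2 for the assumed second model copy.
            long bytes = 2L * 8L * numTerms * numTopics;
            System.out.printf("%d topics -> ~%d MB for the model matrices%n",
                    numTopics, bytes / (1024 * 1024));
        }
        // Prints roughly: 20 topics -> ~18 MB, 100 topics -> ~91 MB.
    }
}

Under those assumptions the growth would be linear in numTopics, not exponential, but with a large constant factor (numTerms), which would explain why topics hurt so much more than documents.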
--David

On Thu, Mar 10, 2016 at 11:43 AM, David Starina <david.star...@gmail.com> wrote:

> Hi,
>
> I realize MapReduce algorithms are not the "hot new stuff" anymore, but I
> am playing around with LDA. I have some problems with memory; can you
> suggest how to set the parameters to make this work?
>
> I am running on a virtual cluster on my laptop (two nodes with 3 GB of
> memory each), just to prepare before I try this on a physical cluster with
> a much larger data set. I am using a data set of 500 documents, averaging
> around 120 kB each, with roughly 60,000 terms. Running this with 20 topics
> works fine, but when running with 100 topics I ran out of memory (on the
> mappers). Can you suggest how to set the parameters so that the job runs
> more mappers, each consuming less memory?
>
> The error I get:
>
> Task Id : attempt_1457214584155_0074_m_000000_1, Status : FAILED
> Container [pid=26283,containerID=container_1457214584155_0074_01_000003]
> is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB
> physical memory used; 1.7 GB of 2.1 GB virtual memory used. Killing
> container.
>
> These are the parameters I set for CVB0Driver:
>
> static int numTopics = 100;
> static double doc_topic_smoothening = 0.5;
> static double term_topic_smoothening = 0.5;
> static int maxIter = 3;
> static int iteration_block_size = 10;
> static double convergenceDelta = 0;
> static float testFraction = 0.0f;
> static int numTrainThreads = 4;
> static int numUpdateThreads = 1;
> static int maxItersPerDoc = 3;
> static int numReduceTasks = 10;
> static boolean backfillPerplexity = false;
>
> Any suggestions? Should I enlarge the container size on Hadoop, or can I
> fix this with the LDA parameters?
>
> Cheers,
> David
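For reference, in case enlarging the container turns out to be the way to go: a minimal sketch of the standard Hadoop 2.x (MR2/YARN) memory knobs, set programmatically on the job Configuration before invoking the driver. The property names are stock Hadoop, not Mahout-specific; the concrete values are illustrative guesses for 3 GB nodes, not tuned recommendations.

import org.apache.hadoop.conf.Configuration;

// Sketch: raise the per-mapper container limits before running the job.
// Property names are standard Hadoop 2.x; the values are illustrative.
public class ContainerMemoryConfig {
    public static Configuration buildConf() {
        Configuration conf = new Configuration();
        // Physical memory the YARN container may use
        // (the failing run was killed at the 1024 MB default).
        conf.set("mapreduce.map.memory.mb", "2048");
        // JVM heap inside the container; keep it below the container
        // limit to leave headroom for non-heap memory.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        return conf;
    }
}

The same properties can equally be set cluster-wide in mapred-site.xml or per-run with -D flags on the command line.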