Hi, I realize MapReduce algorithms are not the "hot new stuff" anymore, but I am playing around with LDA. I have run into some memory problems; can you suggest how to set the parameters to make this work?
I am running on a virtual cluster on my laptop - two nodes with 3 GB of memory each - just to prepare before I try this on a physical cluster with a much larger data set. I am using a data set of 500 documents, averaging around 120 kB each, with roughly 60,000 terms. Running with 20 topics works fine, but with 100 topics I run out of memory on the mappers. Can you suggest how to set the parameters so that the job runs more mappers, each consuming less memory?

The error I get:

Task Id : attempt_1457214584155_0074_m_000000_1, Status : FAILED
Container [pid=26283,containerID=container_1457214584155_0074_01_000003] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 1.7 GB of 2.1 GB virtual memory used. Killing container.

These are the parameters I set for CVB0Driver:

static int numTopics = 100;
static double doc_topic_smoothening = 0.5;
static double term_topic_smoothening = 0.5;
static int maxIter = 3;
static int iteration_block_size = 10;
static double convergenceDelta = 0;
static float testFraction = 0.0f;
static int numTrainThreads = 4;
static int numUpdateThreads = 1;
static int maxItersPerDoc = 3;
static int numReduceTasks = 10;
static boolean backfillPerplexity = false;

Any suggestions? Should I enlarge the container size on Hadoop, or can I fix this with the LDA parameters alone?

Cheers,
David
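PS: In case it helps, this is roughly what I was thinking of trying if enlarging the containers turns out to be the right fix. The property names are the standard Hadoop 2.x ones, but the values (2 GB container, 1.6 GB heap) are just a guess on my part, given the 3 GB nodes, and not something I have verified:

import org.apache.hadoop.conf.Configuration;

// Sketch only: enlarge the map container and its JVM heap before launching the job.
// mapreduce.map.memory.mb = YARN container size for each mapper, in MB
// mapreduce.map.java.opts  = heap for the mapper JVM, kept below the container size
Configuration conf = new Configuration();
conf.setInt("mapreduce.map.memory.mb", 2048);
conf.set("mapreduce.map.java.opts", "-Xmx1600m");
// ... then pass this conf on to CVB0Driver as usual

But if I can keep each mapper within the current 1 GB by tuning the LDA parameters above, I would prefer that.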