Hi,

I realize MapReduce algorithms are not the "hot new stuff" anymore, but I
am playing around with LDA. I am having memory problems - could you
suggest how to set the parameters to make this work?

I am running on a virtual cluster on my laptop - two nodes with 3 GB of
memory each - just to prepare before I try this on a physical cluster with
a much larger data set. My data set is 500 documents, averaging around
120 kB each, with roughly 60,000 terms. Running with 20 topics works
fine, but with 100 topics the mappers run out of memory. Can you suggest
how to set the parameters so the job runs more mappers, each consuming
less memory?
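
For example, I wondered whether capping the input split size would fan the
job out over more, lighter mappers. This is the kind of thing I had in
mind - just a sketch using the standard Hadoop 2.x FileInputFormat
property, and I have not verified that it actually lowers CVB0's
per-mapper memory:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Cap each input split at 8 MB so my ~60 MB corpus is spread
// across several map tasks instead of one or two.
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 8L * 1024 * 1024);
// This conf would then be passed into the CVB0Driver job setup.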

The error I get:

Task Id : attempt_1457214584155_0074_m_000000_1, Status : FAILED
Container [pid=26283,containerID=container_1457214584155_0074_01_000003]
is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB
physical memory used; 1.7 GB of 2.1 GB virtual memory used. Killing
container.

These are the parameters I set for CVB0Driver:

static int numTopics = 100;                 // number of latent topics
static double doc_topic_smoothening = 0.5;  // Dirichlet prior on doc-topic distributions (alpha)
static double term_topic_smoothening = 0.5; // Dirichlet prior on term-topic distributions (eta)

static int maxIter = 3;                     // maximum number of iterations
static int iteration_block_size = 10;       // iterations between perplexity checks
static double convergenceDelta = 0;         // 0 = always run to maxIter
static float testFraction = 0.0f;           // fraction of docs held out for perplexity
static int numTrainThreads = 4;             // model-training threads per mapper
static int numUpdateThreads = 1;            // model-update threads per mapper
static int maxItersPerDoc = 3;              // inference iterations per document
static int numReduceTasks = 10;             // number of reducers
static boolean backfillPerplexity = false;
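
For what it's worth, here is my back-of-the-envelope estimate of the model
size. It assumes - and this is only my assumption, not something I have
verified in the Mahout code - that each mapper holds a dense
numTerms x numTopics matrix of doubles:

public class ModelSizeEstimate {
    public static void main(String[] args) {
        // Assumes each CVB0 mapper holds a dense numTerms x numTopics
        // double matrix - my guess, not verified against the source.
        long numTerms = 60_000;
        long numTopics = 100;
        long bytes = numTerms * numTopics * 8;  // ~46 MB per model copy
        System.out.printf("model copy: ~%d MB%n", bytes / (1024 * 1024));
        // At 20 topics the same matrix is only ~9 MB, which may be why
        // that run fits in a 1 GB container and 100 topics does not,
        // once multiple copies/threads come into play.
    }
}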

Any suggestions? Should I enlarge the container size in Hadoop, or can
I fix this with the LDA parameters?
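
If enlarging the container is the way to go, I assume it would look
something like this (a sketch using the standard Hadoop 2.x properties;
the exact values are placeholders):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Ask YARN for 2 GB map containers instead of the current 1 GB...
conf.set("mapreduce.map.memory.mb", "2048");
// ...and keep the JVM heap safely below the container limit.
conf.set("mapreduce.map.java.opts", "-Xmx1638m");
// Note: yarn.scheduler.maximum-allocation-mb on the cluster must
// allow containers this large.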

Cheers,
David
