Hi,
I am running LDA on 18k documents. Each document has about 5k terms,
with 300k terms in total. The number of topics is set to 100.
Running LDA on a single-node Hadoop configuration takes about 5 hours
per stage, so 20 stages would take about 100 hours.
However, running on Amazon EMR with 20 machines is actually much
slower: it takes about 1000 minutes per stage (roughly 10 minutes per
1% of map progress). The reduce phase is much faster; it takes only
seconds, which is almost negligible.
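
To put the two setups side by side, here is a quick back-of-the-envelope
calculation. It is only a sketch: the variable names are made up for
illustration, and the inputs are the rough timings quoted above, not
precise measurements.

```python
# Rough timings from the runs described above (approximate figures).
STAGES = 20

single_node_min_per_stage = 5 * 60   # about 5 hours per stage
emr_min_per_percent_map = 10         # about 10 minutes per 1% of map progress
# Reduce time is a matter of seconds, so it is ignored here.
emr_min_per_stage = emr_min_per_percent_map * 100

single_node_total_h = single_node_min_per_stage * STAGES / 60
emr_total_h = emr_min_per_stage * STAGES / 60
slowdown = emr_min_per_stage / single_node_min_per_stage

print(f"single node:       {single_node_total_h:.0f} h for {STAGES} stages")
print(f"EMR (20 machines): {emr_total_h:.0f} h for {STAGES} stages")
print(f"EMR is about {slowdown:.1f}x slower per stage")
```

So the 20-machine cluster is roughly three times slower per stage than
the single node, instead of being faster.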
Has anyone had a similar experience, or is my setup wrong?
Chris