Stanley,

The short answer is that this is a real problem.

Try this:

*Spark: Cluster Computing with Working Sets.* Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, in HotCloud 2010, June 2010.

Or this:

http://www.iterativemapreduce.org/
http://code.google.com/p/haloop/

You may be interested in experimenting with MapReduce 2.0. That allows more flexibility in the execution model:

http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/

Systems like FlumeJava (and my open-source, incomplete clone Plume) may help with flexibility:

http://www.deepdyve.com/lp/association-for-computing-machinery/flumejava-easy-efficient-data-parallel-pipelines-xtUvap2t1I
https://github.com/tdunning/Plume/commit/a5a10feaa068b33b1d929c332e4614aba50dd39a

(A rough sketch of the in-memory iterative pattern that systems like Spark enable is included below the quoted message.)

On Thu, May 5, 2011 at 2:16 AM, Stanley Xu <[email protected]> wrote:

> Dear All,
>
> Our team is trying to implement a parallelized LDA with Gibbs sampling. We
> are using the algorithm mentioned by plda: http://code.google.com/p/plda/
>
> The problem is that with the MapReduce method the paper describes, we need
> to run a MapReduce job for every Gibbs sampling iteration, and in our tests
> it normally takes 1000-2000 iterations on our data to converge. But as we
> know, there is a cost to create and clean up the mapper/reducer in every
> iteration. That takes about 40 seconds on our cluster per our test, and
> 1000 iterations means almost 12 hours.
>
> I am wondering if there is a way to reduce the cost of mapper/reducer
> setup/cleanup, since I would prefer to have all the mappers read the same
> local data and update that local data within the mapper process. The only
> other update they need comes from the reducer, which is a pretty small
> amount of data compared to the whole dataset.
>
> Is there any approach I could try (including changing part of Hadoop's
> source code)?
>
> Best wishes,
> Stanley Xu
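To make the contrast with per-iteration MapReduce jobs concrete, here is a minimal sketch of the kind of iterative driver an in-memory framework like Spark allows. It assumes the org.apache.spark Scala API; the LDA-specific helpers (parseDocument, resampleDocument, mergeCounts, applyDeltas) and the cluster URL are hypothetical placeholders for the plda Gibbs-sampling logic, not part of any real library. The point is only that the corpus is loaded and cached once, so each of the 1000+ iterations reuses in-memory partitions instead of paying the job setup/cleanup cost every time.

import org.apache.spark.SparkContext

object GibbsLdaSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical master URL and application name.
    val sc = new SparkContext("spark://master:7077", "gibbs-lda-sketch")

    // Load and parse the corpus once; cache() keeps the partitions in memory
    // on the workers so later iterations skip the HDFS read and task setup.
    val docs = sc.textFile("hdfs:///corpus/docs.txt").map(parseDocument).cache()

    // Global topic counts: small compared to the corpus, so shipping them to
    // every worker each iteration is cheap.
    var topicCounts = initialTopicCounts()

    for (iter <- 1 to 1000) {
      val counts = topicCounts  // snapshot captured by the closure below
      // Each worker resamples its local documents against the current global
      // counts, then the small per-partition deltas are aggregated.
      val deltas = docs.map(d => resampleDocument(d, counts)).reduce(mergeCounts)
      topicCounts = applyDeltas(topicCounts, deltas)
    }

    sc.stop()
  }

  // --- hypothetical helpers, shown only to keep the sketch self-contained ---
  def parseDocument(line: String): Array[Int] = line.split(" ").map(_.toInt)
  def initialTopicCounts(): Array[Long] = Array.fill(100)(0L)
  def resampleDocument(doc: Array[Int], counts: Array[Long]): Array[Long] = counts
  def mergeCounts(a: Array[Long], b: Array[Long]): Array[Long] =
    a.zip(b).map { case (x, y) => x + y }
  def applyDeltas(counts: Array[Long], deltas: Array[Long]): Array[Long] = deltas
}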
