Yep Jake. I just went quickly through the paper and the shell scripts that call the algorithm. It is not a MapReduce implementation of LDA; it just uses the Hadoop DFS for storage and uses the mappers to launch the program in parallel.
But I thought the idea is very useful, especially for iterative machine learning algorithms. In many algorithms we have lots of iterations to run, and each Hadoop MapReduce job carries a lot of overhead. Take the Gibbs sampling in LDA: if each MapReduce job has 1 minute of setup and cleanup overhead, then over, say, 900 sampling iterations about 15 hours would be spent on overhead alone, which I guess is not acceptable. But if we could take advantage of the communication layer in this implementation, then for many iterative algorithms we could run a single MapReduce job and have all the iterations happen inside that one mapper pass (a rough sketch of this pattern is in the P.S. at the end of this mail).

Our dev team is working on a parallelized L-BFGS logistic regression these days. Every mapper reads only its local data, and the global weights are updated once per MapReduce iteration. It normally takes 30-50 iterations to converge, so an implementation similar to this LDA one could eliminate at least 1-2 hours of overhead.

And I agree that the current solution is far from a generic framework based on Hadoop, but it is really worth a look, and it might be very valuable to migrate to Hadoop or Mahout.

Best wishes,
Stanley Xu

On Fri, Jun 10, 2011 at 12:49 PM, Jake Mannix <[email protected]> wrote:

> It's all c++, custom distributed processing, custom distributed
> coordination and storage.
>
> We can certainly try to port over the algorithmic ideas, but the
> distributed systems stuff would be a significant departure from our
> current setup - it's not a web service and it's not hadoop, and it's not
> a command line utility - it's a cluster of long-running processes all
> intercommunicating. Sounds awesome, but that's a ways off from where we
> are now.
>
> -jake
>
> On Thu, Jun 9, 2011 at 7:52 PM, Stanley Xu <[email protected]> wrote:
>
> > Awesome! Guess it would be much faster than the current version in
> > Mahout. Is that possible to just use this version in mahout?
> >
> > On Fri, Jun 10, 2011 at 8:12 AM, <[email protected]> wrote:
> >
> > > Yahoo released its hadoop code for LDA
> > >
> > > http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation
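P.S. Just to make the idea concrete, here is a rough, untested sketch in Java of running all the iterations inside a single long-lived mapper. The WeightSync class is only a made-up placeholder for the communication layer (it is not a real Hadoop or Mahout API), and the per-partition gradient step is stubbed out:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IterateInsideMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private static final int MAX_ITERATIONS = 50; // ~30-50 for our L-BFGS job
  private static final int NUM_FEATURES = 10;   // made-up size for the sketch

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);

    // 1. Read the local input split once and keep it in memory, so no new
    //    job (and no per-job setup/cleanup) is needed for later iterations.
    List<String> localData = new ArrayList<String>();
    while (context.nextKeyValue()) {
      localData.add(context.getCurrentValue().toString());
    }

    // 2. Do all iterations locally; only the small weight vector crosses
    //    the network through the (made-up) WeightSync communication layer.
    WeightSync sync = new WeightSync(NUM_FEATURES);
    double[] weights = sync.initialWeights();
    for (int iter = 0; iter < MAX_ITERATIONS && !sync.converged(); iter++) {
      double[] localGradient = computeLocalGradient(localData, weights);
      weights = sync.allReduceAndUpdate(localGradient); // one global update per iteration
    }

    // 3. Write the final model out once, at the very end.
    context.write(NullWritable.get(), new Text(Arrays.toString(weights)));
    cleanup(context);
  }

  // Placeholder for the real per-partition work (gradient step, Gibbs sweep, ...).
  private double[] computeLocalGradient(List<String> data, double[] weights) {
    return new double[weights.length];
  }

  // Made-up stand-in for the communication layer (sockets, MPI-style allreduce,
  // ZooKeeper coordination, ...); stubbed out so the sketch compiles.
  static class WeightSync {
    private final int dim;

    WeightSync(int dim) { this.dim = dim; }
    double[] initialWeights() { return new double[dim]; }
    double[] allReduceAndUpdate(double[] localGradient) { return localGradient; }
    boolean converged() { return false; } // a real one would test the gradient norm
  }
}

The point is just that the input split is read once and only the weight vector moves between iterations, so the job setup/cleanup cost is paid a single time.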
