2012/10/25 Nikit Saraf <[email protected]>:
> Hi
>
> I'm fairly new to the field of Machine Learning and as a result new user of
> scikit-learn. I'm learning about the Map Reduce parallel implementation of
> Machine Learning Algorithms in python. So I was thinking of ways to
> MapReduce the cross-validation. Anyone having any ideas on how to translate
> the cross-validation to MapReduce would be heartily welcomed.

Cross-validation is embarrassingly parallel. You don't really need a reduce
stage (except for averaging the scores across CV folds).

You could send a different random seed to each mapper and let each of them
subsample the data with something akin to a scalable / streaming version of
one step of StratifiedShuffleSplit, then stream those samples to an online
learning algorithm (with a loop around calls to partial_fit on chunks of
samples that fit in the memory of one mapper). For that, the mappers need to
manage direct access to the distributed file system instead of receiving a
stream of (k, v) pairs provided by the Job Manager.

But basically this would be working around the design of the MapReduce
framework for not much gain IMHO. The new YARN infrastructure of Hadoop would
be much more amenable to hosting a CV computation application.

I would recommend you have a look at http://arxiv.org/pdf/1209.2191.pdf for
some practical applications of MapReduce to machine learning (the model
averaging use case is interesting IMHO).

See also this previous thread on the mailing list:
http://comments.gmane.org/gmane.comp.python.scikit-learn/2960

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
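
P.S. To make the seed-per-mapper idea above a bit more concrete, here is a
minimal single-process sketch. It is pure Python and only mimics the pattern:
`make_data`, `stratified_split`, `OnlinePerceptron`, `mapper` and `reducer`
are hypothetical names made up for illustration, not scikit-learn or Hadoop
APIs. The toy data lives in memory, whereas a real mapper would read its
chunks from the distributed file system.

```python
import random
from statistics import mean

def make_data(n=200, seed=0):
    """Toy two-class data: class 1 is shifted by +2 on both features."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = i % 2
        x = (rng.gauss(2.0 * y, 1.0), rng.gauss(2.0 * y, 1.0))
        data.append((x, y))
    return data

def stratified_split(data, test_fraction, rng):
    """One random train/test split that preserves class proportions."""
    train, test = [], []
    for label in sorted({y for _, y in data}):
        group = [s for s in data if s[1] == label]
        rng.shuffle(group)
        n_test = max(1, int(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    return train, test

class OnlinePerceptron:
    """Tiny online linear classifier with a partial_fit-style update."""
    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict_one(self, x):
        activation = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if activation > 0 else 0

    def partial_fit(self, chunk):
        # Classic perceptron rule: update only on misclassified samples.
        for x, y in chunk:
            error = y - self.predict_one(x)  # -1, 0 or +1
            if error:
                self.w = [wi + error * xi for wi, xi in zip(self.w, x)]
                self.b += error

    def score(self, samples):
        return mean(1.0 if self.predict_one(x) == y else 0.0
                    for x, y in samples)

def mapper(seed, data, test_fraction=0.25, chunk_size=20):
    """One CV iteration: the seed alone determines the split."""
    rng = random.Random(seed)
    train, test = stratified_split(data, test_fraction, rng)
    model = OnlinePerceptron(n_features=2)
    # Feed the training set chunk by chunk, as if it did not fit in memory.
    for start in range(0, len(train), chunk_size):
        model.partial_fit(train[start:start + chunk_size])
    return model.score(test)

def reducer(scores):
    """The only 'reduce' work needed: average the per-split scores."""
    return mean(scores)

data = make_data()
scores = [mapper(seed, data) for seed in range(5)]  # one seed per mapper
cv_score = reducer(scores)
print(scores, cv_score)
```

Each call to `mapper` is independent of the others, so the list comprehension
could be replaced by any parallel map; the averaging step is the only place
where results meet.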
