Re: [Scikit-learn-general] Python MapReduce

Olivier Grisel Thu, 25 Oct 2012 00:40:11 -0700

2012/10/25 Nikit Saraf <[email protected]>:
> Hi
>
> I'm fairly new to the field of Machine Learning and as a result new user of
> scikit-learn. I'm learning about the Map Reduce parallel implementation of
> Machine Learning Algorithms in python. So I was thinking of ways to
> MapReduce the cross-validation. Anyone having any ideas on how to translate
> the cross-validation to MapReduce would be heartily welcomed.


Cross-validation is embarrassingly parallel. You don't really need a
reduce stage (except for averaging the scores across CV folds). You could
send a different random seed to each mapper, let each of them
subsample the data with something akin to a scalable / streaming
version of one step of StratifiedShuffleSplit, and then stream those
samples to an online learning algorithm (with a loop around calls to
partial_fit on chunks of samples that fit in the memory of one mapper).
The mappers would need direct access to the distributed file system
instead of receiving a stream of (k, v) pairs provided by the Job Manager.
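
To make the idea above concrete, here is a rough, single-process sketch
of the seed-per-mapper scheme. Everything in it is an assumption for
illustration: TinyPerceptron is a hypothetical stand-in for a
scikit-learn estimator that exposes partial_fit (such as SGDClassifier),
read_samples stands for whatever reads the samples from the distributed
file system, and stream_split is only a coin-flip split that preserves
class proportions in expectation, not a real streaming
StratifiedShuffleSplit. The "map" step is a plain loop here; on a
cluster each seed would go to its own mapper.

```python
import random
from statistics import mean


class TinyPerceptron:
    """Hypothetical stand-in for an online estimator with partial_fit."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict_one(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def partial_fit(self, X, y):
        # Classic perceptron update, applied sample by sample.
        for xi, yi in zip(X, y):
            err = yi - self.predict_one(xi)
            if err:
                self.w = [w + err * v for w, v in zip(self.w, xi)]
                self.b += err
        return self


def read_samples():
    """Toy data source (assumption): two Gaussian blobs, labels 0 and 1."""
    rng = random.Random(0)
    for _ in range(400):
        y = rng.randint(0, 1)
        center = 2.0 if y else -2.0
        yield (rng.gauss(center, 1.0), rng.gauss(center, 1.0)), y


def stream_split(samples, seed, test_fraction=0.25):
    """Tag each streamed sample as test or train with a seeded coin flip."""
    rng = random.Random(seed)
    for x, y in samples:
        yield rng.random() < test_fraction, x, y


def mapper(seed, read_samples, n_features, chunk_size=50):
    """One CV iteration: train on chunks, then score on the held-out part."""
    model = TinyPerceptron(n_features)
    chunk = []
    # First pass: feed the training samples to partial_fit chunk by chunk.
    for is_test, x, y in stream_split(read_samples(), seed):
        if is_test:
            continue
        chunk.append((x, y))
        if len(chunk) == chunk_size:
            X, Y = zip(*chunk)
            model.partial_fit(X, Y)
            chunk = []
    if chunk:
        X, Y = zip(*chunk)
        model.partial_fit(X, Y)
    # Second pass: the same seed reproduces the same split, so the
    # held-out samples never need to be kept in memory.
    hits = total = 0
    for is_test, x, y in stream_split(read_samples(), seed):
        if is_test:
            total += 1
            hits += model.predict_one(x) == y
    return hits / max(total, 1)


# "Map": one seed per mapper. "Reduce": average the scores.
scores = [mapper(seed, read_samples, n_features=2) for seed in range(5)]
mean_score = mean(scores)
print(scores, mean_score)
```

Reading the data twice with the same seed trades some I/O for a bounded
memory footprint; a real job might cache the held-out samples instead.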

But basically this would be working around the design of the MapReduce
framework for little gain IMHO. The new YARN infrastructure of Hadoop
would be much better suited to hosting a CV computation application
instead.

I would recommend having a look at
http://arxiv.org/pdf/1209.2191.pdf for some practical applications of
MapReduce to machine learning (the model averaging use case is
interesting IMHO).

See also this previous thread on the mailing list:
http://comments.gmane.org/gmane.comp.python.scikit-learn/2960

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

