Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Jacob Schreiber
As a side note, multithreaded training of a single decision tree is on our radar. Afterwards we may work towards supporting distributed training, but I wouldn't count on it for a while.

On Sat, Sep 12, 2015 at 10:18 AM, Gilles Louppe wrote:
> Hi,

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Rex X
This project looks interesting https://github.com/lensacom/sparkit-learn and a nice coded project name :)

On Sat, Sep 12, 2015 at 11:24 AM, Jacob Schreiber wrote:
> As a side note, multithreaded single

[Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Rex X
What is the best way to migrate existing scikit-learn code to a PySpark cluster? Then we could combine the full power of scikit-learn and Spark to do scalable machine learning. Currently I use Python's multiprocessing module to speed things up. But this only works for one node, while

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Gilles Louppe
Hi,

> But the question is how to make the scikit-learn code, decisionTree Regressor
> for example, running in distributed computing mode, to benefit the power of
> Spark?

I am sorry, but you can't. The tree implementation in scikit-learn was not designed for this use case. Maybe you should have

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Andreas Mueller
On 09/12/2015 04:56 PM, Rex X wrote:
> This project looks interesting https://github.com/lensacom/sparkit-learn and a nice coded project name :)

In sparkit-learn, the learning either happens on a single machine, or separate

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Sebastian Raschka
Interesting! Is it (sparkit-learn) a Python wrapper for the Spark Scala code (e.g., like PySpark & MLlib), or is it running scikit-learn Python code on distributed systems?

> On Sep 12, 2015, at 8:54 PM, Andreas Mueller wrote:
>
> On 09/12/2015 04:56 PM, Rex X wrote:
>>

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Manoj Kumar
It seems to me that it is the latter. A quick example is the predict method of SparkLogisticRegression (https://github.com/lensacom/sparkit-learn/blob/master/splearn/linear_model/logistic.py#L139). The input is an ArrayRDD, which is a wrapper around the RDD in Spark, but with numpy-like