Re: [Scikit-learn-general] DecisionTree: How to split categorical features into two subsets instead of a single value and the rest?

2015-09-12 Thread Gilles Louppe
Hi Rex, This is currently not supported in scikit-learn. Gilles On 12 September 2015 at 05:02, Rex X wrote: > Given categorical attributes, for instance > city = ['a', 'b', 'c', 'd', 'e', 'f'] > > With DictVectorizer(), we can transform "city" into a sparse matrix, using > 1-of-k representation
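For readers following the question, here is a minimal sketch of the 1-of-k encoding Rex describes. The `records` list and variable names are illustrative assumptions, not taken from the original message. Each resulting column is a "city equals this value" indicator, so a threshold split on one column can only separate a single value from the rest.

```python
from sklearn.feature_extraction import DictVectorizer

# Illustrative data (assumption): one record per city value from the question.
cities = ['a', 'b', 'c', 'd', 'e', 'f']
records = [{'city': c} for c in cities]

# DictVectorizer returns a scipy sparse matrix by default.
vec = DictVectorizer()
X = vec.fit_transform(records)

# String-valued features become one column per value, named "city=<value>".
feature_names = vec.feature_names_
dense = X.toarray()

print(feature_names)
print(dense)
```

Every row contains exactly one nonzero entry, which is why a tree trained on this matrix tests one category at a time.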

Re: [Scikit-learn-general] DecisionTree: How to split categorical features into two subsets instead of a single value and the rest?

2015-09-12 Thread Jacob Schreiber
Hi Rex, As Gilles said, this is currently not supported in sklearn. It may be possible to do this with post-processing, by checking whether child splits produce the same result. For example, if city == 'a' return 0, else if city == 'b' return 0, else 1 (a simple two-node decision tree) can be merged into if
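The post-processing idea above might be sketched roughly as follows. This is a plain-Python illustration, not scikit-learn API: `merge_equal_leaves` and `predict` are hypothetical helpers, and the rule list is assumed to have already been read off a chain of one-vs-rest splits by hand.

```python
from collections import defaultdict


def merge_equal_leaves(rules, default):
    """Group category values whose one-vs-rest leaves predict the same thing.

    rules:   (category_value, prediction) pairs, one per `feature == value` split.
    default: prediction of the final 'else' branch.
    Returns {prediction: set_of_values}; values predicting `default` are dropped
    because the 'else' branch already covers them.
    """
    merged = defaultdict(set)
    for value, prediction in rules:
        if prediction != default:
            merged[prediction].add(value)
    return dict(merged)


def predict(merged, default, value):
    """Evaluate the merged rule: a subset-membership test instead of a chain."""
    for prediction, values in merged.items():
        if value in values:
            return prediction
    return default


# The example from the message: city == 'a' -> 0, city == 'b' -> 0, else 1.
rules = [('a', 0), ('b', 0)]
merged = merge_equal_leaves(rules, default=1)
print(merged)
print(predict(merged, 1, 'a'), predict(merged, 1, 'c'))
```

Because 1-of-k columns are mutually exclusive, the order of the equality tests does not matter, which is what makes grouping them into a set safe in this sketch.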

Re: [Scikit-learn-general] GridSearchCV over OneVsRest wrapping SVC

2015-09-12 Thread Andy
On 09/12/2015 02:54 AM, Daan Wynen wrote: Hi Andy, according to [1] "The multiclass support is handled according to a one-vs-one scheme." That's why I was using the wrapper. SVC has one-vs-rest built-in. What should the docs say? The multiclass docs here are quite explicit: http://scikit-learn

Re: [Scikit-learn-general] GridSearchCV over OneVsRest wrapping SVC

2015-09-12 Thread Michael Eickenberg
On Saturday, September 12, 2015, Andy wrote: > On 09/12/2015 02:54 AM, Daan Wynen wrote: > > Hi Andy, > > according to [1] "The multiclass support is handled according to a > one-vs-one scheme." > That's why I was using the wrapper. > > SVC has one-vs-rest built-in. > actually, it has one-vs-one

Re: [Scikit-learn-general] GridSearchCV over OneVsRest wrapping SVC

2015-09-12 Thread Andy
On 09/12/2015 01:25 PM, Michael Eickenberg wrote: On Saturday, September 12, 2015, Andy wrote: On 09/12/2015 02:54 AM, Daan Wynen wrote: Hi Andy, according to [1] "The multiclass support is handled according to a one-vs-one scheme." That's why
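As a rough illustration of the setup under discussion in this thread, the sketch below grid-searches an SVC wrapped in OneVsRestClassifier. The dataset, parameter values, and variable names are assumptions chosen for the example. It uses the current `sklearn.model_selection` import; at the time of the thread, GridSearchCV lived in `sklearn.grid_search`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Illustrative three-class dataset (assumption, not from the thread).
X, y = load_iris(return_X_y=True)

# The wrapper fits one binary SVC per class.
ovr = OneVsRestClassifier(SVC(kernel='rbf'))

# Parameters of the wrapped SVC are addressed through the "estimator__" prefix.
param_grid = {
    'estimator__C': [0.1, 1, 10],
    'estimator__gamma': [0.01, 0.1],
}

search = GridSearchCV(ovr, param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_
n_binary = len(search.best_estimator_.estimators_)
print(best_params, n_binary)
```

The `estimator__` prefix is the part that tends to trip people up: passing `C` directly in the grid would fail, because `C` belongs to the inner SVC and not to the wrapper.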

Re: [Scikit-learn-general] DecisionTree: How to split categorical features into two subsets instead of a single value and the rest?

2015-09-12 Thread Rex X
Gilles and Jacob, Thanks for the answer! Best, Rex On Sat, Sep 12, 2015 at 8:31 AM, Jacob Schreiber wrote: > Hi Rex > > As Gilles said, this currently is not supported in sklearn. It may be > possible to do this with post processing, by checking to see if child > splits produced the same resul

[Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Rex X
What is the best way to migrate existing scikit-learn code to a PySpark cluster? Then we could combine the full power of both scikit-learn and Spark to do scalable machine learning. Currently I use Python's multiprocessing module to speed things up. But this only works on one node, while the

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Gilles Louppe
Hi, > But the question is how to make the scikit-learn code, DecisionTreeRegressor > for example, run in distributed computing mode, to benefit from the power of > Spark? I am sorry, but you can't. The tree implementation in scikit-learn was not designed for this use case. Maybe you should have

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Jacob Schreiber
As a side note, multithreaded single decision tree training is something on our radar. It may be possible that afterwards we work towards supporting distributed training, but I wouldn't count on it for a while. On Sat, Sep 12, 2015 at 10:18 AM, Gilles Louppe wrote: > Hi, > > > But the question

[Scikit-learn-general] Sprint in Paris, October 19th to 23rd

2015-09-12 Thread Nelle Varoquaux
Hello scikit-learners, We are organizing a sprint in Paris in October. If you plan to attend, please add yourself to the list of attendees on the wiki [1]. Please specify whether you need funding and/or accommodation as well. We will do our best to find funding for as many people as possible. If y

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Rex X
This project looks interesting https://github.com/lensacom/sparkit-learn and a nice coded project name :) On Sat, Sep 12, 2015 at 11:24 AM, Jacob Schreiber wrote: > As a side note, multithreaded single decision tree training is

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Andreas Mueller
On 09/12/2015 04:56 PM, Rex X wrote: This project looks interesting https://github.com/lensacom/sparkit-learn and a nice coded project name :) In sparkit-learn, the learning either happens on a single machine, or separate mo

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Sebastian Raschka
Interesting! Is it (sparkit-learn) a Python wrapper for the Spark Scala code (e.g., like PySpark & MLlib) or is it running scikit-learn Python code on distributed systems? > On Sep 12, 2015, at 8:54 PM, Andreas Mueller wrote: > > > > On 09/12/2015 04:56 PM, Rex X wrote: >> This project looks

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Manoj Kumar
It seems to me that it is the latter. A quick example is the predict method of SparkLogisticRegression (https://github.com/lensacom/sparkit-learn/blob/master/splearn/linear_model/logistic.py#L139). The input is an ArrayRDD, which is a wrapper around the RDD in Spark, but with numpy-like operation