Re: [scikit-learn] A necessary feature for Decision trees

2018-01-03 Thread Brown J.B. via scikit-learn
Dear Yang Li,

> Neither the classificationTree nor the regressionTree supports categorical
> features. That means the Decision Trees model can only accept continuous
> features.

Consider either manually encoding your categories in bitstrings (e.g.,
"Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder
to do the same thing for you automatically.
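
For instance, a minimal sketch (string categories require a reasonably recent
scikit-learn; older releases need the categories integer-encoded first):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

apps = np.array([["Facebook"], ["Twitter"], ["Google"], ["Twitter"]])
enc = OneHotEncoder()
X = enc.fit_transform(apps).toarray()  # densify the sparse indicator matrix
print(enc.categories_)  # [array(['Facebook', 'Google', 'Twitter'], ...)]
print(X)                # one 0/1 column per app name
```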

Cheers,
J.B.


[scikit-learn] A necessary feature for Decision trees

2018-01-03 Thread 李扬 (Yang Li)
Hi, I'm a graduate student using sklearn for some data work.
While handling data with the Decision Trees library, I found an
inconvenience: neither the classificationTree nor the regressionTree
supports categorical features. That means the Decision Trees model can only
accept continuous features.
For example, a categorical feature such as app name (e.g., Google or
Facebook) can't be fed into the model, because it can't be mapped to a
continuous value in a meaningful way, and the Decision Trees library has no
corresponding algorithm for splitting on discrete features.
However, the CART algorithm itself does account for categorical features, so
I made some modifications to the Decision Trees library based on CART and
applied the new model in my own work. The results show that support for
categorical features indeed improves performance, which I think is very
necessary for decision trees.
I'm very willing to contribute this to the sklearn community, but I'm new to
the community and not so familiar with the procedure.
Could you give some suggestions or comments on this new feature? Or has
anyone already made progress on this feature? Thank you so much.


Best wishes!

--

Yang Li  +86 188 1821 2371
Shanghai Jiao Tong University
School of Electronic, Information and Electrical Engineering F1203026
800 Dongchuan Road, Minhang District, Shanghai 200240


Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2018-01-03 Thread Manuel Castejón Limas
I've read about Dask, and it is a tool I want to have in my belt, especially
for using the SGE connection in order to run GridSearchCV on the
supercomputer center I have access to. Should it work as promised, it will
be one of my favs.

As for my toy example, I have more limited goals with this graph: I am not
currently interested in parallelizing each step, as I guess that
parallelizing each graph fit through GridSearchCV will be closer to what I
need.

I keep working on a proof of concept. You can have a look at:

https://github.com/mcasl/PAELLA/blob/master/pipeGraph.py

along with a few unit tests:
https://github.com/mcasl/PAELLA/blob/master/tests/test_pipeGraph.py

As of today, I have an iterable graph of steps that can be fitted/run
depending on their role (some can be disabled during run while active during
fit, or vice versa). I still have to play a bit with injecting different
parameters to make it compatible with GridSearchCV, and learn a bit about
the memory options in order to cache results.
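
To make this concrete, here is a minimal sketch of the concept (illustrative
names only, not the actual pipeGraph API):

```python
# Illustrative sketch only -- not the actual pipeGraph API.
# Steps are visited in a fixed (topological) order; each step declares
# whether it is active during fit, during run, or in both phases.
class Step:
    def __init__(self, estimator, use_in_fit=True, use_in_run=True):
        self.estimator = estimator
        self.use_in_fit = use_in_fit
        self.use_in_run = use_in_run


class StepGraph:
    def __init__(self, steps):
        self.steps = steps  # assumed to be topologically sorted already

    def fit(self, X, y=None):
        data = X
        for step in self.steps:
            if step.use_in_fit:
                step.estimator.fit(data, y)
                if hasattr(step.estimator, "transform"):
                    data = step.estimator.transform(data)
        return self

    def run(self, X):
        data = X
        for step in self.steps:
            if step.use_in_run and hasattr(step.estimator, "transform"):
                data = step.estimator.transform(data)
        return data
```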

Any comments highly appreciated, truly!
Manolo




2017-12-30 15:34 GMT+01:00 Frédéric Bastien :

> This is starting to look like the Dask project. Do you know it?
>


Re: [scikit-learn] clustering on big dataset

2018-01-03 Thread Shiheng Duan
Yes, it is an efficient method; still, we need to specify the number of
clusters or the threshold. Is there another way to run hierarchical
clustering on a big dataset? The main problem is the distance matrix.
Thanks.
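
For reference, my understanding of the BIRCH usage is roughly the following
sketch (the threshold value is a placeholder); it builds a CF-tree
incrementally and never materializes the full pairwise distance matrix:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X = rng.rand(100000, 10)  # stand-in for the big dataset

birch = Birch(threshold=0.5, n_clusters=None)  # skip the global clustering step
birch.fit(X)

# A much smaller set of subcluster centroids; these can then be passed to
# hierarchical clustering at a fraction of the original problem size.
print(birch.subcluster_centers_.shape)
```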

On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel wrote:

> Have you had a look at BIRCH?
>
> http://scikit-learn.org/stable/modules/clustering.html#birch
>
> --
> Olivier


[scikit-learn] pomegranate v0.9.0 released: probabilistic modeling for Python

2018-01-03 Thread Jacob Schreiber
Howdy all!

I'm pleased to announce the release of pomegranate v0.9.0. The focus of
this release is on missing-value support across all model fitting /
structure learning / inference methods and models. This enables you to do
everything from fitting a multivariate Gaussian distribution to an
incomplete data set (using a GPU if desired!), to learning the structure of
a Bayesian network on an incomplete data set, to running Viterbi decoding
with a hidden Markov model on a sequence that has some missing values.
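
For example, fitting a multivariate Gaussian to incomplete data looks roughly
like this (a sketch against the 0.9-era API, where np.nan marks a missing
entry):

```python
import numpy as np
from pomegranate import MultivariateGaussianDistribution

rng = np.random.RandomState(0)
X = rng.randn(500, 3)
X[rng.rand(*X.shape) < 0.1] = np.nan  # knock out ~10% of the entries

# from_samples ignores the missing entries when estimating the parameters
d = MultivariateGaussianDistribution.from_samples(X)
print(d.parameters)  # estimated mean vector and covariance matrix
```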

Read more about it here: http://bit.ly/2CyrXtX

Thanks!
Jacob


Re: [scikit-learn] MLPClassifier as a feature selector

2018-01-03 Thread Maciek Wójcikowski
I agree with Gael on this one and am happy to help with the PR if you need
any assistance.

Best,
Maciek




Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2017-12-29 18:14 GMT+01:00 Gael Varoquaux :

> I think that a transform method would be good. We would have to add a
> parameter to the constructor to specify which layer is used for the
> transform. It should default to "-1", in my opinion.
>
> Cheers,
>
> Gaël
>
> Sent from my phone. Please forgive typos and briefness.
> On Dec 29, 2017, at 17:48, "Javier López"  wrote:
>
>> Hi Thomas,
>>
>> it is possible to obtain the activation values of any hidden layer, but
>> the procedure is not completely straightforward. If you look at the code
>> of the `_predict` method of MLPs, you can see the following:
>>
>> ```python
>> def _predict(self, X):
>>     """Predict using the trained model
>>
>>     Parameters
>>     ----------
>>     X : {array-like, sparse matrix}, shape (n_samples, n_features)
>>         The input data.
>>
>>     Returns
>>     -------
>>     y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
>>         The decision function of the samples for each class in the model.
>>     """
>>     X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>>
>>     # Make sure self.hidden_layer_sizes is a list
>>     hidden_layer_sizes = self.hidden_layer_sizes
>>     if not hasattr(hidden_layer_sizes, "__iter__"):
>>         hidden_layer_sizes = [hidden_layer_sizes]
>>     hidden_layer_sizes = list(hidden_layer_sizes)
>>
>>     layer_units = [X.shape[1]] + hidden_layer_sizes + [self.n_outputs_]
>>
>>     # Initialize layers
>>     activations = [X]
>>
>>     for i in range(self.n_layers_ - 1):
>>         activations.append(np.empty((X.shape[0],
>>                                      layer_units[i + 1])))
>>     # forward propagate
>>     self._forward_pass(activations)
>>     y_pred = activations[-1]
>>
>>     return y_pred
>> ```
>>
>> The line `y_pred = activations[-1]` is responsible for extracting the
>> values of the last layer, but the `activations` variable contains the
>> values for all the neurons.
>>
>> You can make this function into your own external method (replacing the
>> `self` argument with a proper parameter) and add an extra argument that
>> specifies the layer(s) you want. I have done this myself in order to make
>> an AutoEncoderNetwork out of the MLP implementation.
>>
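>> As a concrete standalone version, here is a rough sketch
>> (`hidden_activations` is a hypothetical helper, and it relies on the
>> private `_forward_pass`, so it may break across scikit-learn versions):
>>
>> ```python
>> import numpy as np
>> from sklearn.utils import check_array
>>
>> def hidden_activations(mlp, X, layer=-1):
>>     """Return the activation values of one layer of a fitted MLP."""
>>     X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>>     hidden_layer_sizes = mlp.hidden_layer_sizes
>>     if not hasattr(hidden_layer_sizes, "__iter__"):
>>         hidden_layer_sizes = [hidden_layer_sizes]
>>     layer_units = [X.shape[1]] + list(hidden_layer_sizes) + [mlp.n_outputs_]
>>     activations = [X]
>>     for i in range(mlp.n_layers_ - 1):
>>         activations.append(np.empty((X.shape[0], layer_units[i + 1])))
>>     mlp._forward_pass(activations)  # fills in every layer's activations
>>     return activations[layer]
>> ```
>>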
>> This makes me wonder: would it be worth adding this to sklearn?
>> A very simple way would be to refactor the `_predict` method, with the
>> additional layer argument, into a new method `_predict_layer`; then
>> `_predict` can simply call `_predict_layer(..., layer=-1)`, and we can add
>> a new method (perhaps a `transform`?) that allows getting (raveled) values
>> for an arbitrary subset of the layers.
>>
>> I'd be happy to submit a PR if you guys think it would be interesting for
>> the project.
>>
>> Javier
>>
>>
>>
>> On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis wrote:
>>
>>> Greetings,
>>>
>>> I want to train an MLPClassifier with one hidden layer and use it as a
>>> feature selector for an MLPRegressor.
>>> Is it possible to get the values of the neurons from the last hidden
>>> layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>>
>>> If it is not possible with scikit-learn, is anyone aware of any
>>> scikit-compatible NN library that offers this functionality? For example
>>> this one:
>>>
>>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>>
>>> I wouldn't like to do this in TensorFlow because the MLP there is much
>>> slower than scikit-learn's implementation.
>>>