Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-21 Thread Maheshakya Wijewardena
Can anyone give me a sample algorithm for one-hot encoding used in scikit-learn? On Thu, Jun 20, 2013 at 8:37 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: you can try an ordinal encoding instead - just map each categorical value to an integer so that you end up with 8 numerical

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-21 Thread Peter Prettenhofer
? You already use one-hot encoding in your example (preprocessing.OneHotEncoder). 2013/6/21 Maheshakya Wijewardena pmaheshak...@gmail.com: Can anyone give me a sample algorithm for one-hot encoding used in scikit-learn? On Thu, Jun 20, 2013 at 8:37 PM, Peter Prettenhofer
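[Editor's note: for reference, a minimal sketch of the encoder being discussed; the toy data and variable names below are made up for illustration, not taken from the thread.]

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Made-up integer-coded categorical data: 3 samples, 2 categorical columns.
    X = np.array([[0, 1],
                  [1, 0],
                  [2, 1]])

    encoder = OneHotEncoder()            # returns a sparse matrix by default
    X_onehot = encoder.fit_transform(X)
    print(X_onehot.shape)                # (3, 5): one binary column per category value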

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-21 Thread Maheshakya Wijewardena
I'd like to analyse it a bit and encode using that method so that it works well with random forests in scikit-learn. On Fri, Jun 21, 2013 at 2:08 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: ? You already use one-hot encoding in your example (preprocessing.OneHotEncoder) 2013/6/21

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-21 Thread federico vaggi
What do you mean? It's pretty trivial to implement a one-hot encoding; the issue is that if you use a non-sparse format you'll end up with a matrix that is far too large to be practical for anything but trivial examples. On Fri, Jun 21, 2013 at 10:46 AM, Maheshakya Wijewardena

[Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Maheshakya Wijewardena
Hi, I'm new to scikit-learn. I'm trying to use preprocessing.OneHotEncoder to encode my training and test data. After encoding I tried to train a random forest classifier on that data, but I get the following error when fitting. (Here is the error trace) 99 model.fit(X_train, y_train) 100
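[Editor's note: a rough reconstruction of the workflow described above; the variable names X_train_raw, X_test_raw, y_train and the classifier settings are guesses, since the original code and full traceback are not shown.]

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier

    encoder = OneHotEncoder()
    X_train = encoder.fit_transform(X_train_raw)   # sparse 0/1 matrix
    X_test = encoder.transform(X_test_raw)

    model = RandomForestClassifier()
    # The forest needs a dense array; converting a sparse matrix of this
    # size to dense is what raises the error discussed in the replies below.
    model.fit(X_train, y_train)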

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Peter Prettenhofer
Hi, it seems like your sparse matrix is too large to be converted to a dense matrix. What shape does X have? How many categorical variables do you have (before applying the OneHotEncoder)?

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Joel Nothman
Hi Maheshakya, It's probably right: your feature space is too big and sparse to be reasonable for random forests. What sort of categorical data are you encoding? What is the shape of the matrix after applying one-hot encoding? If you need to use random forests, and not a method that natively

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Gilles Louppe
Hi, This looks like the dataset from the Amazon challenge currently running on Kaggle. When one-hot-encoded, you end up with roughly 15000 binary features, which means that the dense representation requires at least 32000*15000*4 bytes to hold in memory (or even twice as much depending on
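[Editor's note: spelling out that back-of-the-envelope calculation, using the same figures quoted in the message.]

    n_samples = 32000     # roughly the number of rows in the training set
    n_features = 15000    # roughly the number of binary columns after one-hot encoding

    dense_float32 = n_samples * n_features * 4   # ~1.9e9 bytes, about 1.8 GiB
    dense_float64 = n_samples * n_features * 8   # ~3.8e9 bytes, about 3.6 GiB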

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Olivier Grisel
What is the cardinality of each feature?

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Lars Buitinck
2013/6/20 Gilles Louppe g.lou...@gmail.com: This looks like the dataset from the Amazon challenge currently running on Kaggle. When one-hot-encoded, you end up with roughly 15000 binary features, which means that the dense representation requires at least 32000*15000*4 bytes to hold in memory

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Lars Buitinck
2013/6/20 Olivier Grisel olivier.gri...@ensta.org: Actually twice as much, even on a 32-bit platform (float size is always 64 bits). The decision tree code always uses 32-bit floats: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38 but you have to cast

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Olivier Grisel
2013/6/20 Lars Buitinck l.j.buiti...@uva.nl: 2013/6/20 Olivier Grisel olivier.gri...@ensta.org: Actually twice as much, even on a 32-bit platform (float size is always 64 bits). The decision tree code always uses 32-bit floats:

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Joel Nothman
So Maheshakya's `toarray` might work with `X.astype(np.float32).toarray('F')`... (But by "might work" I mean it won't throw a ValueError...) On Thu, Jun 20, 2013 at 11:56 PM, Olivier Grisel olivier.gri...@ensta.org wrote: 2013/6/20 Lars Buitinck l.j.buiti...@uva.nl: 2013/6/20 Gilles Louppe
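[Editor's note: a sketch of that suggestion, assuming X_train is the sparse output of the encoder, model and y_train are as in the original code, and the machine has enough RAM for the roughly 2 GB dense copy.]

    import numpy as np

    # Cast to float32 first so the dense copy is half the size, and request
    # Fortran (column-major) order as in Joel's snippet, presumably to match
    # what the tree code expects and avoid yet another copy inside fit.
    X_dense = X_train.astype(np.float32).toarray(order='F')
    model.fit(X_dense, y_train)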

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Olivier Grisel
2013/6/20 Lars Buitinck l.j.buiti...@uva.nl: 2013/6/20 Gilles Louppe g.lou...@gmail.com: This looks like the dataset from the Amazon challenge currently running on Kaggle. When one-hot-encoded, you end up with roughly 15000 binary features, which means that the dense representation requires

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Maheshakya Wijewardena
The shape of X after encoding is (32769, 16600). It seems that is too big to be converted into a dense matrix. Can random forest handle this many features? On Thu, Jun 20, 2013 at 7:31 PM, Olivier Grisel olivier.gri...@ensta.org wrote: 2013/6/20 Lars Buitinck l.j.buiti...@uva.nl:

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Maheshakya Wijewardena
And yes Gilles, it is the Amazon challenge :D On Thu, Jun 20, 2013 at 8:21 PM, Maheshakya Wijewardena pmaheshak...@gmail.com wrote: The shape of X after encoding is (32769, 16600). It seems that is too big to be converted into a dense matrix. Can random forest handle this many

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Peter Prettenhofer
You can try an ordinal encoding instead - just map each categorical value to an integer so that you end up with 8 numerical features - if you use enough trees and grow them deep it may work. 2013/6/20 Maheshakya Wijewardena pmaheshak...@gmail.com: And yes Gilles, it is the Amazon challenge :D
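[Editor's note: a minimal sketch of such an ordinal encoding. scikit-learn had no dedicated ordinal encoder at the time, but LabelEncoder can be applied column by column; X_raw is an assumed name for the raw (n_samples, 8) categorical array.]

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    def ordinal_encode(X):
        """Map each categorical column to integer codes, one column at a time."""
        X_enc = np.empty(X.shape, dtype=np.int64)
        for j in range(X.shape[1]):
            X_enc[:, j] = LabelEncoder().fit_transform(X[:, j])
        return X_enc

    X_ordinal = ordinal_encode(X_raw)   # 8 numerical features instead of ~15000 binary ones
    # A forest with many deep trees can then split on these integer codes directly.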