Can anyone give me a sample algorithm for the one-hot encoding used in
scikit-learn?
On Thu, Jun 20, 2013 at 8:37 PM, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
you can try an ordinal encoding instead - just map each categorical value
to an integer so that you end up with 8 numerical features
You already use one-hot encoding in your example
(preprocessing.OneHotEncoder)?
I'd like to analyse it a bit and encode the data with that method so that it
works with the random forests in scikit-learn.
What do you mean? It's pretty trivial to implement a one-hot encoding; the
issue is that if you use a non-sparse format, you'll end up with a matrix
that is far too dense to be practical for anything but trivial examples.
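For reference, a one-hot encoding can be sketched in a few lines of plain Python. This is a toy illustration, not the scikit-learn implementation, and its dense list-of-lists output is exactly the representation that becomes impractical at scale:

```python
def one_hot_encode(column):
    """Map each distinct categorical value to a binary indicator vector.

    Returns (encoded_rows, categories). Toy dense version: each row
    contains exactly one 1, so memory grows as n_rows * n_categories.
    """
    categories = sorted(set(column))            # fix an ordering of the values
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in column:
        row = [0] * len(categories)             # all zeros ...
        row[index[value]] = 1                   # ... except the matching slot
        encoded.append(row)
    return encoded, categories

rows, cats = one_hot_encode(["red", "green", "red", "blue"])
# cats == ['blue', 'green', 'red']
# rows == [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

A sparse implementation would store only the position of each 1 instead of the full row of zeros, which is what scikit-learn's encoder does by default.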
On Fri, Jun 21, 2013 at 10:46 AM, Maheshakya Wijewardena
Hi,
I'm new to scikit-learn. I'm trying to use preprocessing.OneHotEncoder to
encode my training and test data. After encoding I tried to train a random
forest classifier using that data, but I get the following error when
fitting:
(Here the error trace)
model.fit(X_train, y_train)
Hi,
seems like your sparse matrix is too large to be converted to a dense
matrix. What shape does X have? How many categorical variables do you have
(before applying the OneHotEncoder)?
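A quick way to answer those questions is to inspect the encoder's output before fitting the classifier. This is a sketch using the present-day scikit-learn API, with a tiny hypothetical integer matrix standing in for the real data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data: 4 samples, 2 categorical columns
# (3 distinct values in the first column, 2 in the second).
X = np.array([[0, 1], [1, 0], [2, 1], [0, 0]])

enc = OneHotEncoder()              # returns a scipy.sparse matrix by default
X_enc = enc.fit_transform(X)

print(X_enc.shape)   # (n_samples, total number of binary features) -> (4, 5)
print(X_enc.format)  # sparse storage format, e.g. 'csr'
print(X_enc.nnz)     # one non-zero per original column per row -> 8
```

The number of output columns is the sum of the cardinalities of the categorical variables, which is why high-cardinality columns blow up the feature space.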
Hi Maheshakya,
It's probably right: your feature space is too big and sparse to be
reasonable for random forests. What sort of categorical data are you
encoding? What is the shape of the matrix after applying one-hot encoding?
If you need to use random forests, and not a method that natively …
Hi,
This looks like the dataset from the Amazon challenge currently
running on Kaggle. When one-hot-encoded, you end up with roughly
15000 binary features, which means that the dense representation
requires at least 32000*15000*4 bytes to hold in memory (or even twice
as much, depending on the dtype).
What is the cardinality of each feature?
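The back-of-the-envelope arithmetic above is easy to check (4 bytes per float32 element, 8 per float64, which is NumPy's default dtype):

```python
n_samples, n_features = 32000, 15000     # approximate figures quoted above

bytes_f32 = n_samples * n_features * 4   # dense float32 representation
bytes_f64 = n_samples * n_features * 8   # dense float64 representation

print(bytes_f32 / 1024**3)   # ~1.8 GiB
print(bytes_f64 / 1024**3)   # ~3.6 GiB
```

So even in float32 the dense matrix is close to 2 GB, before counting any copies made during fitting.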
Scikit-learn-general mailing list
2013/6/20 Olivier Grisel olivier.gri...@ensta.org:
Actually twice as much, even on a 32-bit platform (float size is
always 64 bits).
The decision tree code always uses 32-bit floats:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38
but you have to cast your data to np.float32 yourself.
So Maheshakya's `toarray` might work with
`X.astype(np.float32).toarray('F')`...
(But by "might work" I mean it won't throw a ValueError...)
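Spelled out, the suggested cast looks like this. A sketch with a tiny stand-in matrix; whether the real 32769 x 16600 array fits in RAM is a separate question:

```python
import numpy as np
import scipy.sparse as sp

# Tiny stand-in for the one-hot-encoded training matrix.
X = sp.csr_matrix(np.eye(4))

# Cast to float32 *before* densifying, so the dense copy is half the size;
# 'F' (Fortran / column-major) is the order suggested above.
X_dense = X.astype(np.float32).toarray(order='F')

print(X_dense.dtype)                   # float32
print(X_dense.flags['F_CONTIGUOUS'])   # True
```

Casting after `toarray()` would first materialize the full float64 array, defeating the purpose.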
The shape of X after encoding is (32769, 16600). It seems that is too
big to be converted into a dense matrix. Can random forests handle this
number of features?
And yes Gilles, it is the Amazon challenge :D
you can try an ordinal encoding instead - just map each categorical value
to an integer so that you end up with 8 numerical features - if you use
enough trees and grow them deep it may work
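The ordinal alternative can be sketched in plain Python. A toy version: scikit-learn's LabelEncoder (and, in later releases, OrdinalEncoder) does essentially the same thing per column:

```python
def ordinal_encode(column):
    """Replace each categorical value with a small integer code.

    One numerical feature per original column, so 8 categorical
    columns stay 8 columns -- no blow-up in dimensionality.
    """
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[value] for value in column], mapping

codes, mapping = ordinal_encode(["red", "green", "red", "blue"])
# codes == [2, 1, 2, 0]
# mapping == {'blue': 0, 'green': 1, 'red': 2}
```

The integer codes impose an arbitrary ordering on the categories, which is why deep trees (which can split the range into many pieces) are needed to recover the category boundaries.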