Re: [Scikit-learn-general] random forest string data

2013-07-31 Thread Oğuz Yarımtepe
{"word": vocabulary[word], ...} The training data is like [[0.0, 1.0, 'xxx', 'yyy', '13.0', ...], ], so when I use DictVectorizer and run fit_transform, it will create something like array([[ 1., 0.], [ 0., 1.]]), with a different shape and data. I am not sure how I will repla

Re: [Scikit-learn-general] random forest string data

2013-07-31 Thread Lars Buitinck
2013/7/31 Oğuz Yarımtepe : > How will I use DictVectorizer for the string values above? It won't do categorical integer coding directly. You can keep a separate dict of the string values, say vocabulary, then feed DictVectorizer dicts of the form {"word": vocabulary[word], ...} -- Lars Buitinck
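
A minimal sketch of the integer coding described in the reply above, in plain Python. The sample rows, the "count" field, and the helper name encode are hypothetical and used only for illustration; only the idea of a separate vocabulary dict and the {"word": vocabulary[word], ...} shape come from the message.

```python
# Hypothetical training rows; "word" holds a categorical string value.
rows = [
    {"word": "xxx", "count": 1.0},
    {"word": "yyy", "count": 13.0},
    {"word": "xxx", "count": 0.0},
]

# Separate dict mapping each distinct string to an integer code.
vocabulary = {}


def encode(value):
    """Return the integer code for value, assigning a new one on first sight."""
    # len(vocabulary) is evaluated before insertion, so codes start at 0.
    return vocabulary.setdefault(value, len(vocabulary))


# Replace the string with its code; the other fields are left untouched.
encoded = [{**row, "word": encode(row["word"])} for row in rows]

print(vocabulary)
print(encoded)
```

Because the values in encoded are now all numeric, passing these dicts to DictVectorizer should yield one column per key rather than one column per distinct string. The same vocabulary has to be reused when encoding test data, and strings not seen during training need a policy of their own.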

Re: [Scikit-learn-general] random forest string data

2013-07-31 Thread Oğuz Yarımtepe
On Mon, Jul 29, 2013 at 12:19 AM, Ross Boucher wrote: > Interesting, I've been using DictVectorizer (and one hot coded categorical > data) with Random Forests and getting decent results. Is this just > coincidental, and will I see better results if I combine the categorical > data into a single c

Re: [Scikit-learn-general] random forest string data

2013-07-31 Thread Oğuz Yarımtepe
Hi, > What you get from DictVectorizer is a sparse matrix containing one-hot > coded categorical values (booleans). Random forests don't support > those, but fortunately they (should) handle categorical values without > one-hot coding, so you do something like > > I tried with string values and

Re: [Scikit-learn-general] random forest string data

2013-07-29 Thread Lars Buitinck
2013/7/28 Ross Boucher : > Interesting, I've been using DictVectorizer (and one hot coded categorical > data) with Random Forests and getting decent results. Is this just > coincidental, and will I see better results if I combine the categorical > data into a single column? The thing is that dense

Re: [Scikit-learn-general] random forest string data

2013-07-29 Thread Olivier Grisel
If the cardinality of the categorical variable is not too big, the output of the DictVectorizer should be OK if you first convert it to a dense numpy array (by calling `.toarray()` on the CSR instance).
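
A small sketch of the dense-conversion route suggested above, assuming scikit-learn is installed. The feature names "color" and "size", the labels, and the forest settings are made up for illustration and are not taken from the thread.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Hypothetical samples mixing a string category with a numeric feature.
samples = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 13.0},
    {"color": "red", "size": 2.0},
    {"color": "blue", "size": 12.0},
]
labels = [0, 1, 0, 1]

# By default DictVectorizer returns a SciPy sparse matrix in which
# string values are one-hot coded and numeric values are passed through.
vec = DictVectorizer()
X_sparse = vec.fit_transform(samples)

# Convert to a dense NumPy array before fitting the forest.
X = X_sparse.toarray()

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, labels)

# New data must go through the same fitted vectorizer.
X_new = vec.transform([{"color": "red", "size": 1.5}]).toarray()
pred = clf.predict(X_new)
print(X.shape, pred)
```

Here X has one column per distinct color plus one for size. With a high-cardinality category the dense array grows accordingly, which is why the advice is limited to variables whose cardinality is not too big.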

Re: [Scikit-learn-general] random forest string data

2013-07-28 Thread Ross Boucher
Interesting, I've been using DictVectorizer (and one hot coded categorical data) with Random Forests and getting decent results. Is this just coincidental, and will I see better results if I combine the categorical data into a single column? On Sun, Jul 28, 2013 at 9:06 AM, Lars Buitinck wrote:

Re: [Scikit-learn-general] random forest string data

2013-07-28 Thread Lars Buitinck
2013/7/28 Oğuz Yarımtepe : > I had read the scikit preprocessing issues and it seems I should have used > DictVectorizer to encode my categorical string values after I put them in a > dict format. But I am not sure how I will use the resulting output in the > random forest code. What you get from

[Scikit-learn-general] random forest string data

2013-07-28 Thread Oğuz Yarımtepe
Hi, I am trying to use a random forest on a dataset that also includes string values. The dataset that I used for training is a CSV file, but it includes some categorical string values. I had read the scikit preprocessing issues and it seems I should have used DictVectorizer to encode my categor