Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Eustache DIEMERT
Hi Alex, If I understand correctly, you are using two different kinds of features: categorical and n-grams. In a similar situation, but in a classification setting, a trick that worked reasonably well was to train two different models, one feeding the other. I.e. build a first model out of ngrams/nlp f
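A minimal sketch of the "one model feeding the other" idea, adapted to regression. The message is cut off before it describes the second stage, so everything below is an assumption: the synthetic data, the choice of Ridge for the text model, the use of out-of-fold predictions via cross_val_predict, and the random forest as second stage. The import path sklearn.model_selection is from current scikit-learn releases, not the 2013 one.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
n_samples = 300

# Hypothetical stand-ins: a wide sparse block (TF-IDF n-grams) and a
# narrow dense block (a handful of one-hot categorical columns).
X_text = sparse.random(n_samples, 1000, density=0.01, format="csr", random_state=rng)
X_cat = rng.randint(0, 2, size=(n_samples, 5)).astype(float)
y = np.asarray(X_text.sum(axis=1)).ravel() + X_cat[:, 0] + 0.1 * rng.randn(n_samples)

# Stage 1: a linear model that handles sparse input directly.
# Out-of-fold predictions keep the training labels from leaking into stage 2.
text_pred = cross_val_predict(Ridge(alpha=1.0), X_text, y, cv=5)

# Stage 2: the forest only sees the categorical columns plus one extra
# column summarising the text features, so a dense array is affordable.
X_stage2 = np.column_stack([X_cat, text_pred])
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_stage2, y)
```

At prediction time the first model would be refit on all training data and its output appended to the categorical columns of the new samples in the same way.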

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Olivier Grisel
2013/4/24 Alex Kopp:
> Thanks, guys.
>
> Perhaps I should explain what I am trying to do and then open it up for
> suggestions.
>
> I have 203k training examples each with 457k features. The features are
> composed of one-hot encoded categorical values as well as stemmed, TFIDF
> weighted unigrams

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Peter Prettenhofer
Have you tried tuning the hyper-parameters of the SGDRegressor? You really need to tune the learning rate for SGDRegressor (SGDClassifier has a pretty decent default). E.g. set up a grid search with a constant learning rate and try different values of eta0 ([0.1, 0.01, 0.001, 0.0001]). You can also s
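The grid search over eta0 suggested above could look roughly like this. It is only a sketch: the synthetic sparse data, the 3-fold CV, the scoring choice and max_iter are assumptions, and the import path sklearn.model_selection is the one used by current scikit-learn releases rather than the 2013 one.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic sparse regression problem standing in for the real data.
rng = np.random.RandomState(0)
X = sparse.random(500, 200, density=0.05, format="csr", random_state=rng)
true_coef = rng.randn(200)
y = X @ true_coef + 0.01 * rng.randn(500)

# Constant learning rate, with eta0 taken from the values listed in the message.
param_grid = {"eta0": [0.1, 0.01, 0.001, 0.0001]}
base = SGDRegressor(learning_rate="constant", max_iter=50, tol=None, random_state=0)

search = GridSearchCV(base, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)  # sparse input is accepted as-is

best_eta0 = search.best_params_["eta0"]
print(best_eta0, search.best_score_)
```

The same grid can be extended with other SGDRegressor parameters such as alpha or penalty once a workable eta0 range is known.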

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Alex Kopp
Thanks, guys. Perhaps I should explain what I am trying to do and then open it up for suggestions. I have 203k training examples each with 457k features. The features are composed of one-hot encoded categorical values as well as stemmed, TFIDF weighted unigrams and bigrams (NLP). As you can proba

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Peter Prettenhofer
2013/4/24 Olivier Grisel
> 2013/4/24 Peter Prettenhofer:
> > I totally agree with Brian - although I'd suggest you drop option 3) because
> > it will be a lot of work.
> >
> > I'd suggest you rather should do a) feature extraction or b) feature
> > selection.
> >
> > Personally, I think decisi

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Olivier Grisel
2013/4/24 Peter Prettenhofer:
> I totally agree with Brian - although I'd suggest you drop option 3) because
> it will be a lot of work.
>
> I'd suggest you rather should do a) feature extraction or b) feature
> selection.
>
> Personally, I think decision trees in general and random forest in
> pa

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Peter Prettenhofer
I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather should do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forest in particular are not a good fit for sparse datase
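For option a), feature extraction, one possible route is to project the sparse matrix onto a few dense components and fit the forest on those. This is a sketch under assumptions: TruncatedSVD is not named in the thread, and the synthetic data and the choice of 30 components are purely illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor

# Synthetic sparse data standing in for the one-hot + TF-IDF matrix.
rng = np.random.RandomState(0)
X = sparse.random(400, 2000, density=0.05, format="csr", random_state=rng)
y = np.asarray(X[:, :10].sum(axis=1)).ravel()

# TruncatedSVD works on the sparse matrix directly and returns a dense
# array with n_components columns, small enough to hold in memory.
svd = TruncatedSVD(n_components=30, random_state=0)
X_reduced = svd.fit_transform(X)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_reduced, y)
```

New samples would go through svd.transform before being passed to forest.predict.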

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-23 Thread Brian Holt
At the moment your three options are
1) get more memory
2) do feature selection - 400k features on 200k samples seems to me to contain a lot of redundant information or irrelevant features
3) submit a PR to support sparse matrices - this is going to be a lot of work and I doubt it's worth it.
All t
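Option 2 could be sketched as follows: score the features on the sparse matrix, keep a small subset, and only then convert to a dense array for the forest. The univariate f_regression scoring, k=50 and the synthetic data are all assumptions for illustration, not something the thread prescribes.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic sparse data; only the first 10 columns carry signal.
rng = np.random.RandomState(0)
X = sparse.random(400, 2000, density=0.05, format="csr", random_state=rng)
y = np.asarray(X[:, :10].sum(axis=1)).ravel() + 0.01 * rng.randn(400)

# The selector operates on the sparse matrix and returns a sparse matrix
# with k columns.
selector = SelectKBest(f_regression, k=50)
X_small = selector.fit_transform(X, y)

# With only k columns left, a dense copy is cheap.
X_dense = X_small.toarray()
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_dense, y)
```

The value of k trades memory against information kept, so it is worth cross-validating rather than fixing up front.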

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-23 Thread Juan Nunez-Iglesias
@Alex: I don't have a workaround for you but this seems like a useful addition. I don't know how hard it would be, but you should definitely raise it as an issue on the github issues page for the project: https://github.com/scikit-learn/scikit-learn/issues?sort=updated&state=open On Wed, Apr 24,

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-23 Thread Calvin Morrison
get more memory?

On 23 April 2013 17:06, Alex Kopp wrote:
> Hi,
>
> I am looking to build a random forest regression model with a pretty large
> amount of sparse data. I noticed that I cannot fit the random forest model
> with a sparse matrix. Unfortunately, a dense matrix is too large to fit in

[Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-23 Thread Alex Kopp
Hi, I am looking to build a random forest regression model with a pretty large amount of sparse data. I noticed that I cannot fit the random forest model with a sparse matrix. Unfortunately, a dense matrix is too large to fit in memory. What are my options? For reference, I have just over 400k fe