Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Vlad Niculae
On Jan 3, 2012, at 17:02, Olivier Grisel wrote:
> 2012/1/3 Lars Buitinck :
>>
>>> We probably need to extend the sklearn.feature_extraction.text package
>>> to make it more user friendly to work with occurrences of pure
>>> categorical features:
>>
>> I'm not sure this belongs in feature_ext

Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Olivier Grisel
2012/1/3 Lars Buitinck :
>
>> We probably need to extend the sklearn.feature_extraction.text package
>> to make it more user friendly to work with occurrences of pure
>> categorical features:
>
> I'm not sure this belongs in feature_extraction.text; it's much more
> broadly applicable.
>
> If you

Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Lars Buitinck
2011/12/30 Olivier Grisel :
> Alright, then the name for this kind of feature is "categorical
> features" in machine learning jargon: the string is used as an
> identifier and the ordered sequence of letters is not exploited by the
> model. In contrast, "string features" means something very spe
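
As an illustration, a minimal sketch of that distinction: each string value becomes an identifier with its own indicator column, and the sequence of letters is never looked at. It uses DictVectorizer, which landed in sklearn.feature_extraction in releases after this thread; the car-themed values are made up.

    from sklearn.feature_extraction import DictVectorizer

    samples = [{"make": "BMW", "color": "red"},
               {"make": "Audi", "color": "red"},
               {"make": "BMW", "color": "blue"}]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(samples)
    # One indicator column per (feature, value) pair, e.g.
    # ['color=blue', 'color=red', 'make=Audi', 'make=BMW']
    print(vec.get_feature_names_out())   # get_feature_names() in older releases
    print(X)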

Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Lars Buitinck
2011/12/30 Bronco Zaurus :
> One more way would be computing the classification probability for each
> value and plugging the resulting number back into the data. For example,
> let's say there are 10 samples with BMW in the training set, and 3 of them
> are 1 (true), 7 are 0 (false). So the maximum likeli
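
A small sketch of this idea (nowadays usually called target or likelihood encoding), with made-up counts matching the BMW example above:

    from collections import defaultdict

    makes = ["BMW"] * 10 + ["Audi"] * 4
    y     = [1] * 3 + [0] * 7 + [1] * 2 + [0] * 2   # 3/10 BMWs are 1, 2/4 Audis are 1

    counts = defaultdict(lambda: [0, 0])            # value -> [n_positive, n_total]
    for make, label in zip(makes, y):
        counts[make][0] += label
        counts[make][1] += 1

    # Maximum-likelihood estimate of P(y=1 | value); BMW -> 3/10 = 0.3 here.
    encoding = {v: pos / total for v, (pos, total) in counts.items()}
    encoded_column = [encoding[m] for m in makes]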

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Peter Prettenhofer
2012/1/3 Andreas :
> Hi.
> I just switched to DecisionTreeClassifier to make analysis easier.
> There should be no joblib there, right?

Correct.

> One thing I noticed is that there is often
> ``np.argsort(X.T, axis=1).astype(np.int32)``
> which always does a copy.

This is done once at the beginn

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Andreas
I noticed one more thing in the random forest code: the random forest averages the probabilities in the leaves. This is in contrast to Breiman 2001, where trees vote with hard class decisions, AFAIK. As far as I can tell, that is not documented. Has anyone tried both methods? And @glouppe: why did
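
For reference, a tiny sketch (with made-up per-tree probabilities) of the two aggregation rules being contrasted, showing that they can disagree:

    import numpy as np

    # P(class=1) predicted by each of three trees for one sample, i.e. the
    # class frequencies stored in the leaf that the sample falls into.
    tree_proba = np.array([0.9, 0.4, 0.45])

    soft = tree_proba.mean()           # average the probabilities: 0.58 -> class 1
    hard = (tree_proba > 0.5).mean()   # Breiman-style vote: 1 of 3 trees -> class 0
    print(soft, hard)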

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Andreas
Hi. I just switched to DecisionTreeClassifier to make analysis easier. There should be no joblib there, right? One thing I noticed is that there is often ``np.argsort(X.T, axis=1).astype(np.int32)`` which always does a copy. Still, as these should be garbage collected, I don't really see where all
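
A back-of-the-envelope sketch (assuming the 60,000 x 786 shape quoted elsewhere in this thread) of why that presorting line is expensive: argsort allocates a platform-int (usually int64) index array the size of X, and the .astype(np.int32) call then allocates a second, separate copy.

    import numpy as np

    n_samples, n_features = 60000, 786
    print("argsort result (int64):", n_samples * n_features * 8 / 1e6, "MB")  # ~377 MB
    print("int32 copy:            ", n_samples * n_features * 4 / 1e6, "MB")  # ~189 MB

    # astype() always returns a fresh allocation, not a view:
    X = np.random.rand(5, 3).astype(np.float32)
    order32 = np.argsort(X.T, axis=1).astype(np.int32)
    assert order32.base is None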

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Peter Prettenhofer
2012/1/3 Gael Varoquaux :
> On Tue, Jan 03, 2012 at 09:52:28AM +0100, Peter Prettenhofer wrote:
>> I just checked DecisionTreeClassifier - it basically requires the same
>> amount of memory for its internal data structures (= `X_sorted`, which
>> is also 60,000 x 786 x 4 bytes).
>
> Would it be an op

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Gael Varoquaux
On Tue, Jan 03, 2012 at 09:52:28AM +0100, Peter Prettenhofer wrote:
> I just checked DecisionTreeClassifier - it basically requires the same
> amount of memory for its internal data structures (= `X_sorted`, which
> is also 60,000 x 786 x 4 bytes).

Would it be an option to allow sorting in place to
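
In NumPy terms, the in-place option would look like the sketch below: np.sort returns a full copy, while ndarray.sort() reorders the existing buffer. (np.argsort itself has no in-place variant, since its result is a separate index array.)

    import numpy as np

    X = np.random.rand(1000, 50).astype(np.float32)

    X_sorted = np.sort(X, axis=0)   # allocates a second array the size of X
    X.sort(axis=0)                  # sorts X itself, no extra allocation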

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Peter Prettenhofer
Hi, I just checked DecisionTreeClassifier - it basically requires the same amount of memory for its internal data structures (= `X_sorted`, which is also 60,000 x 786 x 4 bytes). I haven't checked RandomForest, but you have to make sure that joblib does not fork a new process. If so, the new process

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Andreas
Thanks for the help, Peter! Since I guess the memory error happens before the actual construction of the tree, `min_density=1` didn't help. I'll try moving to another box, but when I have time I'll try to dig more into the code. Cheers, Andy

On 01/03/2012 09:41 AM, Peter Prettenhofer wrote:
> Hi Andy,

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Gilles Louppe
Note also that when using bootstrap=True, copies of X have to be created for each tree. But this should work anyway since you only build 1 tree... Hmmm. Gilles

On 3 January 2012 09:41, Peter Prettenhofer wrote:
> Hi Andy,
>
> I'll investigate the issue with an artificial dataset of comparable
>
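
A hedged sketch of the two memory-related knobs from this subthread, written with current parameter names: n_jobs=1 keeps everything in one process so joblib does not hand the data to workers, and bootstrap=False avoids the per-tree resampled copies of X in the implementation discussed here (it also changes the model, since there is no bagging).

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=1,   # the single-tree forest from the thread
                                 bootstrap=False,  # no per-tree resampled copy of X
                                 n_jobs=1)         # single process, no joblib workers
    # clf.fit(X, y)   # X, y being the MNIST-sized arrays discussed above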

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Peter Prettenhofer
Hi Andy, I'll investigate the issue with an artificial dataset of comparable size - to be honest, I suspect that we focused on speed at the cost of memory usage... As a quick fix you could set `min_density=1`, which will result in fewer memory copies at the cost of runtime. Best, Peter

2012/1/3 A

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Andreas
I also get the same error when using max_depth=1. It's here:

  File "/home/amueller/checkout/scikit-learn/sklearn/tree/tree.py", line 357, in _build_tree
    np.argsort(X.T, axis=1).astype(np.int32).T)

The parameters of my forest are: RandomForestClassifier(bootstrap=True, compute_importances=False,

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Andreas
Hi Gilles. Thanks! Will try that. Also thanks for working on the docs! :) Cheers, Andy

On 01/03/2012 09:30 AM, Gilles Louppe wrote:
> Hi Andras,
>
> Try setting min_split=10 or higher. With a dataset of that size, there
> is no point in using min_split=1: you will 1) indeed consume too much
> m

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Gilles Louppe
Hi Andras,

Try setting min_split=10 or higher. With a dataset of that size, there is no point in using min_split=1: you will 1) indeed consume too much memory and 2) overfit.

Gilles

PS: I have just started to change the doc. Expect a PR later today :)

On 3 January 2012 09:27, Andreas wrote:
> H
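
A sketch of this suggestion; the parameter was called min_split in the scikit-learn version under discussion and min_samples_split (used below) in later releases. Requiring more samples per split keeps the tree from growing one leaf per training point, which limits both memory use and overfitting.

    from sklearn.ensemble import RandomForestClassifier

    # Do not split nodes that hold fewer than 10 samples.
    clf = RandomForestClassifier(n_estimators=1, min_samples_split=10)
    # clf.fit(X, y)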

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Andreas
Hi Brian. The dataset itself is 60,000 * 786 * 8 bytes (I converted from uint8 to float, which is 8 bytes in NumPy I guess), which is ~360 MB (also, I can load it ;). I trained linear SVMs and neural networks without much trouble. I haven't really studied the decision tree code (which I know you mad
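
A quick check of that arithmetic (using the 60,000 x 786 shape quoted earlier in the thread): the uint8-to-float64 conversion multiplies the footprint by eight.

    import numpy as np

    n_samples, n_features = 60000, 786
    print(n_samples * n_features * 1 / 1e6, "MB as uint8")    # ~47 MB
    print(n_samples * n_features * 8 / 1e6, "MB as float64")  # ~377 MB, i.e. ~360 MiB

    X8 = np.zeros((100, n_features), dtype=np.uint8)
    X64 = X8.astype(np.float64)   # 8x more bytes per element than uint8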

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread bdholt1
Hi Andy, IIRC MNIST is 60,000 samples, each with dimension 28x28, so the 2GB limit doesn't seem unreasonable (especially since you don't have all of that at your disposal). Does the dataset fit in mem? Brian

-----Original Message-----
From: Andreas
Date: Tue, 03 Jan 2012 09:00:47
To: Reply-

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Andreas
One other question: I tried to run a forest on MNIST that actually consisted of only one tree. That gave me a memory error. I only have 2 GB of RAM in this machine (this is my desktop at IST Austria!?), which is obviously not that much. Still, this kind of surprised me. Is it expected that a tree takes