2012/1/3 Lars Buitinck :
>
>> We probably need to extend the sklearn.feature_extraction.text package
>> to make it more user-friendly to work with pure categorical
>> feature occurrences:
>
> I'm not sure this belongs in feature_extraction.text; it's much more
> broadly applicable.
>
> If you
2011/12/30 Olivier Grisel :
> Alright, then the name for this kind of feature is "categorical
> features" in machine learning jargon: the string is used as an
> identifier and the ordered sequence of letters is not exploited by the
> model. By contrast, "string features" means something very specific.
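A minimal sketch of that treatment: each distinct value becomes its own
indicator column, so the model only sees the identity of a value, never
its letters. This uses DictVectorizer from sklearn.feature_extraction;
the feature names and data are invented for illustration:

    from sklearn.feature_extraction import DictVectorizer

    samples = [
        {"make": "BMW", "body": "sedan"},
        {"make": "Audi", "body": "wagon"},
        {"make": "BMW", "body": "wagon"},
    ]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(samples)
    print(vec.get_feature_names_out())
    # ['body=sedan' 'body=wagon' 'make=Audi' 'make=BMW']
    print(X)
    # [[1. 0. 0. 1.]
    #  [0. 1. 1. 0.]
    #  [0. 1. 0. 1.]]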
2011/12/30 Bronco Zaurus :
> One more way would be computing the classification probability for each
> value and plugging the resulting number back into the data. For example,
> let's say there are 10 samples with BMW in the training set, and 3 of them
> are 1 (true), 7 are 0 (false). So the maximum likelihood estimate for BMW
> would be 3/10 = 0.3.
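That scheme is what is nowadays usually called target (or likelihood)
encoding. A minimal numpy sketch, with the helper name being mine:

    import numpy as np

    def target_encode(values, y):
        """Map each categorical value to the empirical P(y=1 | value)."""
        values = np.asarray(values)
        y = np.asarray(y, dtype=float)
        mapping = {v: y[values == v].mean() for v in np.unique(values)}
        return np.array([mapping[v] for v in values]), mapping

    # The BMW example above: 3 positives out of 10 samples -> 0.3.
    makes = ["BMW"] * 10 + ["Audi"] * 4
    y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 0]
    encoded, mapping = target_encode(makes, y)
    print(mapping)  # {'Audi': 0.75, 'BMW': 0.3}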
2012/1/3 Andreas :
> Hi.
> I just switched to DecisionTreeClassifier to make analysis easier.
> There should be no joblib there, right?
correct.
> One thing I noticed is that there is often
> ``np.argsort(X.T, axis=1).astype(np.int32)``
> which always does a copy.
This is done once at the beginning of tree construction.
I noticed one more thing in the random forest code:
The random forest averages the probabilities in the leaves.
This is in contrast to Breiman 2001, where trees vote with
hard class decisions afaik.
As far as I can tell, that is not documented.
Has anyone tried both methods? And @glouppe:
why did you choose averaging over voting?
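For clarity, here are the two aggregation schemes side by side, sketched
over the stacked `predict_proba` outputs of the individual trees (the
helper names are mine, not the library's internals):

    import numpy as np

    # probas: shape (n_trees, n_samples, n_classes).

    def soft_vote(probas):
        """Average the trees' class probabilities, then take the argmax
        (the averaging behaviour described above)."""
        return probas.mean(axis=0).argmax(axis=1)

    def hard_vote(probas):
        """Each tree casts one vote for its most probable class and the
        majority wins (Breiman-2001-style voting)."""
        votes = probas.argmax(axis=2)  # (n_trees, n_samples)
        n_classes = probas.shape[2]
        return np.array([np.bincount(votes[:, j], minlength=n_classes).argmax()
                         for j in range(votes.shape[1])])

    # The two can disagree: one very confident tree can outweigh two
    # mildly confident ones under averaging, but not under voting.
    probas = np.array([[[0.9, 0.1]], [[0.4, 0.6]], [[0.4, 0.6]]])
    print(soft_vote(probas), hard_vote(probas))  # [0] [1]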
Hi.
I just switched to DecisionTreeClassifier to make analysis easier.
There should be no joblib there, right?
One thing I noticed is that there is often
``np.argsort(X.T, axis=1).astype(np.int32)``
which always does a copy.
Still, as these should be garbage collected, I don't really see
where all the memory goes.
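Both steps of that expression do allocate fresh arrays, which is easy to
verify on small shapes (at 60,000 x 786 the intermediate int64 result
alone is ~360 MiB):

    import numpy as np

    X = np.random.rand(1000, 50)

    idx64 = np.argsort(X.T, axis=1)  # new int64 index array (first allocation)
    idx32 = idx64.astype(np.int32)   # astype copies again (second allocation)

    print(idx64.nbytes, idx32.nbytes)  # 400000 200000
    print(idx32.base is None)          # True: a fresh buffer, not a view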
2012/1/3 Gael Varoquaux :
> On Tue, Jan 03, 2012 at 09:52:28AM +0100, Peter Prettenhofer wrote:
>> I just checked DecisionTreeClassifier - it basically requires the same
>> amount of memory for its internal data structures (= `X_sorted`, which
>> is also 60,000 x 786 x 4 bytes).
>
> Would it be an option to allow sorting in place to save memory?
On Tue, Jan 03, 2012 at 09:52:28AM +0100, Peter Prettenhofer wrote:
> I just checked DecisionTreeClassifier - it basically requires the same
> amount of memory for its internal data structures (= `X_sorted`, which
> is also 60,000 x 786 x 4 bytes).
Would it be an option to allow sorting in place to save memory?
Hi,
I just checked DecisionTreeClassifier - it basically requires the same
amount of memory for its internal data structures (= `X_sorted`, which
is also 60,000 x 786 x 4 bytes). I haven't checked RandomForest, but
you have to make sure that joblib does not fork a new process. If so,
the new process gets its own copy of the data.
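Tallying those numbers shows how quickly 2 GB gets tight; a
back-of-the-envelope sketch (exact internals may differ):

    MiB = 2 ** 20
    n_samples, n_features = 60000, 786

    X_float64 = n_samples * n_features * 8 / MiB    # the dataset itself
    X_argsorted = n_samples * n_features * 4 / MiB  # the int32 `X_sorted`
    X_bootstrap = X_float64                         # per-tree copy (bootstrap=True)

    print(round(X_float64), round(X_argsorted), round(X_bootstrap))
    # 360 180 360 -> ~900 MiB, before counting the ~360 MiB int64
    # temporary from np.argsort or anything a forked worker duplicates.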
Thanks for the help, Peter!
As the memory error seems to happen before the actual construction
of the tree, `min_density=1` didn't help.
I'll try going to another box, but when I have time I'll try
to dig more into the code.
Cheers,
Andy
On 01/03/2012 09:41 AM, Peter Prettenhofer wrote:
> Hi Andy,
Note also that when using bootstrap=True, copies of X have to be
created for each tree.
But this should work anyway since you only build 1 tree... Hmmm.
Gilles
On 3 January 2012 09:41, Peter Prettenhofer wrote:
> Hi Andy,
>
> I'll investigate the issue with an artificial dataset of comparable
>
Hi Andy,
I'll investigate the issue with an artificial dataset of comparable
size - to be honest I suspect that we focused on speed at the cost of
memory usage...
As a quick fix you could set `min_density=1`, which will result in fewer
memory copies at the cost of runtime.
best,
Peter
2012/1/3 Andreas :
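Concretely, the quick fix is just a constructor argument. Note that
`min_density` is 2012-era API that has since been removed from
scikit-learn, so this sketch only runs against a version of that era:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X_train = np.random.rand(100, 5)
    y_train = np.random.randint(0, 2, size=100)

    clf = DecisionTreeClassifier(min_density=1)  # the suggested quick fix
    clf.fit(X_train, y_train)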
I also get the same error when using max_depth=1.
It's here:
File "/home/amueller/checkout/scikit-learn/sklearn/tree/tree.py", line
357, in _build_tree
np.argsort(X.T, axis=1).astype(np.int32).T)
The parameters of my forest are:
RandomForestClassifier(bootstrap=True, compute_importances=False,
Hi Gilles.
Thanks! Will try that.
Also thanks for working on the docs! :)
Cheers,
Andy
On 01/03/2012 09:30 AM, Gilles Louppe wrote:
> Hi Andras,
>
> Try setting min_split=10 or higher. With a dataset of that size, there
> is no point in using min_split=1; you will 1) indeed consume too much
> memory and 2) overfit.
Hi Andras,
Try setting min_split=10 or higher. With a dataset of that size, there
is no point in using min_split=1; you will 1) indeed consume too much
memory and 2) overfit.
Gilles
PS: I have just started changing the doc. Expect a PR later today :)
On 3 January 2012 09:27, Andreas wrote:
> Hi Brian.
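In code, assuming a modern scikit-learn (where `min_split` has long since
been renamed `min_samples_split`):

    from sklearn.ensemble import RandomForestClassifier

    # 2012-era spelling was RandomForestClassifier(min_split=10); the
    # same constraint is now called min_samples_split.
    clf = RandomForestClassifier(min_samples_split=10)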
Hi Brian.
The dataset itself is 60,000 * 786 * 8 bytes (I converted from uint8 to
float, which is 8 bytes in NumPy, I guess),
which is ~360 MB (also, I can load it ;).
I trained linear SVMs and neural networks without much trouble. I
haven't really studied the
decision tree code (which I know you mad
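For the record, the arithmetic behind that ~360 MB:

    # 60,000 samples x 786 features x 8 bytes per float64:
    print(60000 * 786 * 8 / 2 ** 20)  # ~359.8 MiB, i.e. the quoted ~360 MB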
Hi Andy,
IIRC MNIST is 60,000 samples, each with dimension 28x28, so the 2 GB limit
doesn't seem unreasonable (especially since you don't have all of that at
your disposal). Does the dataset fit in memory?
Brian
One other question:
I tried to run a forest on MNIST that actually consisted of only one tree.
That gave me a memory error. I only have 2 GB RAM in this machine
(this is my desktop at IST Austria!?), which is obviously not that much.
Still, this kind of surprised me. Is it expected that a tree takes
that much memory?