Do they use the same value for the min_samples_split parameter? I see
they use a default value (hidden in their constructor I guess), but
theirs might not be the same as ours.
Gilles
On 28 November 2012 16:29, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
On 28.11.2012 16:19, Peter wrote:
Nope they don't...
On 28 November 2012 16:39, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
On 28.11.2012 16:33, Gilles Louppe wrote:
Do they use the same value for the min_samples_split parameter? I see
they use a default value (hidden in their constructor I guess), but
theirs might not be the same as ours.
Thanks a lot for the quick responses and the suggestions. Unfortunately,
rebuilding the model every time a new version comes out is not an option
for me.
Well then, from a very practical point of view, do you need to upgrade
at all? Your model won't be any more accurate because you update.
For Trees, you could subsample and train trees on different
subsets but not sure how well this works if the subsets
are only a small fraction of the whole dataset.
This often works surprisingly well :)
(both along examples and features)
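For instance, something along these lines (a rough sketch; the helper below is hypothetical, not scikit-learn API):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_subsampled_trees(X, y, n_trees=10, sample_frac=0.1, seed=0):
        # Train each tree on a random fraction of the examples; sampling
        # feature subsets per tree would work the same way on the columns.
        rng = np.random.RandomState(seed)
        n_samples = X.shape[0]
        trees = []
        for _ in range(n_trees):
            idx = rng.choice(n_samples,
                             size=max(1, int(sample_frac * n_samples)),
                             replace=False)
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees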
Hi Paul,
a) Scaling has no effect on decision trees.
b) You shouldn't set max_depth=5. Instead, build fully developed trees
(max_depth=None), or better, tune min_samples_split using
cross-validation.
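For example, a minimal sketch of that tuning (module paths are those of current scikit-learn; in 2012-era releases GridSearchCV lived in sklearn.grid_search):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(
        DecisionTreeClassifier(max_depth=None),  # fully developed trees
        {"min_samples_split": [2, 5, 10, 20, 50]},
        cv=5)
    search.fit(X, y)
    print(search.best_params_)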
Hope this helps.
Gilles
On 6 November 2012 16:21, paul.czodrow...@merckgroup.com wrote:
Hi,
I know the speaker at pydata today claimed that the features are
partitioned,
Can you elaborate? If you pick your features prior to the construction
of the tree and then build it on that subset only, then indeed, this
is not a random forest. That algorithm is called Random Subspaces.
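For illustration, a minimal sketch of Random Subspaces (the helper below is hypothetical, not a scikit-learn API):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_subspaces(X, y, n_trees=10, n_sub=5, seed=0):
        # Each tree is grown on a feature subset fixed *before*
        # construction, unlike a random forest, which re-draws candidate
        # features at every split.
        rng = np.random.RandomState(seed)
        models = []
        for _ in range(n_trees):
            feats = rng.choice(X.shape[1], size=n_sub, replace=False)
            models.append((feats,
                           DecisionTreeClassifier().fit(X[:, feats], y)))
        return models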
Best,
Hi Siddhant,
This is not yet supported unfortunately.
Best,
Gilles
On 15 October 2012 17:50, Siddhant Goel siddhantg...@gmail.com wrote:
Hi people,
Does scikit-learn support plugging in user-defined classifiers in its
ensemble learning framework? I went through the documentation but could
Hi Team,
Given the increasing maturity of the project, we have decided (or,
more precisely, I convinced my advisor :-)) to use Scikit-Learn in the
machine learning course given at my university. Our objective is to
have our students use Scikit-Learn for three assignments. We were
previously using
Hi,
The ensemble classes handle the problem you describe already. Have a look
at the implementation of predict_proba of BaseForestClassifier in
ensemble.py if you want to do that yourself by hand.
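In essence, it boils down to something like this (a sketch of the averaging, not the actual forest code):

    import numpy as np

    def average_proba(estimators, X):
        # estimators: already-fitted classifiers sharing the same
        # classes_ ordering; their predicted probabilities are averaged.
        return np.mean([est.predict_proba(X) for est in estimators], axis=0)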
Hope this helps.
Gilles
On Wednesday, 26 September 2012, Mathieu Blondel math...@mblondel.org wrote:
@Doug: Sorry I was typing my previous response from my phone.
The snippet of code that I was talking about can be found at:
https://github.com/glouppe/scikit-learn/blob/master/sklearn/ensemble/forest.py#L93
Cheers,
Gilles
On Wednesday, 26 September 2012, Gilles Louppe g.lou...@gmail.com wrote:
I'm basically looking to take pre-trained classifiers and combine
the predicted probabilities in custom ways, like favoring some
classifiers over others, etc.
Not that RandomForests™ are not useful--they could be the building
block classifiers in such a system.
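A rough sketch of that kind of custom combination (hypothetical helper, assuming all classifiers share the same classes_ ordering):

    import numpy as np

    def weighted_proba(estimators, weights, X):
        # Favor some classifiers over others via per-estimator weights.
        w = np.asarray(weights, dtype=float)
        w /= w.sum()
        return sum(wi * est.predict_proba(X)
                   for wi, est in zip(w, estimators))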
@Olivier's
Hi Christian,
The score method does not play any role in fit.
Are you sure the RF classifier is the same in both cases? (Have you set
the random state to the same value?)
Can you provide some code in any case?
Thanks,
Gilles
On 21 September 2012 20:45, Christian Jauvin cjau...@gmail.com wrote:
+1
On 2 September 2012 14:16, Alexandre Gramfort
alexandre.gramf...@inria.fr wrote:
Sounds good to me, especially since you volunteer to do it :)
Alex
On Sun, Sep 2, 2012 at 2:10 PM, Andreas Mueller
amuel...@ais.uni-bonn.de wrote:
Hey everybody.
I noticed in the last couple of months that
Hi Peter,
At least we are better than Weka! More seriously, this indeed shows
that there is still a lot of work to do... :(
Gilles
On 27 August 2012 09:06, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
Hi folks,
I just stumbled upon this benchmark comparing wiserf, R randomForest,
Hi,
I am indeed leaving for holiday very soon and will be disconnected
until mid-August.
My personal wish list is short:
- #986: A full lazy argsort implementation of the tree construction algorithm.
- #941: Tree post-pruning
I plan to work on both when I return. #941 shouldn't take much time,
Hi,
What version of scikit-learn are you using? 0.11 or dev?
Best,
Gilles
On 19 July 2012 06:34, Shankar Satish mailsh...@yahoo.co.in wrote:
Hello everyone,
I have a custom prediction class which in fact consists of a random forest
regressor+classifier. The class implements a fit() method,
Since it's in Brussels, I think I should be there as well :)
I can also help with something around scikit-learn if needed.
Gilles
On 30 March 2012 10:31, Vincent Michel vm.mic...@gmail.com wrote:
I think that I will be there too.
2012/3/30 Alexandre Gramfort alexandre.gramf...@inria.fr
Hi,
I am running the tests again, but indeed I think the difference in the
results comes from the fact that max_features=sqrt(n_features) is now
the default, whereas it was max_features=n_features before.
Gilles
On 27 March 2012 11:56, Paolo Losi paolo.l...@gmail.com wrote:
Thanks Peter,
On Tue,
Hi Olivier,
The higher the number of estimators, the better. The more random the
trees (e.g., the lower max_features), the more important it usually is
to have a large forest to decrease the variance. To me, 10 is actually
a very low default value. In my daily research, I deal with hundreds
of trees.
Hi Satrajit,
Adding more trees should never hurt accuracy. The more, the better.
Since you have a lot of irrelevant features, I would advise increasing
max_features in order to capture the relevant features when computing
the random splits. Otherwise, your trees will indeed fit on noise.
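For example (the numbers are hypothetical; a float max_features is interpreted as a fraction of the features):

    from sklearn.ensemble import RandomForestClassifier

    # A large forest, with half of the features examined at each split.
    rf = RandomForestClassifier(n_estimators=500, max_features=0.5)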
Best,
Hi,
You can inject your fit params using the `fit_params` parameter in GridSearchCV.
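For example, a sketch (with current scikit-learn the keywords go straight to grid.fit and are forwarded to the estimator's fit; in the 2012-era API they went into the fit_params constructor argument; the sample weights below are hypothetical):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    w = np.ones(len(y))  # hypothetical per-sample weights
    grid = GridSearchCV(SVC(), {"C": [1.0, 10.0]}, cv=3)
    grid.fit(X, y, sample_weight=w)  # forwarded to SVC.fit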
Gilles
On 3 February 2012 13:59, Mathias Verbeke mathi...@gmail.com wrote:
Hi Andreas,
You would have to add it to the fit method of SVC, not GridSearchCV.
How can this be done in the digits example,
Yes indeed, as I said at the time, much of the forest code could be
reused to implement a pure averaging meta-estimator.
The main thing that makes BaseForest tree-specific is that it
precomputes X_argsorted, so that it is computed only once for all
trees, and injects it into the fit method of each tree.
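Conceptually, the precomputation amounts to this (a sketch of the old mechanism, not the current code):

    import numpy as np

    X = np.random.rand(100, 5)
    # Computed once and shared by every tree in the forest: the
    # column-wise sorted ordering of the samples.
    X_argsorted = np.asfortranarray(np.argsort(X, axis=0).astype(np.int32))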
Yep, I think that your solution would work, Olivier. I am busy this
weekend but I can push a first draft of this refactoring by the
beginning of next week.
Gilles
On Saturday, 21 January 2012, Olivier Grisel olivier.gri...@ensta.org
wrote:
2012/1/20 Andreas amuel...@ais.uni-bonn.de:
It is converted to Fortran order for efficiency reasons. The most
frequently repeated and most time-consuming operation is the search for
split thresholds, which is performed column-wise, hence the Fortran
ordering.
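A quick illustration of the point:

    import numpy as np

    X = np.random.rand(1000, 50)   # C (row-major) order by default
    Xf = np.asfortranarray(X)      # column-major copy
    print(X[:, 3].flags["C_CONTIGUOUS"])   # False: strided access
    print(Xf[:, 3].flags["C_CONTIGUOUS"])  # True: the column is contiguous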
Gilles
On 10 January 2012 09:39, Andreas amuel...@ais.uni-bonn.de wrote:
Hey everybody.
Looking a
Well, not everyone is using modern architectures ;)
On 10 January 2012 10:43, Andreas amuel...@ais.uni-bonn.de wrote:
On 01/10/2012 10:22 AM, Gilles Louppe wrote:
@both: This might be a stupid question, but is there really so much
difference between indexing contiguously and indexing with a stride
over a C
The current code works great for me (thanks for contributing), but it
would still mean a lot if I could make it even faster. At the moment
it takes me about 8 hours to grow a tree with only a subset of the
features that I actually want to use. I have a 128-core cluster here,
but then
Hi Andras,
Try setting min_split=10 or higher. With a dataset of that size, there
is no point in using min_split=1: you will 1) indeed consume too much
memory and 2) overfit.
Gilles
PS: I have just started to change to doc. Expect a PR later today :)
On 3 January 2012 09:27, Andreas
Thanks! Will try that.
Also thanks for working on the docs! :)
Cheers,
Andy
On 01/03/2012 09:30 AM, Gilles Louppe wrote:
Hi Andras,
Try setting min_split=10 or higher. With a dataset of that size, there
is no point in using min_split=1: you will 1) indeed consume too much
memory and 2) overfit.
Hi Andy!
1)
The narrative docs say that max_features=n_features is a good value
for RandomForests. As far as I know, Breiman 2001 suggests
max_features = log_2(n_features). I also saw a claim that Breiman 2001
suggests max_features = sqrt(n_features), but I couldn't find that in
the paper.
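For reference, the three conventions discussed above map onto the max_features parameter like this (string values as accepted by current scikit-learn):

    from sklearn.ensemble import RandomForestClassifier

    rf_all  = RandomForestClassifier(max_features=None)    # all n_features
    rf_sqrt = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features)
    rf_log2 = RandomForestClassifier(max_features="log2")  # log2(n_features)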
I just
Hi list,
This is a call to get an additional person (or more) to review the
pending PR #491 on parallel forest of trees.
It has already been reviewed by @ogrisel and looks ready to merge to
both of us, but an additional review would be more than welcome!
It seems to be an interesting tool to me. We need to find a
non-trivial overfitting example that would run in an acceptable time
with the datasets available in the scikit.
Actually, those curves can be plotted with respect to any parameter,
not only the training set size.
What comes to mind is to
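With today's API, such a curve could be sketched as follows (validation_curve did not exist at the time; digits is just a stand-in dataset):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import validation_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_digits(return_X_y=True)
    train_scores, test_scores = validation_curve(
        DecisionTreeClassifier(), X, y,
        param_name="max_depth", param_range=[1, 2, 4, 8, 16], cv=5)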
I suggest that we use the following conventions:
* PRs that are not ready to be merged should be named 'WIP: ...' (for
'Work In Progress')
* PRs that are ready to be merged, or more accurately, for which the
contributors feel that they are ready to be merged, should be renamed
to 'MRG: ...'
Hi list,
During the sprint, I plan to review @pprett pull request on Gradient
Tree Boosting. It is also my intention to implement parallel
construction and prediction of forest of trees.
I also have some ideas concerning the tree module, like computing
variable importance (which is already
Hi Gael,
Actually, it would be great if everybody who showed interest in the past,
or who is now interested could send me an email so that I have a clear
view of who is coming when, to make the bookings.
Due to my change of plans, I will arrive at NIPS on Thursday 15. I
will come at the
Hi list,
Just to let the admins know.
For a few days now, I have been trying to access our buildbot web
page (http://buildbot.afpy.org/scikit-learn/), but the service always
seems to be unavailable.
Gilles
Good job Nelle! Thank you :)
Gilles
On 24 November 2011 21:17, Olivier Grisel olivier.gri...@ensta.org wrote:
2011/11/24 Nelle Varoquaux nelle.varoqu...@gmail.com:
The buildbot is back online !
Thanks!
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Hi list,
I would like to ask for comments on the forests of randomized trees
pull request that I have been working on for the past few weeks. I
think it is ready for merge.
This pull request is the first in scikit-learn to concern ensemble
methods and includes two important tree-based algorithms
Upgrading Sphinx seems to solve the problem of missing docstring
references for functions; it should already be in the webpage.
Great!
Gilles
Booking a guest house is a great idea! But do you intend to book such
a house for NIPS and the sprint, or for the sprint only? In
particular, I was concerned about commuting to the Sierra Nevada
during the workshops if the house was in Granada.
Gilles
On 5 November 2011 19:05, Olivier Grisel
What I have in mind is to have the house for NIPS and for the sprint, but
to have a gap in between during the workshop.
We are going to call them today, so if you want in for one or both of the
periods, please keep us posted.
I am in, for both periods!
Thanks
Gilles
I have just submitted a PR to Brian's branch :)
On 4 November 2011 11:13, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
Gilles,
I was not aware of your work in _tree.pyx. Looks great! Still, I
didn't touch any line in `find_best_split`, so the merging/rebase
should be quite straightforward.
ranks = np.argsort(np.sum(estimator.coef_ ** 2, axis=0))
My question is: why is the sum of the squared weight matrix used?
What is the logic behind it?
This is used for handling estimators that assign several weights to
the same feature. Indeed, if several weights are assigned to each
feature (one per class, for instance), they must be combined into a
single per-feature score before ranking.
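A small illustration (LinearSVC on iris is my choice here, not necessarily the original setting):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    est = LinearSVC().fit(X, y)
    print(est.coef_.shape)  # (3, 4): one weight vector per class
    ranks = np.argsort(np.sum(est.coef_ ** 2, axis=0))
    print(ranks)  # feature indices, from least to most important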
import cPickle
from datetime import datetime

for i in range(20):
    with open("forest%d.pkl" % i, "rb") as f:
        start = datetime.now()
        a = cPickle.load(f)
        print "loaded", i, datetime.now() - start
produces these run-time results:
loaded 0 0:00:14.952436
loaded 1