Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-04 Thread Dale T Smith
via scikit-learn Sent: Tuesday, October 4, 2016 6:44 AM To: Scikit-learn user and developer mailing list Cc: Ibrahim Dalal Subject: Re: [scikit-learn] Random Forest with Bootstrapping Hi, So why is using a bootstrap sample of size n better than just a random set of size 0.62*n

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-04 Thread Ibrahim Dalal via scikit-learn
Hi, So why is using a bootstrap sample of size n better than just a random set of size 0.62*n in Random Forest? Thanks On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka wrote: > Originally, this technique was used to estimate a sampling > distribution. Think of

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
Originally, this technique was used to estimate a sampling distribution. Think of drawing with replacement as a workaround for generating *new* data from a population, which is simulated by repeatedly sampling from the given dataset with replacement. For more details, I’d recommend
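
To make the "estimate a sampling distribution" idea concrete, here is a minimal sketch that bootstraps the standard error of a sample mean using only the Python standard library. The dataset, its size, and the number of resamples (`n_boot`) are made-up values for illustration, not anything from the thread.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical observed dataset: 50 values standing in for real measurements.
data = [random.gauss(10.0, 2.0) for _ in range(50)]

n_boot = 2000  # assumed number of bootstrap resamples
boot_means = []
for _ in range(n_boot):
    # Draw len(data) values from the data WITH replacement.
    resample = random.choices(data, k=len(data))
    boot_means.append(statistics.mean(resample))

# The spread of the resampled means approximates the sampling
# distribution of the mean.
boot_se = statistics.stdev(boot_means)
analytic_se = statistics.stdev(data) / len(data) ** 0.5

print(f"bootstrap standard error: {boot_se:.3f}")
print(f"textbook s/sqrt(n):       {analytic_se:.3f}")
```

The two printed numbers should be close, which is the sense in which resampling the data stands in for drawing fresh samples from the population.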

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Ibrahim Dalal via scikit-learn
So what is the point of having duplicate entries in your training set? This seems like pure overhead. Sorry, but you will again have to help me here. On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka wrote: > > Hi, > > > > That helped a lot. Thank you very much. I have

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
> Hi,
>
> That helped a lot. Thank you very much. I have one more (silly?) doubt though.
>
> Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since 32 of the original

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Ibrahim Dalal via scikit-learn
Hi, That helped a lot. Thank you very much. I have one more (silly?) doubt though. Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since 32 of the original samples are left
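
The situation described in this question can be checked with a small simulation. The sketch below assumes a dataset of 100 rows represented only by their indices, draws one bootstrap sample B of size 100, and counts distinct, repeated, and left-out rows. The variable names are illustrative only.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

n = 100
dataset = list(range(n))                   # stand-in for 100 training rows
bootstrap = random.choices(dataset, k=n)   # n draws with replacement

counts = Counter(bootstrap)                # how often each row was drawn
n_unique = len(counts)                     # rows that appear at least once
n_repeated = sum(1 for c in counts.values() if c > 1)
n_left_out = n - n_unique                  # rows that never appear in B

print("distinct rows in B:       ", n_unique)
print("rows drawn more than once:", n_repeated)
print("rows left out of B:       ", n_left_out)
```

So yes, B contains repeated entries: it still has 100 entries in total, but the slots that would have held the left-out rows are filled by duplicates of other rows.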

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
Or maybe more intuitively, you can visualize this asymptotic behavior e.g., via

    import matplotlib.pyplot as plt

    vs = []
    for n in range(5, 201, 5):
        v = 1 - (1. - 1./n)**n
        vs.append(v)

    plt.plot([n for n in range(5, 201, 5)], vs, marker='o', markersize=6, alpha=0.5,)

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
Say the probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample is P(not_chosen) = (1 - 1/n)^n. You have a 1/n chance to draw a particular sample (since bootstrapping involves drawing with replacement), which you repeat n times to get an n-sized
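
As a quick numeric check, the sketch below evaluates (1 - 1/n)^n for a few values of n and compares it with its limit 1/e. The helper name `p_not_chosen` is just an illustrative choice, not something from the thread.

```python
import math

def p_not_chosen(n):
    """Probability that one particular sample is never drawn
    in n draws with replacement from a dataset of size n."""
    return (1.0 - 1.0 / n) ** n

for n in (5, 10, 100, 1000, 10000):
    p = p_not_chosen(n)
    print(f"n={n:>5}  P(not chosen)={p:.4f}  P(chosen)={1 - p:.4f}")

# As n grows, (1 - 1/n)^n approaches 1/e.
limit = math.exp(-1)
print(f"limit 1/e = {limit:.4f}")
```

For large n, roughly 36.8% of the original samples are left out of a given bootstrap sample, so about 63.2% appear at least once.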

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Алексей Драль
Hi, From the docs http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html : The RandomForestClassifier is trained using bootstrap aggregation, where each new tree is fit from a bootstrap sample of the training observations z_i = (x_i, y_i). The out-of-bag (OOB) error is the
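
To illustrate what "out-of-bag" means for each tree, here is a minimal standard-library sketch of the bookkeeping only. It is not scikit-learn's implementation and fits no trees; `n_samples` and `n_trees` are hypothetical small values chosen so the output is easy to read.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

n_samples = 10  # hypothetical number of training observations
n_trees = 5     # hypothetical number of trees

in_bag_sets = []
oob_sets = []
for t in range(n_trees):
    # Draw n_samples row indices with replacement (the bootstrap sample
    # for tree t) and keep only the distinct indices.
    in_bag = {random.randrange(n_samples) for _ in range(n_samples)}
    # Rows never drawn for this tree are its out-of-bag rows.
    out_of_bag = set(range(n_samples)) - in_bag
    in_bag_sets.append(in_bag)
    oob_sets.append(out_of_bag)
    print(f"tree {t}: out-of-bag rows = {sorted(out_of_bag)}")

# For each row, list the trees that did not train on it.
for i in range(n_samples):
    trees = [t for t, oob in enumerate(oob_sets) if i in oob]
    print(f"row {i}: trees that did not see it = {trees}")
```

An out-of-bag estimate scores each observation using only the trees in its second list, which is why it can act as a validation-style estimate without a separate held-out set.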

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of original dataset (usually with replacement)?

Yes, that should be correct!

> Now, what I am not able to understand is - if entire dataset is used to train each of the