Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-04 Thread Dale T Smith
Search for Jackknife at Wikipedia. That will give you a quick overview. Then 
you will have the background to read the papers below.



While you are at Wikipedia, you may want to read about the bootstrap and random 
forests as well.



__
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science
770-658-5176 | 5985 State Bridge Road, Johns Creek, GA 30097 | 
dale.t.sm...@macys.com

From: scikit-learn 
[mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] On Behalf Of 
Ibrahim Dalal via scikit-learn
Sent: Tuesday, October 4, 2016 6:44 AM
To: Scikit-learn user and developer mailing list
Cc: Ibrahim Dalal
Subject: Re: [scikit-learn] Random Forest with Bootstrapping

Hi,
So why is using a bootstrap sample of size n better than just a random set of 
size 0.62*n in Random Forest?
Thanks


Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-04 Thread Ibrahim Dalal via scikit-learn
Hi,

So why is using a bootstrap sample of size n better than just a random set
of size 0.62*n in Random Forest?

Thanks

On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka 
wrote:

> Originally, this technique was used to estimate a sampling
> distribution. Think of the drawing with replacement as a work-around for
> generating *new* data from a population that is simulated by this repeated
> sampling from the given dataset with replacement.
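A minimal sketch of that original use, estimating the sampling distribution of a
statistic (here the mean) by resampling with replacement; NumPy, the toy data, and
the 1000 replicates are illustrative assumptions, not something from the original
mail:

import numpy as np

rng = np.random.RandomState(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # the one dataset we actually have

boot_means = []
for _ in range(1000):
    # draw a "new" dataset of the same size, with replacement
    resample = rng.choice(data, size=data.shape[0], replace=True)
    boot_means.append(resample.mean())

# the spread of the bootstrap means approximates the standard error of the mean
print(np.std(boot_means))
print(data.std(ddof=1) / np.sqrt(data.shape[0]))  # analytic estimate, for comparison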
>
>
> For more details, I’d recommend reading the original literature, e.g.,
>
> Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.”
> The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1–26.
>
>
> There’s also a whole book on this topic:
>
> Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the
> Bootstrap. Chapman & Hall.
>
>
> Or more relevant to this particular application, maybe see
>
> Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140.
>
> "Tests on real and simulated data sets using classification and regression
> trees and subset selection in linear regression show that bagging can give
> substantial gains in accuracy. The vital element is the instability of the
> prediction method. If perturbing the learning set can cause significant
> changes in the predictor constructed, then bagging can improve accuracy."
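A rough way to probe the question at the top of this thread (a bootstrap sample of
size n versus a plain random subset of about 0.632*n) is a side-by-side run with
scikit-learn's BaggingClassifier; the dataset, number of estimators, and
cross-validation setup below are illustrative assumptions, not a definitive
benchmark:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# scheme 1: n draws with replacement (the classic bootstrap)
bag_bootstrap = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)

# scheme 2: about 0.632*n draws without replacement (plain subsampling)
bag_subsample = BaggingClassifier(n_estimators=100, bootstrap=False,
                                  max_samples=0.632, random_state=0)

for name, clf in [("bootstrap, size n", bag_bootstrap),
                  ("subsample, size 0.632*n", bag_subsample)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())

With unstable base estimators such as unpruned decision trees, the two schemes tend
to give similar results in practice; per the Breiman quote above, the gain comes
from averaging an unstable predictor rather than from the exact sampling scheme.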
>
>
> > On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn@python.org> wrote:
> >
> > So what is the point of having duplicate entries in your training set?
> This seems like pure overhead. Sorry, but you will again have to help me
> here.
> >
> > On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka 
> wrote:
> > > Hi,
> > >
> > > That helped a lot. Thank you very much. I have one more (silly?) doubt
> though.
> > >
> > > Won't an n-sized bootstrapped sample have repeated entries? Say we
> have an original dataset of size 100. A bootstrap sample (say, B) of size
> 100 is drawn from this set. Since about 37 of the original samples are
> left out (in expectation), some of the samples in B must be repeated?
> >
> > Yeah, you'll definitely have duplications, that’s why (if you have an
> infinitely large n) only 0.632*n samples are unique ;).
> >
> > Say your dataset is
> >
> > [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices
> of your data points)
> >
> > then a bootstrap sample could be
> >
> > [9, 1, 1, 0, 4, 4, 5, 7, 9, 9] and your left out sample is consequently
> [2, 3, 6, 8]
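The same toy draw can be reproduced in a few lines; np.random.choice and
np.setdiff1d are just one convenient way to do it, and the seed is arbitrary:

import numpy as np

rng = np.random.RandomState(0)
indices = np.arange(10)                                      # [0, 1, ..., 9]
boot = rng.choice(indices, size=indices.size, replace=True)  # bootstrap sample, with duplicates
oob = np.setdiff1d(indices, boot)                            # indices never drawn, i.e. left out
print(boot)  # e.g. [5 0 3 3 7 9 3 5 2 4]
print(oob)   # e.g. [1 6 8]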
> >
> >
> > > On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn@python.org> wrote:
> > >
> > > Hi,
> > >
> > > That helped a lot. Thank you very much. I have one more (silly?) doubt
> though.
> > >
> > > Won't an n-sized bootstrapped sample have repeated entries? Say we
> have an original dataset of size 100. A bootstrap sample (say, B) of size
> 100 is drawn from this set. Since about 37 of the original samples are
> left out (in expectation), some of the samples in B must be repeated?
> > >
> > > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka <
> se.rasc...@gmail.com> wrote:
> > > Or maybe more intuitively, you can visualize this asymptotic behavior
> e.g., via
> > >
> > > import matplotlib.pyplot as plt
> > >
> > > vs = []
> > > for n in range(5, 201, 5):
> > >     v = 1 - (1. - 1./n)**n
> > >     vs.append(v)
> > >
> > > plt.plot([n for n in range(5, 201, 5)], vs, marker='o',
> > >          markersize=6,
> > >          alpha=0.5)
> > >
> > > plt.xlabel('n')
> > > plt.ylabel('1 - (1 - 1/n)^n')
> > > plt.xlim([0, 210])
> > > plt.show()
> > >
> > > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka 
> wrote:
> > > >
> > > > Say the probability that a given sample from a dataset of size n is
> *not* drawn as a bootstrap sample is
> > > >
> > > > P(not_chosen) = (1 - 1/n)^n
> > > >
> > > > This is because each single draw picks a particular sample with
> probability 1/n (bootstrapping involves drawing with replacement), and the
> draw is repeated n times to get an n-sized bootstrap sample.
> > > >
> > > > This is asymptotically "1/e approx. 0.368” (i.e., for very, very
> large n)
> > > >
> > > > Then, you can compute the probability of a sample being chosen as
> > > >
> > > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
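A quick numerical check of that limit (the choice of n is arbitrary):

import math

n = 1000000
p_not_chosen = (1 - 1.0 / n) ** n  # probability that a given point is never drawn
print(p_not_chosen, 1 / math.e)    # both approximately 0.3679
print(1 - p_not_chosen)            # approximately 0.6321, the 0.632 figure above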
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn@python.org> wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> Thank you for the reply. Please bear with me for a while.
> > > >>
> > > >> From where did this number, 0.632, come? I have no background in
> statistics (which appears to be the case here!). Or let me rephrase my
> query: what is this bootstrap sampling all about? Searched the web, but
> didn't get satisfactory results.
> > > >>
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <
> se.rasc...@gmail.com> wrote:
> > > >>> From whatever 

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-04 Thread Joel Nothman
Congratulations, Raghav! Thanks for your dedication, as a student and
mentor in GSoC, but at all other times too!

On 4 October 2016 at 19:14, Jaques Grobler  wrote:

> Congrats Raghav!
>
> 2016-10-03 21:25 GMT+02:00 Andreas Mueller :
>
>> Congrats, hope to see lots more ;)
>>
>>
>> On 10/03/2016 12:09 PM, Raghav R V wrote:
>>
>> Thanks everyone! Looking forward to contributing more :D
>>
>> On Mon, Oct 3, 2016 at 5:40 PM, Ronnie Ghose 
>> wrote:
>>
>>> congrats! :)
>>>
>>> On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen 
>>> wrote:
>>>
 Congrats, Raghav!

 Nelson Liu wrote on Mon, Oct 3, 2016 at 11:27 PM:

> Yay! Congrats, Raghav!
>
> On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux <
> gael.varoqu...@normalesup.org> wrote:
>
> Hi,
>
> We have the pleasure to welcome Raghav RV to the core-dev team. Raghav
> (@raghavrv) has been working on scikit-learn for more than a year. In
> particular, he implemented the rewrite of the cross-validation
> utilities,
> which is quite dear to my heart.
>
> Welcome Raghav!
>
> Gaël
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn