Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
I don't think we can deny this is strange, certainly for real-world, IID
data!

On 13 April 2016 at 10:31, Juan Nunez-Iglesias  wrote:

> Yes but would you expect sampling 280K / 3M to be qualitatively different
> from sampling 70K / 3M?
>
> At any rate I'll attempt a more rigorous test later this week and report
> back. Thanks!
>
> Juan.

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Juan Nunez-Iglesias
Yes but would you expect sampling 280K / 3M to be qualitatively different
from sampling 70K / 3M?

At any rate I'll attempt a more rigorous test later this week and report
back. Thanks!

Juan.

On Wed, Apr 13, 2016 at 10:21 AM, Joel Nothman 
wrote:

> It's hard to believe this is a software problem rather than a data
> problem. If your data was accidentally a duplicate of the dataset, you
> could certainly get 100%.

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
It's hard to believe this is a software problem rather than a data problem.
If your data was accidentally a duplicate of the dataset, you could
certainly get 100%.

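A quick way to test that hypothesis (a minimal sketch, assuming X is the
subsampled feature matrix from the script quoted further down the thread)
is to count exact duplicate rows; identical rows landing in different CV
folds would let fully grown trees score them perfectly:

import numpy as np

X = np.ascontiguousarray(X)   # make row.tobytes() reflect the row contents
n_unique = len({row.tobytes() for row in X})
print("rows: %d, unique rows: %d, exact duplicates: %d"
      % (X.shape[0], n_unique, X.shape[0] - n_unique))
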
On 13 April 2016 at 10:10, Juan Nunez-Iglesias  wrote:

> Hallelujah! I'd given up on this thread. Thanks for resurrecting it, Andy!
> =)
>
> However, I don't think data distribution can explain the result, since
> GridSearchCV gives the expected result (~0.8 accuracy) with 3K and 70K
> random samples but changes to perfect classification for 280K samples. I
> don't have the data on this computer so I can't test it right now, though.
>
> Juan.

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Juan Nunez-Iglesias
Hallelujah! I'd given up on this thread. Thanks for resurrecting it, Andy!
=)

However, I don't think data distribution can explain the result, since
GridSearchCV gives the expected result (~0.8 accuracy) with 3K and 70K
random samples but changes to perfect classification for 280K samples. I
don't have the data on this computer so I can't test it right now, though.

Juan.
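
One way to narrow down where the behaviour flips (a sketch, assuming X0 and
y0 are the full arrays loaded in the original post's boilerplate; the fixed
forest configuration is only illustrative) is to sweep the subsample size
with one model and watch where the cross-validated score jumps:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in 0.17

rng = np.random.RandomState(0)
rf = RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)
for n in (3000, 30000, 70000, 280000):
    sub = rng.choice(len(y0), size=n, replace=False)
    scores = cross_val_score(rf, X0[sub], y0[sub], cv=3)
    print("n=%d  mean=%.5f  std=%.5f" % (n, scores.mean(), scores.std()))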

On Wed, Apr 13, 2016 at 8:51 AM, Andreas Mueller  wrote:

> Have you tried to "score" the grid-search on the non-training set?
> The cross-validation is using stratified k-fold while your confirmation
> used the beginning of the dataset vs the rest.
> Your data is probably not IID.

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Andreas Mueller

Have you tried to "score" the grid-search on the non-training set?
The cross-validation is using stratified k-fold while your confirmation 
used the beginning of the dataset vs the rest.

Your data is probably not IID.
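
For instance (a sketch, assuming X and y are the 280K subsample and
best_params is one of the winning parameter dicts from the search;
train_test_split lives in sklearn.cross_validation in the 0.17-era API):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rf = RandomForestClassifier(**best_params)

# (a) shuffled, stratified split: roughly what StratifiedKFold sees
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                           stratify=y, random_state=0)
rf.fit(X_tr, y_tr)
print("shuffled split:   %.5f" % rf.score(X_te, y_te))

# (b) contiguous beginning-vs-rest split: what the manual check used
half = len(y) // 2
rf.fit(X[:half], y[:half])
print("contiguous split: %.5f" % rf.score(X[half:], y[half:]))

A perfect score on (a) but ~0.8 on (b) would point at ordered or correlated
rows rather than a GridSearchCV bug.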


On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:

Hi all,

TL;DR: when I run GridSearchCV with RandomForestClassifier and "many" 
samples (280K), it falsely shows accuracy of 1.0 for full trees 
(max_depth=None). This doesn't happen for fewer samples.


Longer version:

I'm trying to optimise RF hyperparameters using GridSearchCV for the
first time. I have a lot of data (~3M samples, 140 features), so I
subsampled it to do this. First I subsampled to 3000 samples, which
finished in 5 min, so I ran 70K samples to see if the result would still
hold. This resulted in completely different parameter choices, so I ran
280K samples overnight, to see whether at least the choices would
stabilise as n -> inf. When I printed the top 10 models, I got the
following:


In [7]: bests = sorted(random_search.grid_scores_, reverse=True,
                       key=lambda x: x[1])

In [8]: bests[:10]
Out[8]:
[mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': True,
  'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': True,
  'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True,
  'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True,
  'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True,
  'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 20, 'bootstrap': False,
  'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 100, 'bootstrap': False,
  'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 20, 'bootstrap': False,
  'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 100, 'bootstrap': False,
  'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': False,
  'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]

Needless to say, perfect accuracy is suspicious, and indeed, in this 
case, completely spurious:


In [16]: rftop = ensemble.RandomForestClassifier(**{'n_estimators': 20,
'bootstrap': False, 'max_features': 'auto', 'max_depth': None,
'criterion': 'gini'})


In [17]: rftop.fit(X[:20], y[:20])

In [20]: np.mean(rftop.predict(X[20:]) == y[20:])
Out[20]: 0.826125

That's more in line with what's expected for this dataset, and with what
was found by the search with 72K samples (top model: mean: 0.82640,
std: 0.00341, params: {'n_estimators': 500, 'bootstrap': False,
'max_features': 20, 'max_depth': 20, 'criterion': 'gini'}).


Anyway, here's my code. Any idea why more samples would cause this
overfitting / testing on training data?


# [omitted: boilerplate to load full data in X0, y0]
import numpy as np
idx = np.random.choice(len(y0), size=280000, replace=False)
X, y = X0[idx], y0[idx]
param_dist = {'n_estimators': [20, 100, 200, 500],
              'max_depth': [3, 5, 20, None],
              'max_features': ['auto', 5, 10, 20],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}
from sklearn import grid_search as gs
from time import time
from sklearn import ensemble
rf = ensemble.RandomForestClassifier()
random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
                                verbose=2, n_jobs=12)
start = time(); random_search.fit(X, y); stop = time()

Thank you!

Juan.
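
Since refit=False was used, one way to act on the suggestion above (a
sketch, assuming the search has been fitted and X0, y0, idx, X, y are as
in the snippet) is to refit the best configuration and score it on the
rows that never entered the 280K subsample:

import numpy as np
from sklearn import ensemble

best = max(random_search.grid_scores_, key=lambda s: s.mean_validation_score)
rf_best = ensemble.RandomForestClassifier(**best.parameters).fit(X, y)

mask = np.ones(len(y0), dtype=bool)
mask[idx] = False                      # rows the grid search never saw
print("held-out accuracy: %.5f" % rf_best.score(X0[mask], y0[mask]))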



Re: [Scikit-learn-general] Class Weight Random Forest Classifier

2016-04-12 Thread Andreas Mueller
Another possibility is to threshold the predict_proba differently, such 
that the decision maximizes whatever metric you have defined.
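
A sketch of that thresholding idea (assuming clf is the fitted classifier,
X_val and y_val are a held-out validation split, and F1 stands in for
whatever metric you actually care about):

import numpy as np
from sklearn.metrics import f1_score

proba = clf.predict_proba(X_val)[:, 1]            # P(class 1)
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_val, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
y_pred = (proba >= best_t).astype(int)
print("best threshold: %.2f  (F1 = %.3f)" % (best_t, max(f1s)))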



On 03/15/2016 07:44 AM, Mamun Rashid wrote:

Hi All,
I asked this question a couple of weeks ago on the list. I have a
two-class problem where my positive class (Class 1) and negative class
(Class 0) are imbalanced. Secondly, I care much less about the negative
class, so I specified both a class weight (to a random forest classifier)
and a sample weight to the fit function to give more importance to my
positive class.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

cl_weight = {0: weight1, 1: weight2}
clf = RandomForestClassifier(n_estimators=400, max_depth=None,
                             min_samples_split=2, random_state=0,
                             oob_score=True, class_weight=cl_weight,
                             criterion="gini")

sample_weight = np.array([weight if m == 1 else 1 for m in df_tr[label_column]])
y_pred = clf.fit(X_tr, y_tr, sample_weight=sample_weight).predict(X_te)

Despite specifying dramatically different class weights, I do not
observe much difference. Example: cl_weight = {0:0.001, 1:0.999} vs
cl_weight = {0:0.50, 1:0.50}. Am I passing the class weight correctly?
I am giving an example of two folds from these two runs, Fold_1 and
Fold_5:

## cl_weight = {0:0.001, 1:0.999}
Fold_1 Confusion Matrix
        0     1
  0  1681    26
  1   636   149
Fold_5 Confusion Matrix
        0     1
  0  1670    15
  1   734   160

## cl_weight = {0:0.50, 1:0.50}
Fold_1 Confusion Matrix
        0     1
  0  1690    15
  1   630   163
Fold_5 Confusion Matrix
        0     1
  0  1676    14
  1   709   170

Thanks,
Mamun




Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-12 Thread Vlad Niculae
I would definitely join the sprint, anything after June 17 works for
me. I was thinking to come hang around during ICML, even if I might
not be able to afford the conference.

Cheers,
Vlad

On Tue, Apr 12, 2016 at 11:39 AM, Andreas Mueller  wrote:
> So should we pick another or possibly an additional date?
> Will anyone be in NYC for ICML / UAI / COLT?


Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-12 Thread Andreas Mueller
So should we pick another or possibly an additional date?
Will anyone be in NYC for ICML / UAI / COLT?

On 04/12/2016 03:56 AM, Alexandre Gramfort wrote:
>> Sorry, ICML is at the same dates as the big brain imaging conference, so
>> I will not be able to attend (neither the conference, nor a sprint).
> same for me. Surprisingly...
>
> Alex


Re: [Scikit-learn-general] load_svmlight_file value error

2016-04-12 Thread Gunjan Dewan
Hi Manjush,

Yes, this issue has been reported.

You can use the data from the following link. Its train and test data sets
do not have spaces after the commas separating labels, so I was able to
load them with load_svmlight_file.

Link :
http://research.microsoft.com/en-us/um/people/manik/downloads/XC/XMLRepository.html
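
If you do want to load the original Kaggle file, a rough sketch of the
preprocessing suggested further down this thread (assuming the header row
has already been removed and commas only occur in the leading label list,
so "1, 2, 3 feat:val ..." becomes "1,2,3 feat:val ..."):

import re
from sklearn.datasets import load_svmlight_file

with open("train.csv") as src, open("train_fixed.txt", "w") as dst:
    for line in src:
        dst.write(re.sub(r",\s+", ",", line))   # drop spaces after label commas

X, y = load_svmlight_file("train_fixed.txt", multilabel=True)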

On Tue, Apr 12, 2016 at 3:54 PM, Manjush Vundemodalu 
wrote:

>
> Is this issue reported already ? I am getting same error while trying to
> load kaggle train.csv (same file) with load_svmlight_file
>
> Regards,
> Manjush

Re: [Scikit-learn-general] Data properties for mutual information feature selection

2016-04-12 Thread Manjush Vundemodalu
It depends on your problem statement and the data set you are using to
train your model. Can you be more specific?

Regards,
Manjush
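
As a concrete starting point (a sketch, assuming X and y are your data;
mutual_info_classif needs scikit-learn >= 0.18), comparing mutual-information
scores against a plain F-test can show whether the features carry the
non-linear or non-monotonic dependencies that MI is good at picking up:

import numpy as np
from sklearn.feature_selection import mutual_info_classif, f_classif

mi = mutual_info_classif(X, y, random_state=0)
f_stat, _ = f_classif(X, y)
print("top 10 features by MI:     %s" % np.argsort(-mi)[:10])
print("top 10 features by F-test: %s" % np.argsort(-f_stat)[:10])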

On Wed, Feb 17, 2016 at 8:26 AM Shishir Pandey  wrote:

> Hi
>
> What properties of the data should I look at to justify that mutual
> information is a good feature selection method for it?
>
>
> --
> sp
>


Re: [Scikit-learn-general] load_svmlight_file value error

2016-04-12 Thread Manjush Vundemodalu
Is this issue already reported? I am getting the same error while trying to
load the Kaggle train.csv (same file) with load_svmlight_file.

Regards,
Manjush

On Sat, Feb 13, 2016 at 9:56 AM Gunjan Dewan 
wrote:

> Ill do that.
>
> Thanks a lot.
>
> Gunjan
>
> On Sat, Feb 13, 2016 at 6:04 AM, Mathieu Blondel 
> wrote:
>
>> It seems like our svmlight reader doesn't support spaces between labels:
>>
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_svmlight_format.pyx#L71
>>
>> Could you report an issue on github?
>>
>> In the mean time, you can write a small Python script that deletes the
>> space between labels.
>>
>> Mathieu
>>
>> On Fri, Feb 12, 2016 at 11:00 PM, Gunjan Dewan wrote:
>>
>>> Hi Mathieu,
>>>
>>> Thanks a lot for the help.
>>> But even after changing the multilabel option it is giving a value error:
>>>
>>>
>>>   File "_svmlight_format.pyx", line 67, in
>>> sklearn.datasets._svmlight_format._load_svmlight_file
>>> (sklearn\datasets\_svmlight_format.c:2055)
>>>
>>> ValueError: could not convert string to float:
>>>
>>>
>>>
>>> But this time, it does not show any value after the error. It's blank.
>>> Any idea why this is happening?
>>>
>>>
>>> Gunjan
>>>
>>> On Fri, Feb 12, 2016 at 6:59 PM, Mathieu Blondel 
>>> wrote:
>>>
 Hi Gunjan,

 Apparently the dataset is multi-label, so you need to use the
 multilabel=True option.


 http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

 Mathieu

 On Fri, Feb 12, 2016 at 10:04 PM, Gunjan Dewan <
 dewangunjan6...@gmail.com> wrote:

> Hi all,
>
> I am using the following dataset from kaggle (train.csv):
> https://www.kaggle.com/c/lshtc/data
>
> The dataset is in libSVM format.
>
> However, while trying to load it using load_svmlight_file, I get the
> following error
>
> File "_svmlight_format.pyx", line 72, in
> sklearn.datasets._svmlight_format._load_svmlight_file
> (sklearn\datasets\_svmlight_format.c:2120)
>
> ValueError: could not convert string to float: b'Data'
>
> I then removed the header but it is still giving me the same value
> error.
> Can anyone please help me out with this?
>
> I also wanted to know if there is any other way to convert the libSVM
> format into 2 matrices.
>
> Note : I just started out with sklearn and machine learning.
>
> Thanks,
> Gunjan

Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-12 Thread Alexandre Gramfort
> Sorry, ICML is at the same dates as the big brain imaging conference, so
> I will not be able to attend (neither the conference, nor a sprint).

same for me. Surprisingly...

Alex
