Re: [Scikit-learn-general] Suggestions for the model selection module

2016-05-07 Thread Matthias Feurer

Dear Joel,

Thank you for taking the time to answer my email. I hadn't seen the PR on 
this topic, so thanks for pointing me to it. I can see your points with 
regard to the get_params() method, and it might be better if I write 
more serialization code on my side (although, for example, 
RandomizedSearchCV also returns a lot of parameters one would not 
consider searching over).


Nevertheless, I still think it would be a good idea to have distribution 
objects in scikit-learn since some common use cases cannot be easily 
handled with scipy.stats (see my last email for examples).


Best regards,
Matthias

On 07.05.2016 14:41, Joel Nothman wrote:
On 7 May 2016 at 19:12, Matthias Feurer wrote:


1. Return the fit and predict time in `grid_scores_`


This has been proposed for many years as part of an overhaul of 
grid_scores_. The latest attempt is currently underway at 
https://github.com/scikit-learn/scikit-learn/pull/6697, and has a good 
chance of being merged.


2. Add distribution objects to scikit-learn which have get_params and
set_params attributes


Your use of get_params to perform serialisation is certainly not what 
get_params is designed for, though I understand your use of it that 
way... as long as all your parameters are either primitives or objects 
supporting get_params. However, this is not by design. Further, 
param_distributions is a dict whose values are scipy.stats rvs; 
get_params currently does not traverse dicts, so this is already 
unfamiliar territory requiring a lot of design, even if we were 
convinced that this is a valuable use case, which I am not certain of.


3. Add get_params and set_params to CV objects


get_params and set_params are intended to allow programmatic search 
over those parameter settings. This is not often what one does with 
the parameters of CV splitting methods, but I acknowledge that 
supporting this would not be difficult. Still, if serialisation is the 
purpose of this, it's not really the point.




--
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z


___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




Re: [Scikit-learn-general] Suggestions for the model selection module

2016-05-07 Thread Joel Nothman
On 7 May 2016 at 19:12, Matthias Feurer wrote:


> 1. Return the fit and predict time in `grid_scores_`
>

This has been proposed for many years as part of an overhaul of
grid_scores_. The latest attempt is currently underway at
https://github.com/scikit-learn/scikit-learn/pull/6697, and has a good
chance of being merged.


> 2. Add distribution objects to scikit-learn which have get_params and
> set_params attributes
>

Your use of get_params to perform serialisation is certainly not what
get_params is designed for, though I understand your use of it that way...
as long as all your parameters are either primitives or objects supporting
get_params. However, this is not by design. Further, param_distributions is
a dict whose values are scipy.stats rvs; get_params currently does not
traverse dicts, so this is already unfamiliar territory requiring a lot of
design, even if we were convinced that this is a valuable use case,
which I am not certain of.
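
The point about dicts can be illustrated with a small imitation of the get_params convention. This is only a sketch: Component and Search are hypothetical toy classes, not scikit-learn code, and the loop merely mimics the rule that nested values are expanded (as name__subname) when they expose get_params themselves.

```python
class Component:
    """Toy object that follows the get_params convention (hypothetical)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        return {"alpha": self.alpha}


class Search:
    """Toy container with one nested object and one dict-valued parameter."""

    def __init__(self, estimator, param_distributions):
        self.estimator = estimator
        self.param_distributions = param_distributions

    def get_params(self, deep=True):
        out = {}
        for name in ("estimator", "param_distributions"):
            value = getattr(self, name)
            out[name] = value
            # Recurse only into values that expose get_params themselves.
            # A plain dict has no get_params, so its contents stay hidden.
            if deep and hasattr(value, "get_params"):
                for sub_name, sub_value in value.get_params().items():
                    out[name + "__" + sub_name] = sub_value
        return out


search = Search(Component(alpha=0.5), {"alpha": Component(alpha=2.0)})
params = search.get_params()
```

Here params gains an "estimator__alpha" entry, while the object stored inside the param_distributions dict is never visited, even though it supports get_params.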


> 3. Add get_params and set_params to CV objects
>

get_params and set_params are intended to allow programmatic search over
those parameter settings. This is not often what one does with the
parameters of CV splitting methods, but I acknowledge that supporting this
would not be difficult. Still, if serialisation is the purpose of this,
it's not really the point.


[Scikit-learn-general] Suggestions for the model selection module

2016-05-07 Thread Matthias Feurer
Dear scikit-learn team,

First of all, the model selection module is really easy to use and has a 
nice and clean interface; I really like that. Nevertheless, while using 
it for benchmarks I found some shortcomings where I think the module 
could be improved.

1. Return the fit and predict time in `grid_scores_`

BaseSearchCV relies on a function called _fit_and_score to produce the 
entries in grid_scores_. This function measures the time it takes to fit 
a model, predict for the (cross-)validation set and calculate the score. 
It returns this time, which is then discarded: 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_search.py#L569

I propose to store this time in grid_scores_ and make it accessible to 
the user. Also, the time taken to refit the model in line 596 and 
following should be measured and made accessible to the user.
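
A minimal sketch of the idea, assuming a toy estimator and a hypothetical helper named fit_and_score_timed. Neither is scikit-learn code, and the real _fit_and_score has a different signature; the sketch only shows the timings being returned next to the score instead of being thrown away.

```python
import time


class MeanRegressor:
    """Toy estimator used only for illustration (not a scikit-learn class)."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def score(self, X, y):
        # Negative mean squared error, so that higher is better.
        return -sum((self.mean_ - t) ** 2 for t in y) / len(y)


def fit_and_score_timed(estimator, X_train, y_train, X_test, y_test):
    """Hypothetical helper: return the score together with both timings."""
    start = time.perf_counter()
    estimator.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    score = estimator.score(X_test, y_test)
    score_time = time.perf_counter() - start

    # Keep the timings next to the score so the caller can store them.
    return {"score": score, "fit_time": fit_time, "score_time": score_time}


result = fit_and_score_timed(
    MeanRegressor(), [[0], [1]], [1.0, 3.0], [[2]], [2.0]
)
```

The time for the final refit could be measured in the same way, by wrapping that single fit call.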

2. Add distribution objects to scikit-learn which have get_params and 
set_params attributes

When printing the parameter distribution proposed for the model 
selection module (scipy.stats), the result is something which cannot be 
parsed:



It's also not possible to access this with the scikit-learn-like methods 
get_params() and set_params() (actually, the first of the two should 
suffice). I propose to add distribution objects for commonly used 
distributions:

1. Categorical variables - replace previously used lists
2. RandInt - replace scipy.stats.randint
3. Uniform - might replace scipy.stats.uniform, I'm not sure if that 
would accept a lower and an upper bound at construction time
4. LogUniform - does not exist so far, useful for searching over C and 
gamma in SVMs, learning rate in NNs etc.
5. LogUniformInt - same thing, but as an Integer, useful for the 
min_samples_split in RF and ET
6. MultipleUniformInt - this is a bit weird as it would return a tuple 
of Integers, but I could not find any other way to tune both the number 
of hidden layers and their size in the MLPClassifier
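
As an illustration of item 4, here is a minimal sketch of such a distribution object. The class name LogUniform, its lower/upper arguments, and the use of random.Random as the random_state are all assumptions made for this example; nothing here is an existing scikit-learn or scipy API. The only interface borrowed from scipy.stats is an rvs() method for drawing a sample.

```python
import math
import random


class LogUniform:
    """Hypothetical distribution object; not part of scikit-learn or scipy."""

    def __init__(self, lower=1e-3, upper=1e3):
        self.lower = lower
        self.upper = upper

    def get_params(self, deep=True):
        return {"lower": self.lower, "upper": self.upper}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in ("lower", "upper"):
                raise ValueError("Invalid parameter %r" % name)
            setattr(self, name, value)
        return self

    def rvs(self, random_state=None):
        # Draw uniformly in log space, then map back to the original scale.
        # random_state is assumed to be a random.Random instance here.
        rng = random_state if random_state is not None else random
        log_value = rng.uniform(math.log(self.lower), math.log(self.upper))
        return math.exp(log_value)

    def __repr__(self):
        # A repr that can be read back, unlike the default object repr.
        return "LogUniform(lower=%r, upper=%r)" % (self.lower, self.upper)


dist = LogUniform(lower=1e-5, upper=1e2)
sample = dist.rvs(random_state=random.Random(0))
clone = LogUniform(**dist.get_params())
```

Because get_params() returns exactly the constructor arguments, the object can be rebuilt from a plain dict, which is what the serialization use case needs.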

3. Add get_params and set_params to CV objects

Currently, the CV objects like StratifiedKFold look nice when printed, 
but it is not possible to access their parameters programmatically in 
order to serialize them (without pickle). Since they are part of the 
BaseSearchCV and returned by a call to BaseSearchCV.get_params(), I 
propose to add parameter setters and getters to the CV objects as well 
to maintain a consistent interface.
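
A minimal sketch of pickle-free serialization along these lines. ToyKFold and params_from_init are hypothetical names introduced for this example, and the sketch assumes that every constructor argument is stored under an attribute of the same name.

```python
import inspect
import json


class ToyKFold:
    """Stand-in for a CV splitter; hypothetical, not the scikit-learn class."""

    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state


def params_from_init(obj):
    """Collect constructor arguments, assuming each is stored as an attribute."""
    signature = inspect.signature(type(obj).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(obj, name) for name in names}


cv = ToyKFold(n_splits=5, shuffle=True, random_state=0)

# Serialize to JSON and rebuild the object from the stored parameters.
serialized = json.dumps(
    {"class": type(cv).__name__, "params": params_from_init(cv)}
)
restored = ToyKFold(**json.loads(serialized)["params"])
```

This only works for parameters that are JSON-serializable; nested objects would need the same treatment applied recursively.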


I think these changes are not too hard to implement and I am willing to 
do so if you approve these suggestions.

Best regards,
Matthias
