Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Gael Varoquaux
On Wed, Sep 09, 2015 at 02:10:05PM -0400, Andreas Mueller wrote:
> I see two possible ways forward:
> a) Make the functions private and deprecate the public interface, like 
> k_means, lars_path, 

These functions are important for reuse in an algorithmic setting: if I
am writing an algorithm that uses k-means or lars_path internally, it is
much more natural to use the functions, and they have less overhead.

I think that the target use case for the functions is not the same as for
objects. They target more advanced users who better understand what they
are doing. For this reason, things like input-parameter validation, which
are heavily tested by the common tests, should probably not be in the
functions (they induce overhead that may be quite significant inside an
algorithm). In a sense, I feel that common tests are less important, and
maybe not wanted, for functions, as we would be making exceptions all the
time.

Gaël

--
Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
Get real-time metrics from all of your servers, apps and tools
in one place.
SourceForge users - Click here to start your Free Trial of Datadog now!
http://pubads.g.doubleclick.net/gampad/clk?id=241902991=/4140
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Andreas Mueller


On 09/10/2015 10:08 AM, Gael Varoquaux wrote:
> > And your statement "they are for advanced users" is not manifested in
> > the API or documentation.
> OK, but that's a bug of the documentation.
So you suggest adding to the docstring of every function "this is for 
advanced users only"?
That is kind of like making them private, only that private is much more 
explicit.
> > There is no reason a user would expect one to act differently from the other.
> Users who don't code algorithms probably don't have any reason to be
> using them.
>
Well, the reason would be that they find them in the API docs and don't 
know whether to use the class or the function.

Is it fair to summarize your opinion as
"functions don't need input validation or a consistent interface; the 
documentation should make clear they are for advanced users"?

FWIW, many of the functions do input validation at the moment; it is just 
inconsistent.



Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Andy
On 09/10/2015 09:22 AM, Gael Varoquaux wrote:
>
> These functions are important for reuse in an algorithmic setting: if I
> am doing an algorithm that uses k-means or lars_path inside the
> algorithm, it is much more natural to use the functions, and they have
> less overhead.
>
> I think that the target use case for the functions is not the same as for
> objects. They target more advanced users who understand better what they
> do. For this reason, things like input-parameter validation, that are
> heavily tested by the common tests, should probably not be in the
> functions (they induce overhead which may be quite important inside an
> algorithm). In a sense, I feel that common tests are less important, and
> maybe not wanted for functions, as we will be putting exceptions all the
> time.
I feel it is quite awkward if the function and the estimator have 
different requirements on X.
And your statement "they are for advanced users" is not manifested in 
the API or documentation.
There is no reason a user would expect one to act differently from the other.

Why do you say the functions have less overhead?
And why are they more natural to use?

cluster_centers = k_means(X, n_clusters=10)[0]

is a bit shorter than

cluster_centers = KMeans(n_clusters=10).fit(X).cluster_centers_

but the difference is really not that much.



Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Gael Varoquaux
On Thu, Sep 10, 2015 at 09:52:44AM -0400, Andy wrote:
> I feel it is quite awkward if the function and the estimator have 
> different requirements on X.

That's a point of view. But they are different things, so I am not sure
that this point of view is universal.

> And your statement "they are for advanced users" is not manifested in 
> the API or documentation.

OK, but that's a bug of the documentation.

> There is no reason a user would expect one to act differently from the other.

Users who don't code algorithms probably don't have any reason to be
using them.

> Why do you say the functions have less overhead?

They don't have to do things like parameter validation, and all the
book-keeping that goes with maintaining the consistent state of the
object.

> And why are they more natural to use?

People writing algorithms are not used to thinking in terms of objects.

> cluster_centers = k_means(X, n_clusters=10)[0]

> is a bit shorter than

> cluster_centers = KMeans(n_clusters=10).fit(X).cluster_centers_

> but the difference is really not that much.

Functions implement algorithms, with an input and an output. Objects
implement a predictor, constrained by what we define as a predictor. It's
not obvious, for a given algorithm, what the corresponding prediction API
is. The input might not always be a data matrix, and the output is not
always naturally provided by one of our methods. In this respect, the
k-means problem is a good example. People writing algorithms using k-means
do not think in terms of 'fit_predict'.

There is of course value in having objects: if some of the operations, or
the inner state of the algorithm, are reused, the objects are great. But
if we just want to write, for instance, a parallel loop, functions can be
better (no internal state is a good thing when dealing with concurrency).
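
As a rough illustration of the parallel-loop point, the sketch below maps a stateless function over several inputs with a thread pool. `one_d_two_means` is a toy, hypothetical stand-in for a clustering routine (it is not scikit-learn's k-means); because it keeps no state between calls, the workers share nothing that would need locking.

```python
from concurrent.futures import ThreadPoolExecutor


def one_d_two_means(values, n_iter=10):
    """Toy stateless routine (hypothetical): split 1-D values into two groups.

    Returns the (low_center, high_center) pair; nothing is stored between calls.
    """
    lo, hi = min(values), max(values)
    for _ in range(n_iter):
        # Assign each value to the nearer center, then recompute both centers.
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return lo, hi


datasets = [
    [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
    [0.0, 0.0, 10.0, 10.0],
]

# Each call depends only on its argument, so the calls can run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(one_d_two_means, datasets))
```

An estimator can be used the same way if a fresh instance is created per task; the function form simply makes the absence of shared state obvious.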

Gaël





Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Joel Nothman
A reflective response without a clear opinion:

I'll admit to rarely-if-ever using function versions, and suspect they
frequently have limited utility over the estimator interface. Occasionally
they even wrap the estimator interface, so they're not going to provide the
efficiency advantages Gaël talks about.

While "People writing algorithms are not used to thinking in terms of
objects.", such people still know how to wrap an object to make it look
like a function. Seeing as there has been no consistent approach to
developing functional learners, I think that there are many functions that
effectively provide (data, estimator parameters) -> model attributes. This
is clearly a nice functional abstraction, but in truth, only those
functions that accept more or different parameters than their estimator
cousins, or that, for instance, solve only part of the learning problem,
are distinctively useful.
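
A minimal sketch of the "(data, estimator parameters) -> model attributes" wrapping follows. `ToyMeanEstimator` and `as_function` are hypothetical names used only for illustration; the toy class merely mimics the fit-then-read-attributes convention and is not a scikit-learn estimator.

```python
class ToyMeanEstimator:
    """Hypothetical estimator: parameters in __init__, results as attributes."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage

    def fit(self, X):
        mean = sum(X) / len(X)
        self.mean_ = (1.0 - self.shrinkage) * mean
        self.n_samples_ = len(X)
        return self


def as_function(estimator_cls, *attributes):
    """Build a function (X, **params) -> tuple of fitted attributes."""

    def fitted_attributes(X, **params):
        est = estimator_cls(**params).fit(X)
        return tuple(getattr(est, name) for name in attributes)

    return fitted_attributes


# The wrapped object now looks like a plain function to the caller.
toy_mean = as_function(ToyMeanEstimator, "mean_", "n_samples_")
mean, n_samples = toy_mean([2.0, 4.0, 6.0], shrinkage=0.5)
```

A wrapper like this only helps when the function and the estimator take the same parameters; it adds nothing for functions that solve just part of the learning problem.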

From an API development perspective, functions that return model parameters
can be frustrating; they land up accumulating return_something flags in
order to fit changing/expanding output needs, while estimators act as a
namespace where diagnostic output can be dumped, usually at very little
cost. As with output, users may expect function input (i.e. argument
ordering) to be more fixed, in comparison to estimators where separating
data from parameters means it is more natural to use kwargs in
construction, or simply use set_params or attribute setting. So from the
perspective of version compatibility the function versions are harder to
maintain, and we've not yet really ascertained their benefit.
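
The contrast can be sketched with a toy example. `newton_sqrt` and `NewtonSqrt` below are hypothetical and unrelated to any scikit-learn routine; the `return_n_iter` flag is modelled on the kind of return_something flag described above. The function changes its return shape when the flag is set, while the object simply gains another attribute.

```python
def newton_sqrt(x, tol=1e-12, max_iter=50, return_n_iter=False):
    """Function style (hypothetical): a flag changes what is returned.

    Assumes x > 0.
    """
    guess = max(x, 1.0)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        new_guess = 0.5 * (guess + x / guess)
        converged = abs(new_guess - guess) < tol
        guess = new_guess
        if converged:
            break
    if return_n_iter:
        return guess, n_iter
    return guess


class NewtonSqrt:
    """Estimator style (hypothetical): diagnostics become attributes."""

    def __init__(self, tol=1e-12, max_iter=50):
        self.tol = tol
        self.max_iter = max_iter

    def fit(self, x):
        self.root_, self.n_iter_ = newton_sqrt(
            x, tol=self.tol, max_iter=self.max_iter, return_n_iter=True
        )
        return self


root = newton_sqrt(9.0)                               # a single float
root_again, n_iter = newton_sqrt(9.0, return_n_iter=True)  # now a tuple
est = NewtonSqrt().fit(9.0)                           # est.root_, est.n_iter_
```

Adding one more diagnostic to the function means another flag and another return arity, whereas the class can expose it without touching existing callers.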

Their presence in the public API often duplicates the cost of maintaining
docstrings. But we could fairly disregard this issue, in part because even
when private we'd appreciate clear and explicit parameter/returns
documentation.

@Andy, the documentation implies these are for advanced use by (generally)
not referencing them in the narrative documentation. I think that's a fair
way to keep them only for the sight of those who dig deeper, but this
implicitness leaves some maintenance risks. While I don't think a note in
the docstring of each function version is the right solution, "See Also"
could be used to indicate the relationship. Additionally, or alternatively,
we could split classes.rst into "Estimators", "Low-level learning
functions" and "Utilities".



Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Andreas Mueller


On 09/10/2015 06:16 PM, Joel Nothman wrote:
> the documentation implies these are for advanced use by (generally) 
> not referencing them in the narrative documentation. I think that's a 
> fair way to keep them only for the sight of those who dig deeper, but 
> this implicitness leaves some maintenance risks.
I agree that this is sort-of the distinction, though I haven't 
fact-checked that they are actually not used.
But it is pretty implicit, and I don't think looking at the API docs 
counts as digging deeper.
> While I don't think a note in the docstring of each function version 
> is the right solution, "See Also" could be used to indicate the 
> relationship.
How do you mean?
> Additionally, or alternatively, we could split classes.rst into 
> "Estimators", "Low-level learning functions" and "Utilities".
That is actually done for the SVM module:
http://scikit-learn.org/dev/modules/classes.html#module-sklearn.svm

For cluster it is less clear:
http://scikit-learn.org/dev/modules/classes.html#module-sklearn.cluster


@Joel, do you think it is enough to test the functions via the estimators?
I feel that if we do provide this interface, it should either be 
explicitly low-level (the SVM functions are pretty clear about that;
they only accept y as a 64-bit float), or have a well-tested 
interface.
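
One possible shape for "testing the functions via the estimators" is a parity check: run both on the same input and require that they agree. The sketch below uses hypothetical toy names (`toy_center`, `ToyCenter`, `check_function_matches_estimator`), not scikit-learn's actual common tests.

```python
def toy_center(X, shrinkage=0.0):
    """Low-level function (hypothetical): assumes X is a non-empty list of floats."""
    return (1.0 - shrinkage) * (sum(X) / len(X))


class ToyCenter:
    """Estimator (hypothetical): validates its input, then calls the function."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage

    def fit(self, X):
        if len(X) == 0:
            raise ValueError("X must be non-empty")
        self.center_ = toy_center([float(v) for v in X], self.shrinkage)
        return self


def check_function_matches_estimator(function, estimator_cls, attribute, X, **params):
    """Parity check: the function and the fitted attribute must agree on valid input."""
    expected = getattr(estimator_cls(**params).fit(X), attribute)
    actual = function(X, **params)
    assert actual == expected, (actual, expected)


check_function_matches_estimator(
    toy_center, ToyCenter, "center_", [1.0, 2.0, 6.0], shrinkage=0.5
)
```

A check like this covers results on valid input only; it says nothing about how the bare function behaves on invalid input, which is the part the thread disagrees about.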



Re: [Scikit-learn-general] Contributing to scikit-learn

2015-09-10 Thread Rohit Shinde
Hi Gael,

Heeding your advice, I was looking over the possible bugs and I have
decided to solve this one:
https://github.com/scikit-learn/scikit-learn/issues/5229.

Any pointers on how to approach this one?

Thanks,
Rohit.

On Thu, Sep 10, 2015 at 10:27 AM, Gael Varoquaux <
gael.varoqu...@normalesup.org> wrote:

> I would strongly recommend starting with something easier, like issues
> labelled 'easy'. Starting with such a big project is most likely going to
> lead to you approaching the project in a way that is not well adapted to
> scikit-learn, and thus to code that does not get merged.
>
> Cheers,
>
> Gaël
>
> On Thu, Sep 10, 2015 at 06:58:20AM +0530, Rohit Shinde wrote:
> > Hello everyone,
>
> > I have built scikit-learn and I am ready to start coding. Can I get some
> > pointers on how I could start contributing to the projects I mentioned
> > in the earlier mail?
>
> > Thanks,
> > Rohit.
>
> > On Mon, Sep 7, 2015 at 11:50 AM, Rohit Shinde <rohit.shinde12...@gmail.com>
> > wrote:
>
> > Hi Jacob,
>
> > I am interested in Global optimization based hyperparameter optimization
> > and Generalised Additive Models. However, I don't know what kind of
> > background would be needed and if mine would be sufficient for it. I
> > would like to know the prerequisites for it.
>
> > On Sun, Sep 6, 2015 at 9:58 PM, Jacob Schreiber <jmschreibe...@gmail.com>
> > wrote:
>
> > Hi Rohit
>
> > I'm glad you want to contribute to scikit-learn! Which idea were you
> > interested in working on? The metric learning and GMM code is currently
> > being worked on by GSOC students AFAIK.
>
> > Jacob
>
> > On Sun, Sep 6, 2015 at 8:18 AM, Rohit Shinde <rohit.shinde12...@gmail.com>
> > wrote:
>
> > Hello everyone,
>
> > I am Rohit. I am interested in contributing toward scikit-learn. I
> > am quite proficient in Python, Java, C++ and scheme. I have taken
> > undergrad courses in Machine Learning and data mining. I was also
> > part of this year's GSoC under The Opencog Foundation.
>
> > I was looking at the ideas list for GSoC and I would be interested
> > in working on one of those ideas. So, could I get some guidance?
>
> > Thank you,
> > Rohit Shinde.
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone:  ++ 33-1-69-08-79-68
> http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux