Re: [Scikit-learn-general] Common tests for functions vs deprecating functions
On Wed, Sep 09, 2015 at 02:10:05PM -0400, Andreas Mueller wrote:
> I see two possible ways forward:
> a) Make the functions private and deprecate the public interface, like
> k_means, lars_path,

These functions are important for reuse in an algorithmic setting: if I am
writing an algorithm that uses k-means or lars_path internally, it is much
more natural to use the functions, and they have less overhead.

I think that the target use case for the functions is not the same as for
objects. They target more advanced users who better understand what they
are doing. For this reason, things like input-parameter validation, which
are heavily tested by the common tests, should probably not be in the
functions (they induce overhead which may be quite significant inside an
algorithm). In a sense, I feel that common tests are less important, and
maybe not wanted, for functions, as we will be adding exceptions all the
time.

Gaël

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] Common tests for functions vs deprecating functions
On 09/10/2015 10:08 AM, Gael Varoquaux wrote:
>> And your statement "they are for advanced users" is not manifested in
>> the API or documentation.
> OK, but that's a bug of the documentation.

So you suggest adding to the docstring of every function "this is for
advanced users only"? That is kind of like making them private, except
that private is much more explicit.

>> There is no reason a user would expect one to act different from the other.
> Users who don't code algorithms probably don't have any reason to be
> using them.

Well, the reason would be that they find them in the API docs and don't
know whether to use the class or the function.

Is it fair to summarize your opinion as "functions don't need input
validation or a consistent interface; the documentation should make clear
they are for advanced users"?

FWIW, many of the functions do input validation at the moment; it is just
inconsistent.
Re: [Scikit-learn-general] Common tests for functions vs deprecating functions
On 09/10/2015 09:22 AM, Gael Varoquaux wrote:
> These functions are important for reuse in an algorithmic setting: if I
> am doing an algorithm that uses k-means or lars_path inside the
> algorithm, it is much more natural to use the functions, and they have
> less overhead.
>
> I think that the target usecase for the functions is not the same as for
> objects. They target more advanced users who understand better what they
> do. For this reason, things like input-parameter validation, that are
> heavily tested by the common tests, should probably not be in the
> functions (they induce overhead which may be quite important inside an
> algorithm). In a sense, I feel that common tests are less important, and
> maybe not wanted for functions, as we will be putting exceptions all the
> time.

I feel it is quite awkward if the function and the estimator have
different requirements on X. And your statement "they are for advanced
users" is not manifested in the API or documentation. There is no reason
a user would expect one to act differently from the other.

Why do you say the functions have less overhead? And why are they more
natural to use?

    cluster_centers = kmeans(X, n_clusters=10)

is a bit shorter than

    cluster_centers = KMeans(n_clusters=10).fit_predict(X)

but the difference is really not that much.
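The two call styles compared above can be sketched with a toy implementation. This is an illustration only, not scikit-learn's code: `toy_k_means` and `ToyKMeans` are hypothetical names, and the deterministic "first k points" initialisation is a simplification chosen for this example. Note that in scikit-learn itself the function is spelled `k_means` and returns the centers, the labels and the inertia, while `KMeans.fit_predict` returns labels rather than centers, so the two one-liners in the message are shorthand rather than exact equivalents.

```python
def toy_k_means(X, n_clusters, n_iter=10):
    """Function form: data and parameters in, (centers, labels) out."""
    centers = [list(x) for x in X[:n_clusters]]  # simplistic deterministic init
    labels = [0] * len(X)
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(n_clusters),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            for x in X
        ]
        # Update step: mean of the members; an empty cluster keeps its center.
        for j in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


class ToyKMeans:
    """Estimator form: parameters at construction, results kept as attributes."""

    def __init__(self, n_clusters=2, n_iter=10):
        self.n_clusters = n_clusters
        self.n_iter = n_iter

    def fit(self, X):
        self.cluster_centers_, self.labels_ = toy_k_means(
            X, self.n_clusters, self.n_iter)
        return self

    def fit_predict(self, X):
        return self.fit(X).labels_


X = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centers, labels = toy_k_means(X, n_clusters=2)
est = ToyKMeans(n_clusters=2).fit(X)
```

Here the estimator is a thin wrapper around the function, so both forms give the same centers and labels; the difference is only in where the results live.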
Re: [Scikit-learn-general] Common tests for functions vs deprecating functions
On Thu, Sep 10, 2015 at 09:52:44AM -0400, Andy wrote:
> I feel it is quite awkward if the function and the estimator have
> different requirements on X.

That's a point of view. But they are different things, so I am not sure
that this point of view is universal.

> And your statement "they are for advanced users" is not manifested in
> the API or documentation.

OK, but that's a bug of the documentation.

> There is no reason a user would expect one to act different from the other.

Users who don't code algorithms probably don't have any reason to be
using them.

> Why do you say the functions have less overhead?

They don't have to do things like parameter validation, and all the
book-keeping that goes with maintaining the consistent state of the
object.

> And why are they more natural to use?

People writing algorithms are not used to thinking in terms of objects.

> cluster_centers = kmeans(X, n_clusters=10)
> is a bit shorter than
> cluster_centers = KMeans(n_clusters=10).fit_predict(X)
> but the difference is really not that much.

Functions implement algorithms, with an input and an output. Objects
implement a predictor, constrained by what we define as a predictor. It's
not obvious, for a given algorithm, what the corresponding prediction API
is. The input might not always be a data matrix, and the output is not
always naturally returned by one of our methods.

In this respect, the k-means problem is a good example. People writing
algorithms using k-means do not think in terms of 'fit_predict'.

There is of course value in having objects: if some of the operations, or
the inner state of the algorithm, are reused, the objects are great. But
if we just want to write, for instance, a parallel loop, functions can be
better (no internal state is a good thing when dealing with concurrency).

Gaël
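The remark about parallel loops can be illustrated with a small standard-library sketch. The function `centroid` and the `datasets` list are made up for this example; the point is only that a function with no internal state can be handed to a pool of workers without any locking or copying of objects.

```python
from concurrent.futures import ThreadPoolExecutor


def centroid(points):
    """Pure function: the result depends only on the argument."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


# Hypothetical independent datasets, one task each.
datasets = [
    [(0.0, 0.0), (2.0, 2.0)],
    [(1.0, 3.0), (3.0, 5.0)],
    [(10.0, 0.0), (20.0, 0.0)],
]

# No shared mutable state, so the calls can run in any order or concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(centroid, datasets))
```

Threads are used here only to keep the sketch self-contained; for CPU-bound pure-Python work a process pool would usually be the better fit, and the stateless design is what makes that swap straightforward.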
Re: [Scikit-learn-general] Common tests for functions vs deprecating functions
A reflective response without a clear opinion:

I'll admit to rarely-if-ever using function versions, and suspect they
frequently have limited utility over the estimator interface.
Occasionally they even wrap the estimator interface, so they're not going
to provide the efficiency advantages Gaël talks about. While "People
writing algorithms are not used to think in terms of objects.", such
people still know how to wrap an object to make it look like a function.

Seeing as there has been no consistent approach to developing functional
learners, I think that there are many functions that effectively provide
(data, estimator parameters) -> model attributes. This is clearly a nice
functional abstraction, but in truth, only those functions that accept
more/different parameters from their estimator cousins, for instance
those that only solve part of the learning problem, are distinctively
useful.

From an API development perspective, functions that return model
parameters can be frustrating; they land up accumulating return_something
flags in order to fit changing/expanding output needs, while estimators
act as a namespace where diagnostic output can be dumped, usually at very
little cost. As with output, users may expect function input (i.e.
argument ordering) to be more fixed, in comparison to estimators, where
separating data from parameters means it is more natural to use kwargs in
construction, or simply use set_params or attribute setting. So from the
perspective of version compatibility the function versions are harder to
maintain, and we've not yet really ascertained their benefit.

Their presence in the public API often duplicates the cost of maintaining
docstrings. But we could fairly disregard this issue, in part because
even when private we'd appreciate clear and explicit parameter/returns
documentation.

@Andy, the documentation implies these are for advanced use by
(generally) not referencing them in the narrative documentation. I think
that's a fair way to keep them only for the sight of those who dig
deeper, but this implicitness leaves some maintenance risks. While I
don't think a note in the docstring of each function version is the right
solution, "See Also" could be used to indicate the relationship.
Additionally, or alternatively, we could split classes.rst into
"Estimators", "Low-level learning functions" and "Utilities".

On 11 September 2015 at 01:21, Andreas Mueller wrote:
> On 09/10/2015 10:08 AM, Gael Varoquaux wrote:
> >> And your statement "they are for advanced users" is not manifested in
> >> the API or documentation.
> > OK, but that's a bug of the documentation.
> So you suggest adding to the docstring of every function "this is for
> advanced users only"?
> That is kind of like making them private, only that private is much more
> explicit.
> >> There is no reason a user would expect one to act different from the
> >> other.
> > Users who don't code algorithms probably don't have any reason to be
> > using them.
> Well the reason would be they find them in the API docs and they don't
> know whether to use the class or the function.
>
> It is fair to summarize your opinion as
> "functions don't need input validation or a consistent interface, the
> documentation should make clear they are for advanced users"?
>
> FWIW many of the functions do input validation at the moment, it is just
> inconsistent.
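The API-maintenance point above can be shown with a toy contrast. Everything here is hypothetical (`toy_mean`, `ToyMeanEstimator`, `mean_via_estimator` and the flag names are invented for this sketch, not taken from scikit-learn): the function form needs a new `return_*` flag for each extra output, and its return shape changes with the flags, whereas the estimator form can simply store another attribute. The last helper shows how little code it takes to wrap an object so that it looks like a function.

```python
def toy_mean(values, return_n_samples=False, return_spread=False):
    """Function form: every extra output needs a flag and alters the return."""
    out = [sum(values) / len(values)]
    if return_n_samples:
        out.append(len(values))
    if return_spread:
        out.append(max(values) - min(values))
    # A bare value with no flags, a tuple otherwise: callers must track which.
    return out[0] if len(out) == 1 else tuple(out)


class ToyMeanEstimator:
    """Estimator form: diagnostics are attributes; adding one breaks nobody."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        self.n_samples_ = len(values)
        self.spread_ = max(values) - min(values)
        return self


def mean_via_estimator(values):
    """Wrap the object so it can be used like a plain function."""
    return ToyMeanEstimator().fit(values).mean_


data = [1.0, 2.0, 6.0]
plain = toy_mean(data)
with_flags = toy_mean(data, return_n_samples=True, return_spread=True)
est = ToyMeanEstimator().fit(data)
```

The wrapper direction is cheap, which supports the observation that function versions mainly earn their place when they accept different parameters or solve only part of the problem.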
Re: [Scikit-learn-general] Common tests for functions vs deprecating functions
On 09/10/2015 06:16 PM, Joel Nothman wrote:
> the documentation implies these are for advanced use by (generally)
> not referencing them in the narrative documentation. I think that's a
> fair way to keep them only for the sight of those who dig deeper, but
> this implicitness leaves some maintenance risks.

I agree that this is sort-of the distinction, though I haven't
fact-checked that they are actually not used. But it is pretty implicit,
and I don't think looking at the API docs counts as digging deeper.

> While I don't think a note in the docstring of each function version
> is the right solution, "See Also" could be used to indicate the
> relationship.

How do you mean?

> Additionally, or alternatively, we could split classes.rst into
> "Estimators", "Low-level learning functions" and "Utilities".

That is actually done for the SVM module:
http://scikit-learn.org/dev/modules/classes.html#module-sklearn.svm

For cluster it is less clear:
http://scikit-learn.org/dev/modules/classes.html#module-sklearn.cluster

@Joel, do you think it is enough to test the functions via the
estimators? I feel that if we do provide this interface, it should either
be explicitly low-level (the SVM functions are pretty clear about that;
they only accept y as a 64-bit float), or should have a well-tested
interface.
Re: [Scikit-learn-general] Contributing to scikit-learn
Hi Gael,

Heeding your advice, I was looking over the possible bugs and I have
decided to solve this one:
https://github.com/scikit-learn/scikit-learn/issues/5229.
Any pointers on how to approach this one?

Thanks,
Rohit.

On Thu, Sep 10, 2015 at 10:27 AM, Gael Varoquaux
<gael.varoqu...@normalesup.org> wrote:
> I would strongly recommend to start with something easier, like issues
> labelled 'easy'. Starting with such a big project is most likely going to
> lead to you approaching the project in a way that is not well adapted to
> scikit-learn, and thus code that does not get merged.
>
> Cheers,
>
> Gaël
>
> On Thu, Sep 10, 2015 at 06:58:20AM +0530, Rohit Shinde wrote:
> > Hello everyone,
> >
> > I have built scikit-learn and I am ready to start coding. Can I get some
> > pointers on how I could start contributing to the projects I mentioned
> > in the earlier mail?
> >
> > Thanks,
> > Rohit.
> >
> > On Mon, Sep 7, 2015 at 11:50 AM, Rohit Shinde
> > <rohit.shinde12...@gmail.com> wrote:
> > > Hi Jacob,
> > >
> > > I am interested in Global optimization based hyperparameter
> > > optimization and Generalised Additive Models. However, I don't know
> > > what kind of background would be needed and if mine would be
> > > sufficient for it. I would like to know the prerequisites for it.
> > >
> > > On Sun, Sep 6, 2015 at 9:58 PM, Jacob Schreiber
> > > <jmschreibe...@gmail.com> wrote:
> > > > Hi Rohit
> > > >
> > > > I'm glad you want to contribute to scikit-learn! Which idea were
> > > > you interested in working on? The metric learning and GMM code is
> > > > currently being worked on by GSOC students AFAIK.
> > > >
> > > > Jacob
> > > >
> > > > On Sun, Sep 6, 2015 at 8:18 AM, Rohit Shinde
> > > > <rohit.shinde12...@gmail.com> wrote:
> > > > > Hello everyone,
> > > > >
> > > > > I am Rohit. I am interested in contributing toward scikit-learn.
> > > > > I am quite proficient in Python, Java, C++ and scheme. I have
> > > > > taken undergrad courses in Machine Learning and data mining. I
> > > > > was also part of this year's GSoC under The Opencog Foundation.
> > > > >
> > > > > I was looking at the ideas list for GSoC and I would be
> > > > > interested in working on one of those ideas. So, could I get
> > > > > some guidance?
> > > > >
> > > > > Thank you,
> > > > > Rohit Shinde.
>
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info
> http://twitter.com/GaelVaroquaux