Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-27 Thread Mathieu Blondel
Hi, I will attend ICML and probably COLT too. Not sure about a sprint but definitely up for a scikit-learn lunch / dinner. See you in NYC, Mathieu On Tue, Apr 19, 2016 at 6:00 AM, Alexandre Gramfort < alexandre.gramf...@telecom-paristech.fr> wrote: > hi Andy, > > there is no plan at this time t

Re: [Scikit-learn-general] Why does SV regression crashes here ?

2016-04-21 Thread Mathieu Blondel
Another remark is that you set C=1e3. Depending on the scaling of your data, this can be quite large. This means that the SVM is very lightly regularized (=> hard SVM) and therefore the problem is ill-conditioned. Mathieu On Thu, Apr 21, 2016 at 11:51 PM, Mathieu Blondel wrote: > By d

Re: [Scikit-learn-general] Why does SV regression crashes here ?

2016-04-21 Thread Mathieu Blondel
By default, SVC stops only when the desired tolerance is reached. If the problem is poorly scaled, this can indeed take ages. You can however set max_iter to prevent this. http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html We might want to change the default from -1 to somethin
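A minimal sketch of the max_iter suggestion, assuming a recent scikit-learn in which SVR accepts max_iter (the default of -1 means no limit). The toy data, the cap of 1000 iterations and the StandardScaler step are illustrative choices, not taken from the thread.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
# Two features on very different scales: a poorly scaled toy problem.
X = np.column_stack([rng.uniform(0, 1e4, 200), rng.uniform(0, 1e-2, 200)])
y = 1e-3 * X[:, 0] + 50.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_iter caps the solver, so fit() returns even if tol is not reached.
capped = SVR(kernel="rbf", C=1e3, max_iter=1000)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    capped.fit(X, y)

# Standardizing the inputs first usually gives a better-conditioned problem.
scaled = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1e3, max_iter=1000))
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    scaled.fit(X, y)

predictions = scaled.predict(X[:5])
```

If the cap is hit, scikit-learn emits a ConvergenceWarning; the sketch silences it only to keep the output quiet.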

Re: [Scikit-learn-general] [scikit-learn-general] Why sklearn RandomForest model take a lot of disk space after save?

2016-04-10 Thread Mathieu Blondel
You may also want to save your model using joblib (possibly with compression enabled) instead of cPickle. Mathieu On Sun, Apr 10, 2016 at 9:13 AM, Piotr Płoński wrote: > Hi All, > > I am saving RandomForestClassifier model from sklearn library with code > below > > with open('/tmp/rf.model', 'w
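A small sketch of the joblib suggestion, assuming the standalone joblib package (at the time of the thread it was bundled as sklearn.externals.joblib). The data set, file names and compression level 3 are placeholders chosen for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "rf_raw.joblib")
    small_path = os.path.join(tmp, "rf_compressed.joblib")

    joblib.dump(clf, raw_path)                # no compression
    joblib.dump(clf, small_path, compress=3)  # zlib, level 3

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)
    restored = joblib.load(small_path)

# The compressed file should round-trip to an equivalent model.
same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
```

Higher compress levels trade slower dump and load times for smaller files, so the right value depends on how often the model is reloaded.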

Re: [Scikit-learn-general] Stochastic dual coordinate ascent solver for linear models

2016-04-10 Thread Mathieu Blondel
And also in LinearSVC with dual=True. The only difference is that the choice of dual variable is cyclic (with prior permutation) instead of random. See this 2008 paper: http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf Mathieu On Sun, Apr 10, 2016 at 9:53 PM, Alexandre Gramfort < alexandre.gra

[Scikit-learn-general] Announcing scikit-learn-contrib

2016-03-30 Thread Mathieu Blondel
Dear scikit-learners, The scikit-learn team is happy to announce the creation of scikit-learn-contrib, a github organization for gathering high-quality scikit-learn compatible projects. https://github.com/scikit-learn-contrib scikit-learn-contrib currently includes two projects: - lightning: ht

Re: [Scikit-learn-general] Confidence-weighted learning

2016-03-29 Thread Mathieu Blondel
Hi Daniel, I think CW is a bit outdated and also a bit too specific (it supports only the hinge loss). Algorithms like Adagrad are more generic. Thus, I think CW is not a good candidate for inclusion in scikit-learn. That said, I would welcome a contribution in lightning: https://github.com/sciki

Re: [Scikit-learn-general] Announcing lightning v0.1

2016-03-25 Thread Mathieu Blondel
With lightning, you can train linear models on large-scale data using recent state-of-the-art optimization algorithms that are too cutting-edge to be included in scikit-learn (e.g., SDCA or SAGA). If you just want to train a logistic regression on 1000 samples, you don't need lightning :) Mathieu

Re: [Scikit-learn-general] Speed up Random Forest/ Extra Trees tuning

2016-03-21 Thread Mathieu Blondel
Related issue: https://github.com/scikit-learn/scikit-learn/issues/3652 On Tue, Mar 22, 2016 at 6:32 AM, Jacob Schreiber wrote: > It should if you're using those parameters. It's basically similar to > calculating the regularization path for LASSO, since these are also > regularization terms. I

Re: [Scikit-learn-general] "In-bag" for RandomForest*

2016-03-08 Thread Mathieu Blondel
If this function is generally useful, it might be a good idea to make it public. Mathieu On Wed, Mar 9, 2016 at 1:29 AM, Ariel Rokem wrote: > > On Mon, Mar 7, 2016 at 8:24 AM, Andreas Mueller wrote: > >> Hi Ariel. >> We are not storing them any more because of memory issues, but you can >> rec

Re: [Scikit-learn-general] load_svmlight_file value error

2016-02-12 Thread Mathieu Blondel
ts._svmlight_format._load_svmlight_file > (sklearn\datasets\_svmlight_format.c:2055) > > ValueError: could not convert string to float: > > > > But this time, it does not show any value after the error. Its blank. > Any idea why this is happening? > > > Gunjan >

Re: [Scikit-learn-general] load_svmlight_file value error

2016-02-12 Thread Mathieu Blondel
Hi Gunjan, Apparently the dataset is multi-label, so you need to use the multilabel=True option. http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html Mathieu On Fri, Feb 12, 2016 at 10:04 PM, Gunjan Dewan wrote: > Hi all, > > I am using the following datas
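To illustrate the multilabel=True option, here is a sketch with a two-line in-memory file. The contents are made up; the first line carries two comma-separated labels, which is the kind of input that cannot be parsed as a single float label.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Hypothetical svmlight data: "labels feature:value ...".
raw = b"1,3 1:0.5 3:1.2\n2 2:0.7\n"

# With multilabel=True, y is a list with one tuple of labels per sample.
X, y = load_svmlight_file(BytesIO(raw), multilabel=True)

n_samples, n_features = X.shape
first_labels = sorted(y[0])
```

File-like objects need to be opened in binary mode, hence the bytes literal.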

Re: [Scikit-learn-general] maximum and minimum regularization for NMF

2016-02-02 Thread Mathieu Blondel
I guess knowing the max alpha is useful to know where to start your grid search from. However, I think deriving max alpha for NMF should be more difficult since the problem is non-convex. Mathieu On Wed, Feb 3, 2016 at 7:40 AM, Vlad Niculae wrote: > Hi James, > > I'm not sure how useful a minim

Re: [Scikit-learn-general] Dynamic Time Warping Contribution

2015-12-07 Thread Mathieu Blondel
How do you plan to represent variable-length time series? Lists of 1d numpy arrays work but would be slow I guess. The ideal representation needs to be compatible with grid search and fast. Mathieu On Mon, Dec 7, 2015 at 10:35 AM, Dan Shiebler wrote: > Hello, > > I’m not sure if this is the cor

Re: [Scikit-learn-general] [ANN] Scikit-learn 0.17 released

2015-11-06 Thread Mathieu Blondel
ng > it for the future. > On Nov 6, 2015 04:05, "Mathieu Blondel" wrote: > >> It's a pity that people who contributed to the release are not listed >> anymore. >> >> Of course, congrats to everyone involved and in particular to our release >>

Re: [Scikit-learn-general] [ANN] Scikit-learn 0.17 released

2015-11-06 Thread Mathieu Blondel
It's a pity that people who contributed to the release are not listed anymore. Of course, congrats to everyone involved and in particular to our release managers :) M. On Fri, Nov 6, 2015 at 9:45 AM, Andreas Mueller wrote: > Hey everybody. > > I'm happy to announce the release of scikit-learn

Re: [Scikit-learn-general] Using logistic regression on a continuous target variable

2015-10-04 Thread Mathieu Blondel
I've seen logistic regression used in a regression setting in a few papers as well. A nice thing is that the predictions are mapped to [0, 1]. The correct way to add this to scikit-learn would be to add a regression class `LogisticRegressor` and rename the existing class to `LogisticClassifier`. T

Re: [Scikit-learn-general] Implementing the "Concordance correlation coefficient" in metrics

2015-09-08 Thread Mathieu Blondel
M, Andreas Mueller wrote: > > > On 09/08/2015 06:42 AM, Mathieu Blondel wrote: > >> Pearson correlation between y_true and y_pred is also a standard >> evaluation metric in genomic selection. In a sense, it can be seen as a >> ranking measure since y_true and y_pred d

Re: [Scikit-learn-general] Implementing the "Concordance correlation coefficient" in metrics

2015-09-08 Thread Mathieu Blondel
Pearson correlation between y_true and y_pred is also a standard evaluation metric in genomic selection. In a sense, it can be seen as a ranking measure since y_true and y_pred don't need to be equal: they only need to be collinear to achieve perfect correlation. +1 for adding pearson_correlation_

Re: [Scikit-learn-general] RFCC: duecredit citations for sklearn (and anything else you like ; ) )

2015-08-29 Thread Mathieu Blondel
On Sun, Aug 30, 2015 at 7:27 AM, Yaroslav Halchenko wrote: > > As long as installation is straightforward, I think it should be a minor > hurdle. It will be by default (Recommends) installed with scikit-learn, > pymvpa, > and any other related package I am maintaining in Debian/Ubuntu. It is > a

Re: [Scikit-learn-general] RFCC: duecredit citations for sklearn (and anything else you like ; ) )

2015-08-29 Thread Mathieu Blondel
Hi, Making it easier to properly cite relevant papers is something I would also really like to see addressed! I am a bit concerned that most people wouldn't want or wouldn't be able to install an external program, though. For this reason, I think the ideal solution should be web based. This could

Re: [Scikit-learn-general] Suggestion for Multiclass.py !

2015-08-18 Thread Mathieu Blondel
Hi Othman, Please send such comments to the mailing-list. Thanks, Mathieu On Tue, Aug 18, 2015 at 10:03 PM, Othman Soufan wrote: > Greetings Guys, > > First of all, I want to thank you for the nice efforts you put in this > very usable case of building and training models i.e. the case of many

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-30 Thread Mathieu Blondel
On Thu, Jul 30, 2015 at 11:38 PM, Andreas Mueller wrote: > I am mostly concerned about API explosion. > I take your point of PDF vs PMF. > Maybe predict_proba(X, y) is better. > Would you also support predict_proba(X, y) for classifiers (which would be > predict_proba(X)[np.arange(len(y)), y]) ?
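The indexing expression quoted above, predict_proba(X)[np.arange(len(y)), y], can be tried directly. This sketch assumes integer labels 0..K-1 so that a label doubles as a column index into the probability matrix; the iris data and LogisticRegression are stand-ins.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)               # shape (n_samples, n_classes)
# For each row i, pick the probability assigned to the observed class y[i].
proba_at_y = proba[np.arange(len(y)), y]   # shape (n_samples,)
```

With non-contiguous or string labels, y would first have to be mapped to column positions through clf.classes_.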

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-30 Thread Mathieu Blondel
7/29/2015 02:58 AM, Jan Hendrik Metzen wrote: > >>>> Such a predict_proba_at() method would also make sense for Gaussian > >>>> process regression. Currently, computing probability densities for GPs > >>>> requires predicting mean and standard deviation (via

Re: [Scikit-learn-general] Code contribution: Supervised PCA

2015-07-30 Thread Mathieu Blondel
He was asking about Linear Discriminant Analysis, not Latent Dirichlet Allocation. Mathieu On Thu, Jul 30, 2015 at 7:58 PM, Stylianos Kampakis < stylianos.kampa...@gmail.com> wrote: > Hi Sebastian, > > LDA is unsupervised. Supervised PCA finds components correlated with the > response variable.

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-28 Thread Mathieu Blondel
Regarding predictions, I don't really see what's the problem. Using GLMs as an example, you just need to do def predict(self, X): if self.loss == "poisson": return np.exp(np.dot(X, self.coef_)) else: return np.dot(X, self.coef_) A nice thing about Poisson regression is tha
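The predict method sketched in the post can be run as follows. TinyGLM is a hypothetical container invented for this illustration; only the loss and coef_ attributes and the body of predict come from the post.

```python
import numpy as np


class TinyGLM:
    """Hypothetical holder for a fitted coefficient vector."""

    def __init__(self, loss, coef):
        self.loss = loss
        self.coef_ = np.asarray(coef, dtype=float)

    def predict(self, X):
        linear = np.dot(X, self.coef_)
        if self.loss == "poisson":
            # Log link: the linear predictor models log E[y | x].
            return np.exp(linear)
        return linear


X = np.array([[0.0, 0.0], [1.0, 2.0]])
coef = [0.5, 0.25]

poisson_pred = TinyGLM("poisson", coef).predict(X)
linear_pred = TinyGLM("squared", coef).predict(X)
```

The exponential keeps Poisson predictions strictly positive, which is what a count model needs.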

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Mathieu Blondel
http://arxiv.org/abs/1301.3781 Submitted on 16 Jan 2013, last revised 7 Sep 2013 https://www.google.com/patents/US9037464 Filed on 15 March 2013 On Thu, Jul 2, 2015 at 4:03 AM, Matthieu Brucher wrote: > 2015-07-01 19:43 GMT+01:00 Andreas Mueller : > > > > > > On 07/01/2015 02:42 PM, Lars Buitin

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Mathieu Blondel
On Wed, Jul 1, 2015 at 8:43 PM, Dale Smith wrote: > Apparently so; here is a python/cython implementation. > > > > http://rare-technologies.com/deep-learning-with-word2vec-and-gensim/ > word2vec is *not* deep learning. The skip-gram model has been shown recently to reduce to a certain matrix fa

Re: [Scikit-learn-general] Library of pre-trained models

2015-06-30 Thread Mathieu Blondel
For unsupervised models that take a long time to train, such as deep learning or word2vec based feature extractors, this can be pretty useful. Regardless, a major issue is that we still haven't figured out how to robustly solve model persistence. Mathieu On Wed, Jul 1, 2015 at 4:53 AM, Andreas M

Re: [Scikit-learn-general] Warm_start on Random Forest Classifiers

2015-06-30 Thread Mathieu Blondel
To maximize accuracy, n_estimators should ideally be as high as possible, yet we would like to use a reasonable value to limit training and prediction times. The new warm_start option is a nice way to incrementally add more trees until you reach a satisfying accuracy. Warm start in linear models i
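A sketch of growing a forest incrementally with warm_start. The data set, the step of 10 trees and the stopping point of 30 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
scores = {}
first_tree = None
for n_trees in (10, 20, 30):
    clf.set_params(n_estimators=n_trees)
    clf.fit(X_train, y_train)  # only the missing trees are fitted
    if first_tree is None:
        first_tree = clf.estimators_[0]
    scores[n_trees] = clf.score(X_test, y_test)
```

In practice the loop would stop once the held-out score in scores stops improving.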

Re: [Scikit-learn-general] Estimator Overview / Summary

2015-06-06 Thread Mathieu Blondel
https://github.com/scikit-learn/scikit-learn/pull/804 Thanks for working on this! Mathieu On Sun, Jun 7, 2015 at 6:11 AM, Andy wrote: > Hi all. > I vaguely remember there once was an idea to add a page to the > documentation that shows all the different models and their > characteristics. > Wa

Re: [Scikit-learn-general] Sparse output from metrics.pairwise.cosine_similarity

2015-06-05 Thread Mathieu Blondel
Sounds like a good idea. PR welcome. Mathieu On Fri, Jun 5, 2015 at 8:41 PM, Jaidev Deshpande wrote: > Hello, > > I noticed that the cosine similarity function calls safe_sparse_dot, and > makes it produce a dense output. Would it be a good idea to expose the > dense_output argument of safe_sp

Re: [Scikit-learn-general] LogisticRegression: sample vs class weights

2015-04-20 Thread Mathieu Blondel
Last time I checked, liblinear didn't support sample weights, just class weights (one for positive samples and another for negative samples). Mathieu On Tue, Apr 21, 2015 at 5:56 AM, iBayer wrote: > Hi, > I was surprised to read that class weights are implemented via sampling > for LogisticReg

Re: [Scikit-learn-general] Adaline (adaptive linear neuron) classifier

2015-04-05 Thread Mathieu Blondel
On Mon, Apr 6, 2015 at 12:00 AM, Andy wrote: > Hi Sebastian. > First off, if this is a classification algorithm with sum of squared > errors, you can just do it using linear regression + OvRClassifier, right? > This is also what RidgeClassifier does, only in a smarter way (Cholesky decomposition

Re: [Scikit-learn-general] GSoC 2015: Global optimization based Hyper parameter optimization (SMAC)

2015-03-31 Thread Mathieu Blondel
On Wed, Apr 1, 2015 at 4:05 AM, Vlad Niculae wrote: > Hi Gael, > > > On 31 Mar 2015, at 14:01, Gael Varoquaux > wrote: > > > >> Why do you think the GP route is easier? > > > > Because we already have GPs. > We have a GP implementation but it's being rewritten... > Well, we already have rando

Re: [Scikit-learn-general] New Cython API for BLAS and LAPACK in SciPy

2015-03-27 Thread Mathieu Blondel
On Sat, Mar 28, 2015 at 9:25 AM, Sturla Molden wrote: > Mathieu Blondel wrote: > > > What is the best way to detect whether this functionality is available? > (in > > order to write code which works with older versions of SciPy too) > > To write code that works wit

Re: [Scikit-learn-general] New Cython API for BLAS and LAPACK in SciPy

2015-03-27 Thread Mathieu Blondel
This is really nice. Thanks for the heads up! What is the best way to detect whether this functionality is available? (in order to write code which works with older versions of SciPy too) Is there online documentation yet? Thanks, Mathieu On Sat, Mar 28, 2015 at 12:46 AM, Sturla Molden wrote:

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-26 Thread Mathieu Blondel
a custom metric, and > Spectral Clustering and Affinity Propagation can work with a [n_samples, > n_samples] affinity matrix. > > On Thu, Mar 26, 2015 at 12:08 PM, Mathieu Blondel > wrote: > >> >> >> On Thu, Mar 26, 2015 at 5:49 PM, Artem wrote: >> >>>

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-26 Thread Mathieu Blondel
On Thu, Mar 26, 2015 at 5:49 PM, Artem wrote: > 1. Right, forgot to add that parameter. Well, I can apply an RBF kernel to > get a similarity matrix from a distance matrix inside transform. > > 2. Usual transformer returns neither distance, nor similarity, but > transforms the input space so that

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-26 Thread Mathieu Blondel
other methods mention this > approach, too. > > Added an example to the proposal > <https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Metric-Learning-module#api>. > Names are a bit awkward, but couldn't think of better ones. > > On Thu, Mar 26, 2015

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-25 Thread Mathieu Blondel
posal to melange tomorrow, so if you have > comments — please reply. > > Also, if some of previous objections were not addressed, please repeat > them. ​I might have missed something. > > On Wed, Mar 25, 2015 at 5:05 AM, Mathieu Blondel > wrote: > >> I think the problem w

Re: [Scikit-learn-general] [GSoC 2015] Cross-validation and Meta-Estimators for semi-supervised learning

2015-03-24 Thread Mathieu Blondel
The part I am most enthusiastic about is fixing the CV generators, though this could be a merge nightmare since we are in the process of changing the API. We need it to figure out which modifications are most likely to get in first. Lars did some work on semi-supervised naive bayes. Since this is

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-24 Thread Mathieu Blondel
I think the problem with matrix-like Y is that Y would be symmetric. Thus for doing cross-validation one would need to select both rows and columns. This is why I suggested to add a _pairwise_y property like the _pairwise property that we use in kernel methods, e.g., https://github.com/scikit-learn

Re: [Scikit-learn-general] My personal suggestion regarding topics for GSoC (and my official application :-) )

2015-03-24 Thread Mathieu Blondel
Hi Lucas, Instead of creating a new thread every time, it would be nice if you could reply directly in the same thread. This would make the discussion easier to follow. (To do so you need to be fully subscribed to the ML. I'm guessing you may be subscribed to the digest version) Thanks, M. On W

Re: [Scikit-learn-general] Pearson Correlation Similarity Measure

2015-03-23 Thread Mathieu Blondel
The cosine similarity and Pearson correlation are the same if the data is centered but are different in general. The routine in SciPy is between two vectors; metrics in scikit-learn are between matrices. So +1 to add Pearson correlation to scikit-learn. On Mon, Mar 23, 2015 at 3:24 PM, Gael Va
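The relationship between the two measures can be checked numerically. This sketch compares rows of a random matrix; centering here means subtracting each row's own mean.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, size=(4, 50))  # rows have a clearly non-zero mean

pearson = np.corrcoef(X)               # row-by-row Pearson correlation
cosine_raw = cosine_similarity(X)

# Center each row, then take the cosine: this reproduces Pearson.
X_centered = X - X.mean(axis=1, keepdims=True)
cosine_centered = cosine_similarity(X_centered)

agree_when_centered = np.allclose(cosine_centered, pearson)
agree_when_raw = np.allclose(cosine_raw, pearson)
```

On uncentered data the cosine is dominated by the shared offset, which is why the two disagree.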

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-21 Thread Mathieu Blondel
I skimmed through this survey: http://arxiv.org/abs/1306.6709 For methods that learn a Mahalanobis distance, as Artem said, we can indeed compute the Cholesky decomposition of the learned precision matrix and use it to transform the data. Thus in this case metric learning can be seen as supervised
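A numerical sketch of the Cholesky argument: if M = L L^T, then the Mahalanobis distance under M equals the Euclidean distance after mapping each sample x to x L. The matrix M below is a random positive-definite stand-in, not the output of any metric learning algorithm.

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.normal(size=(3, 3))
M = A @ A.T + 0.1 * np.eye(3)   # symmetric positive definite stand-in

L = np.linalg.cholesky(M)       # lower triangular, M = L @ L.T
X = rng.normal(size=(5, 3))
X_new = X @ L                   # transformed data

diff = X[0] - X[1]
d_mahalanobis = np.sqrt(diff @ M @ diff)
d_euclidean = np.linalg.norm(X_new[0] - X_new[1])
```

Because the transformed data lives in an ordinary Euclidean space, any downstream estimator can consume X_new unchanged.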

Re: [Scikit-learn-general] Scikit-learn sprint in Paris, April 2nd

2015-03-13 Thread Mathieu Blondel
How about we meet at ICML 2015 in Lille? I am personally planning to attend, although I might be a bit too tired for coding :). Mathieu On Fri, Mar 13, 2015 at 4:10 PM, Nelle Varoquaux wrote: > > There will also be a larger sprint in summer, right? > > If people are not too bored of Paris, why

Re: [Scikit-learn-general] [ANN] scikit-learn 0.16b1 is out!

2015-03-11 Thread Mathieu Blondel
On Tue, Mar 10, 2015 at 12:01 PM, Andy wrote: > On 03/09/2015 10:44 PM, Joel Nothman wrote: > > Congratulations! This has been a long time coming, and if not only for the > swathe of features it'll be great to see the documentation improvements > appearing on stable soon! > > My thoughts on dev

Re: [Scikit-learn-general] Perceptron implementation: Perceptron Rule or Stochastic Gradient Descent?

2015-02-23 Thread Mathieu Blondel
> not real. > > On Mon, Feb 23, 2015 at 6:35 PM, Andy wrote: > >> So indeed in the perceptron update yi_pred is {-1, 1}, not real, in >> sklearn, right? >> >> >> >> On 02/23/2015 08:35 AM, Mathieu Blondel wrote: >> >> Rosenblatt's Perceptron

Re: [Scikit-learn-general] Perceptron implementation: Perceptron Rule or Stochastic Gradient Descent?

2015-02-23 Thread Mathieu Blondel
Rosenblatt's Perceptron is a special case of SGD, see: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/tests/test_perceptron.py The perceptron loss leads to sparser weight vectors than the hinge loss in the sense that it updates the weight vector less aggressively (on

Re: [Scikit-learn-general] CV scores vs scores on a manual split

2015-02-20 Thread Mathieu Blondel
On Fri, Feb 20, 2015 at 6:57 AM, Andy wrote: > You give the roc_auc_score the result of "predict". You should give it > the result of "predict_proba". > > This came up already quite a bit, not sure how we can avoid people making > this mistake. > We can encourage people to use the scorer API mo
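A sketch of the mistake described above and of the scorer-based alternative, assuming a recent scikit-learn where get_scorer("roc_auc") is available. The data set and classifier are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=20, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard labels throw away the ranking information AUC relies on.
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))

# Continuous scores are what the metric expects.
auc_from_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# The scorer picks a suitable continuous output by itself.
auc_from_scorer = get_scorer("roc_auc")(clf, X_test, y_test)
```

Passing scoring="roc_auc" to cross-validation helpers goes through the same scorer, so the choice of method never has to be spelled out by hand.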

Re: [Scikit-learn-general] same cross validation score with different parameter configurations

2015-02-18 Thread Mathieu Blondel
Use the source, Luke https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/grid_search.py#L540 M. On Thu, Feb 19, 2015 at 7:24 AM, Pagliari, Roberto wrote: > When different parameter configurations produce the same CV score, how > does sklearn select the best parameters (I’m mostly

Re: [Scikit-learn-general] GSoC2015 topics

2015-02-12 Thread Mathieu Blondel
A grid-search related project could be useful: - multiple metric support (e.g., find the best model w.r.t. f1 score and the best model w.r.t. AUC) - data independent cv iterators ( https://github.com/scikit-learn/scikit-learn/issues/2904) - anything else? Mathieu On Thu, Feb 12, 2015 at 5:53 PM,

Re: [Scikit-learn-general] GSoC2015 topics

2015-02-12 Thread Mathieu Blondel
+1 on the CCA / PLS refactoring, but this would require a student who is already well versed on these subjects. Mentoring could be an issue as well. Mathieu On Thu, Feb 12, 2015 at 4:14 PM, Gael Varoquaux < gael.varoqu...@normalesup.org> wrote: > On Thu, Feb 12, 2015 at 02:10:11AM -0500, Ronnie

Re: [Scikit-learn-general] Adding Barnes-Hut t-SNE

2014-12-27 Thread Mathieu Blondel
On Thu, Dec 25, 2014 at 4:59 AM, Andy wrote: > I recently read about the approximation and I think it would be a great > addition. > Do you think it makes sense to include it via an ``algorithm`` parameter to > tSNE? > I totally agree with what Kyle said about demonstrating speedups and > approx

Re: [Scikit-learn-general] Exclusivity of scikit-learn

2014-12-03 Thread Mathieu Blondel
As you mentioned popular methods from scikit-learn-contrib could be promoted to scikit-learn. Conversely, methods which became obsolete in scikit-learn could move to scikit-learn-contrib to lower the maintenance burden. Mathieu On Thu, Dec 4, 2014 at 12:26 AM, Mathieu Blondel wrote: >

Re: [Scikit-learn-general] Exclusivity of scikit-learn

2014-12-03 Thread Mathieu Blondel
> > On Wed, Dec 3, 2014 at 5:25 AM, Joel Nothman > wrote: > >> >> I agree. We should ammend this sentence to say that if the paper is an >>> clear-cut improvement on top of a very used method, it should be >>> examinded. >> >> >> Done <h

Re: [Scikit-learn-general] AdaptiveSGD

2014-12-03 Thread Mathieu Blondel
A compromise would be to just implement the Cython routine in a separate file, while sharing the same file for the pure Python side. That said, using a separate class for Adagrad would make it possible to get rid of irrelevant hyper-parameters. Some code from the SGD module can probably be factorized and OVR

Re: [Scikit-learn-general] Exclusivity of scikit-learn

2014-12-03 Thread Mathieu Blondel
On Wed, Dec 3, 2014 at 4:09 PM, Joel Nothman wrote: > Hi Tom, > > Anyone is welcome to publish their implementations in a format compatible > with scikit-learn's estimators. However, the centralised project already > takes a vast amount of work (almost all of it unpaid) to maintain, even > while

Re: [Scikit-learn-general] design of scorer interface

2014-11-28 Thread Mathieu Blondel
On Sat, Nov 29, 2014 at 11:33 AM, Aaron Staple wrote: > Hi Mathieu, > > Thanks for the information you’ve provided about the ridge implementation > and your suggestions for scoring rankings. > > First off, I’d like to try and contain the scope of the project I’m > working on. Would it be reasonab

Re: [Scikit-learn-general] design of scorer interface

2014-11-28 Thread Mathieu Blondel
I forgot to mention that in "Ridge", decision_function is an alias for predict, precisely to allow grid searching against AUC and other ranking metrics. M. On Sat, Nov 29, 2014 at 12:50 AM, Mathieu Blondel wrote: > > > On Sat, Nov 29, 2014 at 12:29 AM, Michael Eickenberg

Re: [Scikit-learn-general] design of scorer interface

2014-11-28 Thread Mathieu Blondel
assume that all regressors inherit from RegressorMixin. M. Michael > > On Fri, Nov 28, 2014 at 4:05 PM, Mathieu Blondel > wrote: > >> Here's a proof of concept that introduces a new method "predict_score": >> >> https://github.com/mblondel/scikit-lea

Re: [Scikit-learn-general] design of scorer interface

2014-11-28 Thread Mathieu Blondel
side the scorer to detect if an estimator is a regressor and use predict instead of predict_proba / decision_function. This assumes that the estimator inherits from RegressorMixin and therefore, the code must depend on scikit-learn. M. On Fri, Nov 28, 2014 at 7:40 PM, Mathieu Blondel wrote: >

Re: [Scikit-learn-general] design of scorer interface

2014-11-28 Thread Mathieu Blondel
On Fri, Nov 28, 2014 at 5:14 PM, Aaron Staple wrote: > [...] > However, I tried to run a couple of test cases with 0-1 predictions for > RidgeCV and classification with RidgeClassifierCV, and I got some error > messages. It looks like one reason for this is that > LinearModel._center_data can con

Re: [Scikit-learn-general] GPs in sklearn

2014-11-25 Thread Mathieu Blondel
On Wed, Nov 26, 2014 at 2:37 AM, Andy wrote: > > What I think would be great to have is gradient based optimization of > the kernel parameters +1 This is one of the most appealing features of GPs IMO. Mathieu

[Scikit-learn-general] NIPS

2014-11-18 Thread Mathieu Blondel
Hi, Anyone from the mailing-list going to NIPS this year? See you there, Mathieu

Re: [Scikit-learn-general] design of scorer interface

2014-10-28 Thread Mathieu Blondel
Different metrics require different inputs (results of predict, decision_function, predict_proba). To avoid branching in the grid search and cross-validation, we thus introduced the scorer API. A scorer knows what kind of input it needs and calls predict, decision_function, predict_proba as needed.
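A sketch of the uniform call signature scorer(estimator, X, y), assuming a recent scikit-learn where the scorer names are "accuracy" and "neg_log_loss" (the names differed in 2014). Each scorer calls the estimator method its metric needs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, get_scorer, log_loss

X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Same call for both scorers; no branching on the metric type.
accuracy = get_scorer("accuracy")(clf, X, y)
neg_log_loss = get_scorer("neg_log_loss")(clf, X, y)

# What each scorer does under the hood, written out by hand.
manual_accuracy = accuracy_score(y, clf.predict(X))
manual_neg_log_loss = -log_loss(y, clf.predict_proba(X))
```

The sign flip on the log loss follows the convention that a scorer always returns a value where higher is better.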

Re: [Scikit-learn-general] design of scorer interface

2014-10-27 Thread Mathieu Blondel
In addition to out-of-bag scores and multi-metric grid search, there are also LOO scores in the ridge regression module, as pointed out by Michael. Option 4 seems like the best option to me. We keep __call__(self, estimator, X, y) for backward compatibility and because it is sometimes more conveni

Re: [Scikit-learn-general] error when using linear SVM with AdaBoost

2014-10-04 Thread Mathieu Blondel
On Sat, Oct 4, 2014 at 1:09 AM, Andy wrote: > > I'm pretty sure that is wrong, unless you use the "decision_function" > and not "predict_proba" or "predict". > Mathieu said "predict" is used. Then it is still like a (very old > school) neural network with a thresholding layer, > and not like a li

Re: [Scikit-learn-general] error when using linear SVM with AdaBoost

2014-10-03 Thread Mathieu Blondel
-27 4:51 GMT+02:00 Mathieu Blondel : > > This is because LinearSVC doesn't support sample_weight. > > > > I added a new issue for raising a more explicit error message: > > https://github.com/scikit-learn/scikit-learn/issues/3711 > > > > BTW, a linear co

Re: [Scikit-learn-general] error when using linear SVM with AdaBoost

2014-09-27 Thread Mathieu Blondel
aboost. And it doesn't seem to improve upon a single linear SVM, see the link below. I used SVC(kernel="linear") since it supports sample_weight. http://mblondel.org/images/adaboost.png M. On Sat, Sep 27, 2014 at 3:22 PM, Andy wrote: > On 09/27/2014 04:51 AM, Mathieu Blondel wrote

Re: [Scikit-learn-general] train_test_split return values

2014-09-26 Thread Mathieu Blondel
On Fri, Sep 19, 2014 at 5:32 AM, Pagliari, Roberto wrote: > When using train_test_split, is the output a reference to the input data, > or a deep copy? > Well, try to modify the output and see if the original data got modified. Then you get the answer to your question. M.
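The experiment suggested in the reply can be written down in a few lines. This sketch assumes NumPy array inputs, the default shuffling, and the current import path sklearn.model_selection (the function lived in sklearn.cross_validation at the time); other input types or settings may behave differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
X_before = X.copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Overwrite the training split and see whether the input changed.
X_train[:] = -1.0

original_unchanged = np.array_equal(X, X_before)
shares_memory = np.shares_memory(X, X_train)
```

np.shares_memory answers the question directly, without having to modify anything.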

Re: [Scikit-learn-general] error when using linear SVM with AdaBoost

2014-09-26 Thread Mathieu Blondel
This is because LinearSVC doesn't support sample_weight. I added a new issue for raising a more explicit error message: https://github.com/scikit-learn/scikit-learn/issues/3711 BTW, a linear combination of linear models is a linear model itself. So you can't learn a better model than a LinearSVC(
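The claim that a linear combination of linear models is itself linear can be verified with plain NumPy. The weights below are random placeholders; the identity holds for the real-valued decision functions, not for thresholded predictions.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(6, 4))
W = rng.normal(size=(3, 4))          # three linear models, one per row
b = rng.normal(size=3)               # their intercepts
alpha = np.array([0.5, 0.3, 0.2])    # ensemble weights

# Weighted sum of the three decision functions.
ensemble = sum(a * (X @ w + b_i) for a, w, b_i in zip(alpha, W, b))

# The same function expressed as a single linear model.
w_combined = alpha @ W
b_combined = alpha @ b
single = X @ w_combined + b_combined
```

Once each base model's output passes through a sign or threshold before being combined, the ensemble is no longer a single linear model.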

Re: [Scikit-learn-general] Group Lasso

2014-09-23 Thread Mathieu Blondel
`CDClassifier` in my project lightning supports group-lasso for multi-class classification: http://www.mblondel.org/lightning/generated/lightning.classification.CDClassifier.html#lightning.classification.CDClassifier Groups are defined as the class weights for each feature and cannot be changed.

Re: [Scikit-learn-general] Sparse Gradient Boosting & Fully Corrective Gradient Boosting

2014-09-21 Thread Mathieu Blondel
On Sun, Sep 21, 2014 at 2:04 AM, Olivier Grisel wrote: > 2014-09-20 8:04 GMT-07:00 Mathieu Blondel : > > > > I recently re-implemented gradient boosting [2]. > > I am interested in your feedback in implementing trees with numba. Is > it easy to reach the speed the sciki

Re: [Scikit-learn-general] Sparse Gradient Boosting & Fully Corrective Gradient Boosting

2014-09-21 Thread Mathieu Blondel
On Sun, Sep 21, 2014 at 1:55 AM, Olivier Grisel wrote: > On a related note, here is an implementation of Logistic Regression > applied to one-hot features obtained from leaf membership info of a > GBRT model: > > > http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/In

Re: [Scikit-learn-general] Sparse Gradient Boosting & Fully Corrective Gradient Boosting

2014-09-21 Thread Mathieu Blondel
Hi Ken, On Sun, Sep 21, 2014 at 4:16 AM, c TAKES wrote: > > Understandable that scikit-learn wants to focus on more mature algorithms, > so perhaps I'll spend my efforts more on writing a python wrapper for > Johnson and Zhang's implementation of RGF, at least for now. Personally I > do think i

Re: [Scikit-learn-general] Sparse Gradient Boosting & Fully Corrective Gradient Boosting

2014-09-20 Thread Mathieu Blondel
cision > tree algorithms. > > Ken > > > > > > > > On Tue, Sep 16, 2014 at 11:16 AM, Peter Prettenhofer < > peter.prettenho...@gmail.com> wrote: > >> The only reference I know is the Regularized Greedy Forest paper by >> Johnson and Zhang [1] >&

Re: [Scikit-learn-general] Backward compat policy in utils

2014-09-17 Thread Mathieu Blondel
Andy, Indeed, this will mostly depend on the number of public utils we have. However, using submodules can help structure our public utils. M. On Wed, Sep 17, 2014 at 6:32 PM, Andy wrote: > On 09/15/2014 03:40 PM, Mathieu Blondel wrote: > >> lightning is using the fol

Re: [Scikit-learn-general] Sparse Gradient Boosting & Fully Corrective Gradient Boosting

2014-09-16 Thread Mathieu Blondel
Could you give a reference for gradient boosting with fully corrective updates? Since the philosophy of gradient boosting is to fit each tree against the residuals (or negative gradient) so far, I am wondering how such fully corrective update would work... Mathieu On Tue, Sep 16, 2014 at 9:16 AM

Re: [Scikit-learn-general] Backward compat policy in utils

2014-09-15 Thread Mathieu Blondel
rator @deprecated_util to automate the task. Mathieu On Sat, Sep 13, 2014 at 11:22 AM, Mathieu Blondel wrote: > We should survey what other packages use. I'll have a look at what > lightning uses later. > > Mathieu > > On Sat, Sep 13, 2014 at 2:23 AM, Andy wrote: > >

Re: [Scikit-learn-general] Backward compat policy in utils

2014-09-12 Thread Mathieu Blondel
> everything ^^) > > Also we need to add utils to the References then. > No idea how to decide what should be public and what not, though. > > > > On 09/08/2014 04:01 PM, Mathieu Blondel wrote: > > Maintaining backward compatibility for a subset of the utils only means

Re: [Scikit-learn-general] outlier measure random forest

2014-09-08 Thread Mathieu Blondel
On Mon, Sep 8, 2014 at 11:55 PM, Gilles Louppe wrote: > I am rather -1 on making this a transform. There are many ways to come > up with proximity measures in forest -- In fact, I don't think > Breiman's is particularly well designed. > I think this is actually an argument for non-inclusion in th

Re: [Scikit-learn-general] outlier measure random forest

2014-09-08 Thread Mathieu Blondel
This could be a transform method added to RandomForestClassifier / RandomForestRegressor. On Mon, Sep 8, 2014 at 11:14 PM, Gilles Louppe wrote: > Hi Luca, > > This may not be the fastest implementation, but random forest > proximities can be computed quite straightforwardly in Python given > our
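
A rough sketch of the proximity idea under discussion, assuming the usual definition: the proximity of two samples is the fraction of trees in which they land in the same leaf. The input is assumed to be a table of leaf indices shaped like the output of a forest's `apply` method (one row per sample, one column per tree). The function name and the toy leaf indices are illustrative only.

```python
def forest_proximity(leaves):
    """leaves[i][t] is the index of the leaf that sample i reaches in tree t."""
    n_samples = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_samples):
            # Count the trees in which samples i and j share a leaf.
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = shared / n_trees
    return prox


# Toy leaf indices for 3 samples and 3 trees (made up for illustration).
leaves = [[1, 2, 3], [1, 2, 4], [5, 6, 7]]
prox = forest_proximity(leaves)
```

The double loop is quadratic in the number of samples, so this is only meant to show the definition, not to be fast.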

Re: [Scikit-learn-general] Backward compat policy in utils

2014-09-08 Thread Mathieu Blondel
Maintaining backward compatibility for a subset of the utils only means that from now on we will have to decide whether a util deserves to be public or not. While we are at it, I would rather make it explicit and use an underscore prefix for private utils and no prefix for public utils. This can b

Re: [Scikit-learn-general] partial-fit in gradient boosting

2014-08-31 Thread Mathieu Blondel
> Is there any other way through which I can train GradientBoostingRegressor for this dataset? No, not yet. However, our implementation of gradient boosting has a `subsample` option for using a subset of the data when building each tree (this is called stochastic gradient boosting in the literatu

Re: [Scikit-learn-general] sparse datasets loading

2014-08-31 Thread Mathieu Blondel
from the network. > > > 2014-08-31 10:56 GMT+02:00 Mathieu Blondel : > > Do you store zero entries explicitly in your CSV format? CSV doesn't >> strike me as the best choice for representing sparse data... >> >> M. >> >> >> On Sun, Aug 31, 2014

Re: [Scikit-learn-general] sparse datasets loading

2014-08-31 Thread Mathieu Blondel
Do you store zero entries explicitly in your CSV format? CSV doesn't strike me as the best choice for representing sparse data... M. On Sun, Aug 31, 2014 at 5:21 PM, Eustache DIEMERT wrote: > @Lars, shouldn't the last line of the for loop be > > indptr.append(indptr[-1]+len(nonzero)) > > rat
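
For context, the `indptr` line quoted above belongs to the usual way of building a CSR matrix row by row. The following is a sketch under the assumption that rows arrive as plain dense lists. The function name `rows_to_csr` and the toy data are hypothetical and are not the code from the thread.

```python
def rows_to_csr(rows):
    """Build the three CSR arrays (data, indices, indptr) from dense rows."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        # Keep only the non-zero entries of this row, with their column index.
        nonzero = [(j, v) for j, v in enumerate(row) if v != 0]
        for j, v in nonzero:
            indices.append(j)
            data.append(v)
        # Each indptr entry is the previous one plus this row's non-zero count.
        indptr.append(indptr[-1] + len(nonzero))
    return data, indices, indptr


rows = [[0, 2, 0], [0, 0, 0], [1, 0, 3]]
data, indices, indptr = rows_to_csr(rows)
```

The three lists can then be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))`. An empty row simply repeats the previous `indptr` value.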

Re: [Scikit-learn-general] Optimal Subset Selection Code Contribution

2014-08-21 Thread Mathieu Blondel
There was a thread on the mailing-list a while ago on instance reduction methods. It was decided to not include such methods for the time being as changing n_samples is not supported by transformers or pipelines. It is also not clear yet how such methods would play with grid search, for instance.

Re: [Scikit-learn-general] Random Subspace Ensemble Method

2014-08-17 Thread Mathieu Blondel
I believe random subspace ensembles are subsumed by the BaggingClassifier / BaggingRegressor estimators. See the class documentation. The proportion of features used is controlled by max_features. M. On Mon, Aug 18, 2014 at 8:51 AM, Dayvid Victor wrote: > Hello Everybody, > > I was looking for

Re: [Scikit-learn-general] Libsvm, probabilities and weights

2014-08-12 Thread Mathieu Blondel
sample_weights in scikit-learn comes from a libsvm patch: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances So it would seem like probability calibration was omitted from this patch :-( When our calibration module is ready, we could handle the calibration post-processing o

Re: [Scikit-learn-general] ElasticNet for classification

2014-07-24 Thread Mathieu Blondel
On Fri, Jul 25, 2014 at 1:46 AM, Alexandre Gramfort < alexandre.gramf...@telecom-paristech.fr> wrote: > > indeed but squared loss is cheap to use and can reach pretty good > classif performance in practice. > Indeed the squared loss works surprisingly well in practice for classification and it ha

Re: [Scikit-learn-general] sparse matrices with LinearSVC

2014-07-24 Thread Mathieu Blondel
On Thu, Jul 24, 2014 at 2:46 PM, Pagliari, Roberto wrote: > I also tried to import sparse.LinearSVC, but it says svm has no module > named sparse…. > > > I don't know where you get your documentation but sparse.LinearSVC was removed about 3 years ago... :-) Mathieu --

Re: [Scikit-learn-general] Beta regression

2014-07-22 Thread Mathieu Blondel
statsmodels has a GLM module but apparently no beta regression. There is also a scikit-learn compatible wrapper around the GLM module here: https://github.com/jcrudy/glm-sklearn Mathieu On Mon, Jul 21, 2014 at 10:54 PM, Gavin Gray wrote: > Checking the documentation it looks like Scikit-learn

Re: [Scikit-learn-general] Evaluation measure for imbalanced data

2014-07-22 Thread Mathieu Blondel
AUC (area under the ROC curve) is commonly used for imbalanced binary classification problems. The AUC is the probability that your classifier will rank a positive sample higher than a negative sample (where the ranking is computed using the "decision_function" scores). In scikit-learn, it is imple
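
The ranking interpretation can be checked directly by counting pairs. This is a small pure-Python sketch, assuming binary labels in {0, 1} and counting ties as half a win. The function name `pairwise_auc` and the toy scores are illustrative.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
```

Here three of the four positive/negative pairs are ordered correctly, so the value is 0.75. `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.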

Re: [Scikit-learn-general] Confidence score for each prediction from regressor

2014-07-22 Thread Mathieu Blondel
On Wed, Jul 23, 2014 at 4:47 AM, Peter Prettenhofer < peter.prettenho...@gmail.com> wrote: > > An alternative is to use a GradientBoostingRegressor with quantile loss to > generate prediction intervals (see [1]) -- only for the keen - I've once > used that unsuccessfully in a Kaggle comp. It's not

Re: [Scikit-learn-general] ElasticNet for classification

2014-07-22 Thread Mathieu Blondel
from sklearn.multiclass import OneVsRestClassifier clf = OneVsRestClassifier(ElasticNet()) should work. This is tested here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_multiclass.py#L168 For setting the parameters by grid-search, you need to use the "estimator__

Re: [Scikit-learn-general] higher accuracy with non scaled data

2014-07-10 Thread Mathieu Blondel
s to start with? > > Thanks > -- > Sheila > > On 8 July 2014 17:02, Mathieu Blondel wrote: > >> >> >> >> On Tue, Jul 8, 2014 at 11:27 PM, Sheila the angel > > wrote: >> >>> First I scaled the complete data-set and then split it

Re: [Scikit-learn-general] higher accuracy with non scaled data

2014-07-08 Thread Mathieu Blondel
On Tue, Jul 8, 2014 at 11:27 PM, Sheila the angel wrote: > First I scaled the complete data-set and then split it into test and > train data. > You should not pre-process the data before splitting it. Just ask yourself how you would use your model in practice. In a real-world setting, you woul
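
To make the point concrete, here is a minimal sketch of scaling done after the split, assuming simple standardization of a single feature. The helper names `fit_scaler` and `scale` and the toy numbers are hypothetical. The mean and standard deviation are estimated on the training data only and then reused unchanged on the test data.

```python
import statistics


def fit_scaler(train):
    """Estimate scaling statistics from the training data only."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)  # population standard deviation
    return mean, std


def scale(values, mean, std):
    """Apply previously fitted statistics to any data (train or test)."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0]
test = [4.0]

mean, std = fit_scaler(train)           # the test data is never seen here
train_scaled = scale(train, mean, std)
test_scaled = scale(test, mean, std)    # reuses the training statistics
```

In scikit-learn the same pattern is usually obtained by putting the scaler and the estimator in a `Pipeline`, so that each cross-validation fold fits the scaler on its own training part.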

Re: [Scikit-learn-general] Retrieve the coefficients of fitted polynomial using LASSO

2014-06-29 Thread Mathieu Blondel
Hi Fernando, On Sun, Jun 29, 2014 at 1:53 PM, Fernando Paolo wrote: > Hello, > > I must be missing something obvious because I can't find the "actual" > coefficients of the polynomial fitted using LassoCV. That is, for a 3rd > degree polynomial > > p = a0 + a1 * x + a2 * x^2 + a3 * x^3 > > I w
