Re: [scikit-learn] KMeans with cosine similarity

2016-06-02 Thread Joel Nothman
In short, no, monkey patching cosine_similarity in place of euclidean_distances will not work. See for instance this Cross Validated post: http://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric You could try out this Kernel KMeans impl

Re: [scikit-learn] memory efficient feature extraction

2016-06-06 Thread Joel Nothman
> - concatenation of these arrays into a single CSR array appears to be > non-trivial given the memory constraints (e.g. scipy.sparse.vstack > transforms all arrays to COO sparse representation internally). There is a fast path for stacking a series of CSR matrices. On 6 June 2016 at 22:19, Rom
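A minimal sketch of why row-wise stacking of CSR blocks can avoid a COO round trip: the `data` and `indices` arrays are concatenated as they are, and each block's `indptr` is shifted by the number of values stored so far. The helper name `stack_csr_rows` is hypothetical and is not scipy's internal code; `scipy.sparse.vstack` appears only as a cross-check.

```python
import numpy as np
import scipy.sparse as sp


def stack_csr_rows(blocks):
    """Stack CSR matrices that share a column count, staying in CSR."""
    n_cols = blocks[0].shape[1]
    data = np.concatenate([b.data for b in blocks])
    indices = np.concatenate([b.indices for b in blocks])
    # Row pointers: drop each block's leading 0 and shift the rest by the
    # number of stored values that precede the block.
    indptr_parts = [np.zeros(1, dtype=np.int64)]
    offset = 0
    for b in blocks:
        indptr_parts.append(b.indptr[1:].astype(np.int64) + offset)
        offset += int(b.indptr[-1])
    indptr = np.concatenate(indptr_parts)
    n_rows = sum(b.shape[0] for b in blocks)
    return sp.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))


blocks = [
    sp.random(3, 5, density=0.4, format="csr", random_state=0),
    sp.random(2, 5, density=0.4, format="csr", random_state=1),
]
stacked = stack_csr_rows(blocks)
reference = sp.vstack(blocks, format="csr")
```

Only three one-dimensional arrays are copied, so peak memory stays close to the size of the final matrix.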

[scikit-learn] The culture of commit squashing

2016-06-13 Thread Joel Nothman
For the last few years, there's been a notion that we should squash PRs down to a single commit before merging. Squashing can give a cleaner commit history, and avoid overrepresentation of minor work given silly commit count metrics used by Github and others. I'm not sure if there are other motivat

Re: [scikit-learn] The culture of commit squashing

2016-06-13 Thread Joel Nothman
, it helps tracking down code in the commit history in the > long run, but that’s just my personal opinion. > > Best, > Sebastian > > > > On Jun 13, 2016, at 9:36 PM, Joel Nothman > wrote: > > > > For the last few years, there's been a notion that we sho

Re: [scikit-learn] Adding BM25 relevance function to sklearn.feature_extraction.text

2016-06-13 Thread Joel Nothman
Hi Basil, Scikit-learn isn't a library for information retrieval. The question is: how useful is the BM25 feature reweighting in a machine learning context? This has been previously discussed at https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg11353.html. The whole threa

Re: [scikit-learn] The culture of commit squashing

2016-06-14 Thread Joel Nothman
Sounds good to me. Thank goodness someone reads the documentation! On 14 June 2016 at 19:51, Alexandre Gramfort < alexandre.gramf...@telecom-paristech.fr> wrote: > > We could stop squashing during development, and use the new > Squash-and-Merge > > button on GitHub. > > What do you think? > > +1

Re: [scikit-learn] adding BM25 relevance function

2016-06-15 Thread Joel Nothman
; range of values. Maybe you could even say TFIDF and BM25 are the same >> > equation except BM25 has a few additional hyperparameters (b and k). >> > >> > So is the advantage that BM25 provides for large diverse corpora with >> it? >> > or is it marginal? Pe

Re: [scikit-learn] The culture of commit squashing

2016-06-18 Thread Joel Nothman
erging person to make a call whether a >>> squash is a better >>> logical unit than all the commits. >>> I would set like a soft limit at ~5 commits or something. If your PR has >>> more than 5 separate >>> big logical units, it's probably too big.

Re: [scikit-learn] Code review

2016-06-20 Thread Joel Nothman
I think perhaps that FAQ should be updated to say "nag if needed"! Apologies for that delay, @olologin. Yes, it would be good if we had a better way of organising reviewing priorities, but between github's feature set and the distributed nature of the core dev team, we land up relying on chance, o

Re: [scikit-learn] Welcome Loic Esteve (@lesteve) as a new core contributor

2016-06-23 Thread Joel Nothman
Thanks for some great work so far, Loic; I'm looking forward to more of your well-considered comments and contributions! On 23 June 2016 at 18:52, Arnaud Joly wrote: > Congratulations Loic! > > Arnaud > > > On 23 Jun 2016, at 07:57, Gael Varoquaux > wrote: > > > > Hi, > > > > I'd like to welcome

Re: [scikit-learn] Code review

2016-06-23 Thread Joel Nothman
On 23 June 2016 at 22:47, Raghav R V wrote: > > "nag if needed"! > > I always assume it to be an implicit advice ;P > I could tell. ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn

Re: [scikit-learn] How do we define a distance metric's parameter for grid search

2016-06-27 Thread Joel Nothman
Hi Hugo, Andrew's approach -- using a list of dicts to specify multiple parameter grids -- is the correct one. However, Andrew, you don't need to include parameters that will be ignored into your parameter grid. The following will be effectively the same: params = [{'kernel':['poly'],'degree':[1
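A minimal sketch of the list-of-dicts form, with each dict holding only the parameters that matter for its kernel. The dataset, the `rbf` entry, and the specific values are illustrative assumptions, not the grid from the thread.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Each dict is searched as its own grid, so 'degree' is only varied for the
# poly kernel and 'gamma' only for the rbf kernel.
param_grid = [
    {"kernel": ["poly"], "degree": [2, 3]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
candidates = search.cv_results_["params"]
```

Here `candidates` holds two poly settings and two rbf settings, rather than the full cross product of every parameter.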

Re: [scikit-learn] Spherical Kmeans #OT

2016-06-27 Thread Joel Nothman
(Since Normalizer is applied to each sample independently, the Pipeline/Transformer mechanism doesn't actually provide any benefit over sklearn.preprocessing.normalize) On 28 June 2016 at 09:20, Michael Eickenberg wrote: > You could do > > from sklearn.pipeline import make_pipeline > from sklear
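A short sketch of that point: because the scaling is per sample and stateless, calling `normalize` once gives the same rows as a fitted `Normalizer`. The random data and cluster count are assumptions for illustration. Note that ordinary `KMeans` on unit-norm rows only approximates spherical k-means, since the centroids are not re-normalised.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import Normalizer, normalize

rng = np.random.RandomState(0)
X = rng.rand(60, 5) + 0.1  # strictly positive, so no zero-norm rows

X_unit = normalize(X)  # L2 norm per row by default
same_as_transformer = np.allclose(X_unit, Normalizer().fit_transform(X))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_unit)
```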

Re: [scikit-learn] Spherical Kmeans #OT

2016-06-27 Thread Joel Nothman
eping it in the package?) > > > On Tuesday, June 28, 2016, Joel Nothman wrote: > >> (Since Normalizer is applied to each sample independently, the >> Pipeline/Transformer mechanism doesn't actually provide any benefit over >> sklearn.preprocessing.normalize)

Re: [scikit-learn] How do we define a distance metric's parameter for grid search

2016-06-28 Thread Joel Nothman
> > I tried to do this but am having errors. Seems like I need to use the > 'metric_params' parameter but I cannot get it right. Here are some of the > attempts I made: > > {'metric': ['wminkowski'], 'metric_params':[{ 'w': [0.01, 0.1, 1, 10, > 100], 'p': [1,2,3,4,5]}], 'n_neighbors': list(k_range)

Re: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text (Update)

2016-06-30 Thread Joel Nothman
I don't see what about BM25, at least as presented at https://en.wikipedia.org/wiki/Okapi_BM25, should prevent using CSR operations efficiently. Show us your code. On 1 July 2016 at 08:23, Basil Beirouti wrote: > Hello everyone, > > I have successfully created a few versions of the BM25Transform

Re: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text

2016-07-02 Thread Joel Nothman
it your Subject line so it is more specific >> than "Re: Contents of scikit-learn digest..." >> >> >> Today's Topics: >> >>1. Adding BM25 to sklearn.feature_extraction.text (Update) >> (Basil Beirouti) >>2. Re: Adding BM2

[scikit-learn] 0.18?

2016-07-04 Thread Joel Nothman
Has there been talk about a release? We've long-since merged the big changes to CV. Among the things seeming unfinished there is that where `cv` is available, `fit` should also support a `labels` parameter. That's not available in RFECV etc. There are some other nice features in the next release

Re: [scikit-learn] Using fit_intercept with sparse matrices

2016-07-04 Thread Joel Nothman
Jaidev is suggesting that fit_intercept=False makes no sense if the data is sparse. But I think that depends on your target variable. On 4 July 2016 at 22:11, Alexandre Gramfort < alexandre.gramf...@telecom-paristech.fr> wrote: > On Mon, Jul 4, 2016 at 12:13 PM, Jaidev Deshpande > wrote: > > My

Re: [scikit-learn] 0.18?

2016-07-06 Thread Joel Nothman
ole bunch of bug fixes that we need to do, though, > and maybe some other API changes / deprecations. > > I'll be more helpful in two weeks, when my book is submitted and scipy is > over. > > Best, > Andy > > > On 07/04/2016 06:49 PM, Joel Nothman wrote: > > H

Re: [scikit-learn] Bm25 pull request

2016-07-11 Thread Joel Nothman
CircleCI checks the documentation build (although apparently it ignores changes only to docstrings). Travis runs all tests on a Linux system. AppVeyor tests on Windows. On 12 July 2016 at 08:11, Basil Beirouti wrote: > > Hi, > > Joel thanks for pointing out the indentation issue. I have fixed it

Re: [scikit-learn] [Scikit-learn-general] Estimator serialisability

2016-07-14 Thread Joel Nothman
This has been discussed numerous times. I suppose no one thinks supporting pickle only is great, but a custom dict is unmaintainable. The best we've got AFAIK (and it looks like it's getting better all the time) is a tool to convert one-w

Re: [scikit-learn] Is there any official position on PEP484/mypy?

2016-08-02 Thread Joel Nothman
I certainly see the benefit, and think we would benefit also from finding test coverage holes wrt input type. But I think without ndarray/sparse matrix type support, we're not going to be able to annotate most of our code in sufficient detail. On 2 August 2016 at 23:34, Daniel Moisset wrote: >

[scikit-learn] StackOverflow Documentation

2016-08-03 Thread Joel Nothman
StackOverflow has introduced its Documentation space, where scikit-learn is a covered subject: http://stackoverflow.com/documentation/scikit-learn. The project is a little interesting, and otherwise somewhat exasperating/tiring, given the overlap with our own documentation efforts, which we would l

Re: [scikit-learn] Building Scikit Learn in Win 7 64bits

2016-08-22 Thread Joel Nothman
You could also use ! pip install D:\_devs\Python01\scikit_learn\sklearn or indeed ! pip install git+https://github.com/scikit-learn/scikit-learn/ if you don't actually want to use the directory with the source code in it. On 22 August 2016 at 19:43, Olivier Grisel wrote: > The error message

Re: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results

2016-08-27 Thread Joel Nothman
I don't think we should assume that this is the only possible reason for inconsistency. Could you give us a small snippet of data and code on which you find this inconsistency? On 27 August 2016 at 23:42, elge...@gmail.com wrote: > So there is no possibility to reach a consistency? > > 2016-08-2

Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Joel Nothman
Or just running estimator.tree_.apply(X_train) and inferring from there. On 30 August 2016 at 13:22, Nelson Liu wrote: > estimator.tree_.value gives the constant prediction of the tree at each > node. Think of it as what the tree would output if that node was a leaf. > > I don't think we have a
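A rough sketch of "inferring from there": get the leaf each training sample reaches, then derive a per-leaf majority class from the training labels. It uses the public `estimator.apply(X)` wrapper instead of calling `tree_.apply` directly, because the low-level method expects float32 input. The dataset and depth are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_iris(return_X_y=True)
estimator = DecisionTreeClassifier(max_depth=2, random_state=0)
estimator.fit(X_train, y_train)

# Index of the leaf node each training sample ends up in.
leaf_ids = estimator.apply(X_train)

# Majority training class per leaf, inferred from leaf membership alone.
majority = {
    leaf: np.bincount(y_train[leaf_ids == leaf]).argmax()
    for leaf in np.unique(leaf_ids)
}
inferred = np.array([majority[leaf] for leaf in leaf_ids])
```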

Re: [scikit-learn] Scikit-learn 0.18-rc2 release candidate!

2016-09-14 Thread Joel Nothman
You definitely did a lot of trawling through issues, and made some valuable LGTMs! On 15 September 2016 at 09:37, Andreas Mueller wrote: > > > On 09/14/2016 07:29 PM, Nelle Varoquaux wrote: > >> Thanks again for taking care of the release! >> N >> >> Always a pleasure, though it was mostly @ogri

Re: [scikit-learn] Scikit-learn 0.18-rc2 release candidate!

2016-09-14 Thread Joel Nothman
> > PS: Now that the 0.18 is (almost) out there, no excuses anymore regarding > the book ;) I hope the release date in October is fixed! :). Except that now it requires substantial revisions On 15 September 2016 at 09:43, Sebastian Raschka wrote: > Thanks for all the effort putting it toge

Re: [scikit-learn] Scikit-learn 0.18-rc2 release candidate!

2016-09-14 Thread Joel Nothman
y Release drafts it looks like the changes are already > included — I think that 0.18 was the “catch” regarding release schedule :P > > > On Sep 14, 2016, at 9:54 PM, Joel Nothman > wrote: > > > > > > > > PS: Now that the 0.18 is (almost) out there, no ex

[scikit-learn] Github project management tools

2016-09-15 Thread Joel Nothman
One of the biggest issues with scikit-learn as a project is managing its backlog of issues; another is release scheduling. Some of this cannot be fixed as long as our model of voluntary contribution (with a couple of important exceptions) does not change. However, it may be worth considering the ne

Re: [scikit-learn] Github project management tools

2016-09-15 Thread Joel Nothman
er. I'm doubtful the > new github features will help. > They certainly already have tremendously hindered me in keeping up in the > couple of hours they've been online. > > There is still no way to mark a comment as addressed, and comments are > still more or less rand

Re: [scikit-learn] Github project management tools

2016-09-19 Thread Joel Nothman
On 17 September 2016 at 01:21, Gael Varoquaux wrote: > On Fri, Sep 16, 2016 at 09:14:12AM +1000, Joel Nothman wrote: > > One downside is that there does not yet seem to be a way to search for > > PRs with a specified level of approval (while searching for "MRG+1" > so

Re: [scikit-learn] Github project management tools

2016-09-19 Thread Joel Nothman
Another bot-able tool might be pinging inactive PRs to ask if they're being worked on, and labelling "Needs contributor" if there's no reply within n days...! On 20 September 2016 at 00:05, Joel Nothman wrote: > On 17 September 2016 at 01:21, Gael Varoquaux < >

Re: [scikit-learn] behaviour of OneHotEncoder somewhat confusing

2016-09-19 Thread Joel Nothman
OneHotEncoder has issues, but I think all you want here is ohe.fit_transform(np.transpose(le.fit_transform([c for c in myguide]))) Still, this seems like it is far from the intended use of OneHotEncoder (which should not really be stacked with LabelEncoder), so it's not surprising it's tricky. On
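For comparison, a sketch that assumes a scikit-learn release from 0.20 onward, where `OneHotEncoder` accepts string categories directly; at the time of this thread it required integer input, hence the `LabelEncoder` step. The value of `myguide` is a hypothetical stand-in, assumed to be a string of single-character symbols.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

myguide = "ACGTTGCA"  # hypothetical stand-in for the poster's sequence

# One sample per character, as a single-feature column.
column = np.array(list(myguide)).reshape(-1, 1)

ohe = OneHotEncoder()
encoded = ohe.fit_transform(column).toarray()
```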

Re: [scikit-learn] Contribution project proposal

2016-09-20 Thread Joel Nothman
Have you searched the issue tracker for Stacking and the relationship between your proposal and others in the works? https://github.com/scikit-learn/scikit-learn/search?q=stacking&type=Issues&utf8=%E2%9C%93 On 21 September 2016 at 02:04, Iván Vallés Pérez wrote: > Hello, > > My name is Iván Val

Re: [scikit-learn] scikit-learn Digest, Vol 6, Issue 40

2016-09-26 Thread Joel Nothman
Hi Arafin, You appear to be talking about a situation in which your dataset is divided into subsets in which the data are highly correlated (but perhaps conditionally independent given the subject / group identifier). In Scikit-learn 0.18 these might be called "grouped cross validation" strategies
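A small sketch of a grouped split using `GroupKFold` from `sklearn.model_selection`, so that samples from one subject never appear on both sides of a split. The toy data and the subject ids in `groups` are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)  # hypothetical subject identifiers

# Each split holds out whole groups rather than individual samples.
splits = list(GroupKFold(n_splits=4).split(X, y, groups=groups))
```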

Re: [scikit-learn] always Squash and Merge?

2016-09-28 Thread Joel Nothman
That's generally my approach too. Squash and merge unless you need a record of separate authorship. Squashing helps managing cherrypicking for releases, and ensuring what's new has decent coverage. On 29 September 2016 at 00:02, Andreas Mueller wrote: > Hey. > > This is a continuation of the di

Re: [scikit-learn] always Squash and Merge?

2016-09-28 Thread Joel Nothman
On 29 September 2016 at 01:47, Nelle Varoquaux wrote: > On 28 September 2016 at 08:18, Andreas Mueller wrote: > > > > > > On 09/28/2016 10:05 AM, Gael Varoquaux wrote: > >> > >> I am not against it. When I think about why I didn't use it, it was a > >> combination of laziness and lack of trust i

Re: [scikit-learn] Github project management tools

2016-09-29 Thread Joel Nothman
I agree that being able to identify which PRs are stalled on the contributor's part, which on reviewers' part, and since when, would be great. I'm not sure we've come up with a way that'll work though. In terms of backlog, I've wondered if just getting things into a spreadsheet would help: https:

Re: [scikit-learn] Github project management tools

2016-09-29 Thread Joel Nothman
I've put a column for that status in. Note: this has largely been generated with https://gist.github.com/jnothman/8eba0834acfd633c6d83b437f6f18c49 On 30 September 2016 at 00:16, Guillaume Lemaître wrote: > What do you think about splitting MRG and MRG+1 into two different columns. > The scrolling

Re: [scikit-learn] Github project management tools

2016-09-29 Thread Joel Nothman
not in the > > list. It seems like a very worthwhile addition and the PR appears > > stalled at present. > > > > Raphael > > > > On 29 September 2016 at 15:05, Joel Nothman > wrote: > >> I agree that being able to identify which PRs are stalled

Re: [scikit-learn] ANN Scikit-learn 0.18 released

2016-09-29 Thread Joel Nothman
(this has been in drafts a few days and I'm sure there's plenty I've missed from the lists below) Well done, everyone! The size of this release - and the group of people that contributed to it - is even a bit overwhelming. Thanks for managing the release, Andy... and writing it up as a book! We'v

Re: [scikit-learn] Why does sci-kit learn's hashingvectorizer give negative values?

2016-10-01 Thread Joel Nothman
Negative values are not really there to compensate for hash collisions. They are there because that makes the hashed vector space an approximation to the full vector space under inner product. On 2 October 2016 at 00:17, Roman Yurchak wrote: > On 01/10/16 15:34, Moyi Dang wrote: > > However, I don't
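A sketch that makes the signed output visible. It assumes a recent scikit-learn release, where the switch is called `alternate_sign`; at the time of this thread the corresponding option was `non_negative`. The synthetic document and feature count are assumptions.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# One document with 200 distinct tokens: tok0, tok1, ...
docs = [" ".join("tok%d" % i for i in range(200))]

signed = HashingVectorizer(n_features=2**18, norm=None, alternate_sign=True)
unsigned = HashingVectorizer(n_features=2**18, norm=None, alternate_sign=False)

# The hash decides both the column and, in the signed variant, the sign.
X_signed = signed.transform(docs)
X_unsigned = unsigned.transform(docs)
```

With signs, the contributions of two colliding tokens cancel in expectation instead of always adding up, which is what keeps inner products approximately unbiased.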

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-04 Thread Joel Nothman
Congratulations, Raghav! Thanks for your dedication, as a student and mentor in GSoC, but at all other times too! On 4 October 2016 at 19:14, Jaques Grobler wrote: > Congrats Raghav! > > 2016-10-03 21:25 GMT+02:00 Andreas Mueller : > >> Congrats, hope to see lots more ;) >> >> >> On 10/03/2016

Re: [scikit-learn] Doubt regarding issue timeline

2016-10-12 Thread Joel Nothman
If you have a sense that the issue is urgent in some way, then give it up quickly if you've said you'd do it. Otherwise, it's okay to take a few weeks. Yes, it would be kind, if it looks like you won't be able to do it, to say you can't. Sorry there are no hard rules, but thanks for trying to cla

Re: [scikit-learn] Silhouette example - performance issue

2016-10-18 Thread Joel Nothman
And we can reduce any substantial performance issues by merging https://github.com/scikit-learn/scikit-learn/pull/7177 ... :) On 15 October 2016 at 00:55, Michael Eickenberg < michael.eickenb...@gmail.com> wrote: > Dear Anaël, > > if you wish, you could add a line to the example verifying this >

[scikit-learn] Towards 0.18.1

2016-10-19 Thread Joel Nothman
Due to a few substantial bugs in 0.18.0, we're hoping to release 0.18.1 around the end of the month. Help solving (and reviewing) the issues listed at https://github.com/scikit-learn/scikit-learn/milestone/22 is welcome. In particular, an easy documentation issue at https://github.com/scikit-learn/sci

Re: [scikit-learn] Announcement: Scikit-learn 0.18.1 released!

2016-11-13 Thread Joel Nothman
Thanks, Andy. As Andy said, this upgrade is strongly recommended. Due to a long-term bug in Numpy (and insufficient testing on our part), the new model_selection.GridSearchCV etc could not be pickled. There were also issues with the use of iterators for cross-validation splitters. But there are a

Re: [scikit-learn] suggested classification algorithm

2016-11-14 Thread Joel Nothman
http://contrib.scikit-learn.org/imbalanced-learn/ might be of interest to you. On 14 November 2016 at 22:14, Thomas Evangelidis wrote: > Greetings, > > I want to design a program that can deal with classification problems of > the same type, where the number of positive observations is small bu

Re: [scikit-learn] Development workflow proposal: merge master instead of rebasing

2016-11-17 Thread Joel Nothman
Of course it can deal with this: "Squash and merge" just takes the diff between the master and the branch merged with master, and applies it as a fresh patch on master (borrowing author and timestamp). Think `git merge --squash` more than the squash feature of `git rebase --interactive`. On 18 Nov

Re: [scikit-learn] Specifying exceptions to ParameterGrid

2016-11-23 Thread Joel Nothman
Raghav's example of [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': 'sgd'}, {'solver': 'adam'}] was not correct. Should be [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': ['sgd']}, {'solver': ['adam']}] (Note all values of dicts are lists) On 23 Nov
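The corrected grid can be expanded with `ParameterGrid` to check which candidates it produces. This is only a sketch of the expansion; no estimator is fitted.

```python
from sklearn.model_selection import ParameterGrid

# Every value is a list, even when there is a single option.
grid = ParameterGrid([
    {"learning_rate": ["constant", "invscaling", "adaptive"], "solver": ["sgd"]},
    {"solver": ["adam"]},
])
candidates = list(grid)
```

The result is three `sgd` candidates, one per `learning_rate`, plus a single `adam` candidate that carries no `learning_rate` entry.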

Re: [scikit-learn] How to not recalculate transformer in a Pipeline?

2016-11-28 Thread Joel Nothman
A few brief points of history: - We have had PRs #3951 and #2086 that build memoising into Pipeline in one way or another. - Andy and I have previously discussed altern

Re: [scikit-learn] Problem with nested cross-validation example?

2016-11-28 Thread Joel Nothman
Briefly: clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv) nested_score = cross_val_score
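A self-contained sketch of the nested pattern those two statements belong to. The names `svr`, `p_grid` and `inner_cv` follow the snippet; the dataset, the grid values, `outer_cv`, and the assumption that `svr` is an `SVC` classifier are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svr = SVC(kernel="rbf")
p_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: hyperparameter search, rerun on each outer training fold.
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)

# Outer loop: scores the whole "search, then refit" procedure on data the
# search never saw.
nested_score = cross_val_score(clf, X, y, cv=outer_cv)
```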

Re: [scikit-learn] Problem with nested cross-validation example?

2016-11-28 Thread Joel Nothman
If that clarifies, please offer changes to the example (as a pull request) that make this clearer. On 29 November 2016 at 11:06, Joel Nothman wrote: > Briefly: > > clf = GridSearchCV > <http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.Gr

Re: [scikit-learn] How to not recalculate transformer in a Pipeline?

2016-11-29 Thread Joel Nothman
But note that the issue of model memoising isn't limited to pipeline. On 29 November 2016 at 18:11, Gael Varoquaux wrote: > On Tue, Nov 29, 2016 at 10:13:00AM +1100, Joel Nothman wrote: > >- We have had PRs #3951 > ><https://github.com/scikit-learn/scikit-lea

Re: [scikit-learn] Problem with nested cross-validation example?

2016-11-29 Thread Joel Nothman
be worth adding an explanatory >> figure like this to the docs to clarify? >> >> > On Nov 28, 2016, at 7:07 PM, Joel Nothman >> wrote: >> > >> > If that clarifies, please offer changes to the example (as a pull >> request) that make this

Re: [scikit-learn] Problem with nested cross-validation example?

2016-11-29 Thread Joel Nothman
ent job at obscuring this. I'll try and add some clarification in as >> comments later today. >> >> Cheers, >> >> d >> >> >> On 29/11/16 00:07, Joel Nothman wrote: >> >> If that clarifies, please offer changes to

Re: [scikit-learn] Problem with nested cross-validation example?

2016-11-29 Thread Joel Nothman
nt me to add comments that highlight this > issue? > > > On 29/11/16 10:48, Joel Nothman wrote: > > Wait an hour for the docs to build and you won't get artifact not found :) > > If you'd looked at the PR diff, you'd see I've modified the description to

Re: [scikit-learn] Bugs in Tree.py

2016-11-29 Thread Joel Nothman
"percentages" should be "fractions" or "proportions". On 30 November 2016 at 05:44, Nelson Liu wrote: > Hi, > I think this is working as the docs say; 1 is an integer and is thus > treated as a raw number of samples. If you wanted a percentage value of > 100%, you'd have to pass in the float 1.0

Re: [scikit-learn] Github project management tools

2016-12-05 Thread Joel Nothman
With apologies for starting the thread and then disappearing for a while (life got in the way, and when I came back I decided the issue backlog itself was more pressing): Of late, I mostly operate on a last-in-first-out basis, so I'm highly influenced by recent activity. This minimises communicati

Re: [scikit-learn] Github project management tools

2016-12-07 Thread Joel Nothman
And yet GitHub just rolled out a new "reviewers" field for assigning these things... On 7 December 2016 at 03:26, Raghav R V wrote: > +1 for self assigning PRs by reviewers... > > On Tue, Dec 6, 2016 at 4:19 PM, Andy wrote: > >> Thanks for your thoughts. >> I'm working in a similar mode, though

[scikit-learn] Bookmarklet to view documentation on CircleCI

2016-12-21 Thread Joel Nothman
At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've created a bookmarklet which, when viewing a pull request page for which the CircleCI build has finished, will identify the circle build number and open a new tab with the changed documentation files corresponding to that PR.

Re: [scikit-learn] Bookmarklet to view documentation on CircleCI

2016-12-21 Thread Joel Nothman
I hope it's useful to someone else. On 21 December 2016 at 21:03, Joel Nothman wrote: > At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've > created a bookmarklet which, when viewing a pull request page for which the > CircleCI build has finished, will

Re: [scikit-learn] Bookmarklet to view documentation on CircleCI

2016-12-21 Thread Joel Nothman
it to > the github interface. > > Gaël > > On Wed, Dec 21, 2016 at 09:03:59PM +1100, Joel Nothman wrote: > > I hope it's useful to someone else. > > > On 21 December 2016 at 21:03, Joel Nothman > wrote: > > > At https://gist.github.com/jnothman/bf76d0

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-26 Thread Joel Nothman
Hi Debu, Your post is terminologically confusing, so I'm not sure I've understood your problem. Where is the "different sample" used for scoring coming from? Is it possible it is more related to the training data than the test sample? Joel On 27 December 2016 at 05:28, Debabrata Ghosh wrote: >

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-27 Thread Joel Nothman
Your model is overfit to the training data. Not to say that it's necessarily possible to get a better fit. The default settings for trees lean towards a tight fit, so you might modify their parameters to increase regularisation. Still, you should not expect that evaluating a model's performance on

Re: [scikit-learn] KNeighborsClassifier and metric='precomputed'

2017-01-02 Thread Joel Nothman
n_indexed means the number of samples in the X passed to fit. It needs to be able to compare each prediction sample with each training sample. On 3 January 2017 at 07:44, Pedro Pazzini wrote: > Hi all! > > I'm trying to use a KNeighborsClassifier with precomputed metric. In it's > predict method
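A sketch of the shapes involved, assuming Euclidean distances and random toy data: `fit` receives a square train-by-train matrix, and `predict` receives one row per query sample with one column per training sample.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(20, 3)
y_train = np.arange(20) % 2
X_query = rng.rand(5, 3)

knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed")

D_train = pairwise_distances(X_train)           # shape (20, 20)
knn.fit(D_train, y_train)

# Distances from each query sample to each *training* sample:
# shape (n_queries, n_indexed) = (5, 20).
D_query = pairwise_distances(X_query, X_train)
pred = knn.predict(D_query)
```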

Re: [scikit-learn] modifying CV score

2017-01-04 Thread Joel Nothman
Well, it returns the equivalent of lambda estimator, X, y: estimator.score(X, y) On 5 January 2017 at 08:47, Jonathan Taylor wrote: > (Think this is right reply to from a digest... If not, apologies) > > Thanks for the pointers. From what I read on the API, I gather that for an > estimator wi
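A quick check of that equivalence, assuming a classifier and dataset chosen only for illustration: passing a callable that forwards to `estimator.score` gives the same cross-validation scores as leaving `scoring` unset. The name `passthrough_scorer` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)


def passthrough_scorer(estimator, X, y):
    # Same call signature as any scorer: (estimator, X, y) -> float.
    return estimator.score(X, y)


default_scores = cross_val_score(est, X, y, cv=3)
explicit_scores = cross_val_score(est, X, y, cv=3, scoring=passthrough_scorer)
```

A modified CV score can therefore be obtained by writing a callable with this signature and passing it as `scoring`.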

Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-07 Thread Joel Nothman
On 8 January 2017 at 08:36, Thomas Evangelidis wrote: > > > On 7 January 2017 at 21:20, Sebastian Raschka > wrote: > >> Hi, Thomas, >> sorry, I overread the regression part … >> This would be a bit trickier, I am not sure what a good strategy for >> averaging regression outputs would be. However

Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-07 Thread Joel Nothman
* > There is no problem, in general, with overfitting, as long as your > evaluation of an estimator's performance isn't biased towards the training > set. We've not talked about evaluation.

Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-08 Thread Joel Nothman
Btw, I may have been unclear in the discussion of overfitting. For *training* the meta-estimator in stacking, it's standard to do something like cross_val_predict on your training set to produce its input features. On 8 January 2017 at 22:42, Thomas Evangelidis wrote: > Sebastian and Jacob, > >
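A minimal stacking sketch along those lines. `Ridge` and a small decision tree stand in for the `MLPRegressor` models of the thread to keep it fast, and the linear meta-estimator, the synthetic data, and the helper `stacked_predict` are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=3, random_state=0)]

# Out-of-fold predictions: every row is predicted by a model that never saw
# it, so the meta-estimator is not trained on overfit base outputs.
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models]
)
meta_model = LinearRegression().fit(meta_features, y)

# Refit the base models on all of the training data for prediction time.
for model in base_models:
    model.fit(X, y)


def stacked_predict(X_new):
    features = np.column_stack([model.predict(X_new) for model in base_models])
    return meta_model.predict(features)
```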

Re: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release

2017-01-09 Thread Joel Nothman
In terms of the bug fixes listed in the change-log, most seem non-urgent. I would consider pulling across #7954, #8006, #8087, #7872, #7983. But I also wonder whether we'd be better off sprinting towards a small 0.19 release. On 9 January 2017 at 20:48, Olivier Grisel wrote: > Hi all, > > I thin

Re: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release

2017-01-11 Thread Joel Nothman
When the two versions deprecation policy was instituted, releases were much more frequent... Is that enough of an excuse? On 12 January 2017 at 03:43, Andreas Mueller wrote: > > > On 01/09/2017 10:15 AM, Gael Varoquaux wrote: > >> instead of setting up a roadmap I would rather just identify bugs

Re: [scikit-learn] Pipeline conventions for wrappers

2017-01-21 Thread Joel Nothman
I think you'll need to be more specific. What do you want a pipeline to do for you? On 21 January 2017 at 01:19, Aaditya Jamuar wrote: > Hi Guys, > > I am currently working on gensim (https://github.com/RaRe- > Technologies/gensim) , writing wrappers for Scikit-learn for easy > integration of LD

Re: [scikit-learn] Identify spectra with "marker"

2017-01-21 Thread Joel Nothman
Wrong mailing list? On 21 January 2017 at 02:52, Sebastian Illner < sebastian.ill...@imtek.uni-freiburg.de> wrote: > Hi guys, > I'm new to NIR-measurement as well as chemometrics. My current project > involves the recognition of determined spectra (of a reference system) among > others. > The mate

Re: [scikit-learn] top N accuracy classification metric

2017-01-21 Thread Joel Nothman
There are metrics with that kind of input in sklearn.metrics.ranking. I don't have the time to look them up now, but there have been proposals and PRs for similar ranking metrics. Please search the issue tracker for related issues. Thanks, Joel On 21 January 2017 at 06:16, Johnson, Jeremiah wrote

Re: [scikit-learn] Need Corresponding indices array of values in each split of a DesicisionTreeClassifier

2017-02-07 Thread Joel Nothman
I don't think putting that array of indices in a visualisation is a great idea! If you use my_tree.apply(X) you will be able to determine which leaf each instance in X lands up at, and potentially trace up the tree from there. On 8 February 2017 at 01:26, Nixon Raj wrote: > > For Example, In th
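A sketch of tracing up the tree from each leaf. The fitted tree stores only child pointers, so a parent lookup is built first from `children_left` and `children_right`. The dataset, the depth, and the helper `path_to_root` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
my_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

leaf_ids = my_tree.apply(X)  # leaf reached by each instance in X

t = my_tree.tree_
parent = {}
for node in range(t.node_count):
    for child in (t.children_left[node], t.children_right[node]):
        if child != -1:  # -1 means "no child", i.e. the node is a leaf
            parent[int(child)] = node


def path_to_root(node):
    """Return the node ids from a leaf up to the root (node 0)."""
    path = [int(node)]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


paths = [path_to_root(leaf) for leaf in leaf_ids]
```

The instances passing through any internal node are then the rows of `X` whose path contains that node id.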

Re: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release

2017-02-07 Thread Joel Nothman
On 12 January 2017 at 08:51, Gael Varoquaux wrote: > On Thu, Jan 12, 2017 at 08:41:51AM +1100, Joel Nothman wrote: > > When the two versions deprecation policy was instituted, releases were > much > > more frequent... Is that enough of an excuse? > > I'd rather say

Re: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release

2017-02-08 Thread Joel Nothman
e next release? > > Andrew > > On Jan 12, 2017 00:53, "Gael Varoquaux" > wrote: > > On Thu, Jan 12, 2017 at 08:41:51AM +1100, Joel Nothman wrote: > > When the two versions deprecation policy was instituted, releases were > much > > more frequent... Is that

Re: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release

2017-02-08 Thread Joel Nothman
See also http://scikit-learn.org/stable/modules/classes.html#recently-deprecated On 9 February 2017 at 14:30, Joel Nothman wrote: > Not sure that this quite gives you a number, but: > > > $ git checkout 0.18.1 > $ git grep -pwB1 0.19 sklearn | grep -ve ^- -e .csv: -e /tests/ >

Re: [scikit-learn] GSOC call for mentors

2017-02-19 Thread Joel Nothman
I am sure there are many people disappointed by the idea that we may not run with GSoC this year. On the one hand, we could – as Gaël has suggested – really benefit from having more people involved in the maintenance of scikit-learn, and GSoC provides a potential pathway for newcomers. On the other

Re: [scikit-learn] GSoC 2017

2017-02-27 Thread Joel Nothman
Hi Pradeep, We would usually only accept candidates who have shown their proficiency and understanding of our package and processes by making some contributions prior to this stage. You are certainly welcome to aim for GSoC 2018 by beginning to develop your familiarity and rapport now. Cheers, Joel

Re: [scikit-learn] Clustering 4 dimensional data

2017-02-27 Thread Joel Nothman
What do your four dimensions mean? Can you reshape your data such that it can be seen as a collection of 1d vectors drawn independently from some distribution? On 28 February 2017 at 14:43, Rohan Koodli wrote: > I'm having trouble understanding how to cluster multidimensional data. > Specificall

Re: [scikit-learn] best way to scale on the random forest for text w bag of words ...

2017-03-15 Thread Joel Nothman
Trees are not a traditional choice for bag of words models, but you should make sure you are at least using the parameters of the random forest to limit the size (depth, branching) of the trees. On 16 March 2017 at 12:20, Sasha Kacanski wrote: > Hi, > As soon as number of trees and features goes
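A small sketch of capping tree size on a bag-of-words matrix. The toy corpus and the specific limits (`max_depth`, `min_samples_leaf`, `max_features`) are illustrative assumptions rather than recommended values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "good movie", "great film", "bad movie", "awful film",
    "good plot great cast", "bad plot awful cast",
]
labels = [1, 1, 0, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)  # sparse bag-of-words counts

forest = RandomForestClassifier(
    n_estimators=20,
    max_depth=3,          # cap on how deep each tree may grow
    min_samples_leaf=2,   # no leaves holding a single document
    max_features="sqrt",  # features considered at each split
    random_state=0,
).fit(X, labels)

depths = [est.tree_.max_depth for est in forest.estimators_]
```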

Re: [scikit-learn] Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-15 Thread Joel Nothman
sklearn's (and hence liblinear's) intercept is not being used here, but a feature is added in Python to represent the bias, so it's being regularised in any case. On 16 March 2017 at 14:27, Sebastian Raschka wrote: > I think the liblinear solver (default in LogisticRegression) does > regularize

Re: [scikit-learn] GridsearchCV

2017-03-15 Thread Joel Nothman
If you're using something like n_jobs=-1, that will explode memory usage in proportion to the number of cores, and particularly so if you're passing the data as a list rather than array and hence can't take advantage of memmapped data parallelism. On 16 March 2017 at 15:46, Carlton Banks wrote:

Re: [scikit-learn] Intermediate results using gridsearchCV?

2017-03-19 Thread Joel Nothman
Not sure what you mean. Have you used cv_results_? On 18 March 2017 at 08:46, Carlton Banks wrote: > Is it possible to receive the intermediate result of a > gridsearchcv? > > instead of getting the final result?
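
For reference, a minimal sketch of reading per-candidate scores from `cv_results_`. Note that this attribute is only populated once `fit` has finished, so it may or may not be what the question was after. The dataset, estimator and grid here are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3).fit(X, y)

# cv_results_ is a dict of arrays with one entry per parameter candidate.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, round(mean, 3), round(std, 3))
```

Per-fold scores are under keys such as `split0_test_score`. For progress output while the search is still running, the `verbose` parameter of `GridSearchCV` prints messages as fits complete.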

Re: [scikit-learn] Intermediate results using gridsearchCV?

2017-03-19 Thread Joel Nothman
> > On Sun, 19 Mar 2017 at 11:46 Joel Nothman wrote: > >> Not sure what you mean. Have you used cv_results_? >> >> On 18 March 2017 at 08:46, Carlton Banks wrote: >> >> Is it possible to receive the intermediate result of a >&

Re: [scikit-learn] Regarding GSoC projects and mentors

2017-03-22 Thread Joel Nothman
Hi Jeff, Given the timeframe, it would be difficult for us to have confidence in your abilities, having not seen your work and thus your understanding of scikit-learn conventions and review process. If you think applying this year is the right way to go, you should try to make contributions ASAP.

Re: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release

2017-03-25 Thread Joel Nothman
have no bandwidth to help. I will be able to help starting May 7th. > > > On 03/24/2017 05:26 PM, Raghav R V wrote: > > Hi, > > Are we still planning on an early April release for v0.19? Could we start > marking "blockers"? > > > > On Tue, Feb 21, 2017

Re: [scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis?

2017-04-18 Thread Joel Nothman
Towards debugging, perhaps add the return_distance option On 16 Apr 2017 9:19 pm, "Evaristo Caraballo via scikit-learn" < scikit-learn@python.org> wrote: > I have been asked to implement a simple knn for text similarity analysis. > I tried by using sklearn.neighbors module. > The file to be anal
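
A sketch of what inspecting distances might look like, using `kneighbors(..., return_distance=True)`. The documents, the cosine metric and the brute-force algorithm are assumptions for the example and are not taken from the original poster's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy documents, invented for this example.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock markets fell sharply today",
    "the mat was sat on by a cat",
]
X = CountVectorizer().fit_transform(docs)

nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(X)

# Query with the first document; distances come back alongside the indices.
distances, indices = nn.kneighbors(X[:1], return_distance=True)
print(indices[0], distances[0])
```

Seeing the distance values next to the returned indices makes it easier to tell whether an unexpected neighbour comes from the features or from how the result is being interpreted.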

Re: [scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis?

2017-04-20 Thread Joel Nothman
The problem is the misuse of the label encoder. See https://github.com/scikit-learn/scikit-learn/issues/8767 On 20 April 2017 at 19:58, Alex Garel wrote: > I'm not totally sure of what you're trying to do, but here are some > remarks that may help you: > > 1. in modelfit = model.fit(count_vect,

Re: [scikit-learn] What if I don't want performance measures per each outcome class?

2017-04-24 Thread Joel Nothman
"Traditional" sensitivity is defined for binary classification only. Maybe micro-average is what you're looking for, but in the multiclass case without anything more specified, you'll merely be calculating accuracy. Perhaps quantiles of the scores returned by permutation_test_score will give you
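
To make the micro-average remark concrete: in single-label multiclass problems, micro-averaged recall pools all true positives and false negatives across classes, which reduces to accuracy. The labels below are invented for the example.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 1, 2]

micro = recall_score(y_true, y_pred, average="micro")  # pooled over classes
macro = recall_score(y_true, y_pred, average="macro")  # mean of per-class recall
acc = accuracy_score(y_true, y_pred)

print(micro, macro, acc)
```

Macro-averaging gives a different single number that weights every class equally, which may be closer to a multiclass notion of sensitivity.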

Re: [scikit-learn] gridsearchCV able to handle list of input?

2017-04-30 Thread Joel Nothman
What are the shapes of train_input and train_output? On 30 April 2017 at 12:59, Carlton Banks wrote: > I am currently trying to run some gridsearchCV on a keras model which has > multiple inputs. > The inputs are stored in a list in which each entry in the list is an input > for a specific channel

Re: [scikit-learn] gridsearchCV able to handle list of input?

2017-04-30 Thread Joel Nothman
port multi-input data... > > On 30 Apr 2017, at 12:02, Joel Nothman > wrote: > > What are the shapes of train_input and train_output? > > On 30 April 2017 at 12:59, Carlton Banks wrote: > >> I am currently trying to run some gridsearchCV on a keras model which h

Re: [scikit-learn] gridsearchCV able to handle list of input?

2017-04-30 Thread Joel Nothman
The shapes are > > print len(train_input); print train_input[0].shape; print train_output.shape > 33, (100, 8, 45, 3), (100, 1, 145) > > > 100 is the batch size. > > On 30 Apr 2017 at 12:57, Joel Nothman wrote: > > Scikit-learn should accept a list as X to grid search and index

Re: [scikit-learn] gridsearchCV able to handle list of input?

2017-04-30 Thread Joel Nothman
Does each of your 33 inputs have a batch of size 100? If you reshape your data so that it all fits in one matrix, and then split it back out into its 33 components as the first transformation in a Pipeline, there should be no problem. On 1 May 2017 at 10:17, Joel Nothman wrote: > Sorry, I do
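
A rough sketch of the flatten-then-split idea, using small toy shapes. The `ToMultiInput` class here is an illustrative reconstruction with hypothetical constructor parameters (`n_inputs`, `input_shape`); it is not the code posted in the thread.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ToMultiInput(TransformerMixin, BaseEstimator):
    """Split one wide 2-D matrix back into a list of equally shaped inputs."""

    def __init__(self, n_inputs, input_shape):
        self.n_inputs = n_inputs        # hypothetical: number of input channels
        self.input_shape = input_shape  # hypothetical: per-sample shape of each

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = np.asarray(X)
        parts = np.split(X, self.n_inputs, axis=1)
        shape = (X.shape[0],) + tuple(self.input_shape)
        return [part.reshape(shape) for part in parts]

# Toy stand-in for train_input: 3 inputs, each with 10 samples of shape (2, 4, 3).
rng = np.random.RandomState(0)
train_input = [rng.rand(10, 2, 4, 3) for _ in range(3)]

# One row per sample, so cross-validation can index samples along axis 0.
flat = np.hstack([single.reshape(single.shape[0], -1) for single in train_input])

restored = ToMultiInput(n_inputs=3, input_shape=(2, 4, 3)).transform(flat)
print(flat.shape, len(restored), restored[0].shape)
```

Placed as the first step of a pipeline, a transformer like this lets the grid search slice `flat` by sample while the downstream model still receives a list of inputs.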

Re: [scikit-learn] gridsearchCV able to handle list of input?

2017-04-30 Thread Joel Nothman
lton Banks wrote: > How … batchsize could also be 1, I’ve just stored it like that. > > But how do I reshape my data to be a matrix? That's the big question. Is it > possible? > > On 1 May 2017 at 02:21, Joel Nothman wrote: > > Does each of your 33 inputs have a batch of size

Re: [scikit-learn] gridsearchCV able to handle list of input?

2017-04-30 Thread Joel Nothman
) for single in train_input]) GridSearchCV(make_pipeline(tmi, my_predictor), ...) On 1 May 2017 at 13:19, Joel Nothman wrote: > Unless I'm mistaken about what we're looking at, you could use something > like: > > class ToMultiInput(TransformerMixin, BaseEstimato
