Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Joel Nothman
This would be much clearer if you provided some code, but I think I get what you're saying. The final GridSearchCV model is trained on the full training set, so the fact that it perfectly fits that data with random forests is not altogether surprising. What you can say about the parameters is
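A minimal sketch (not the poster's code) of the point above: the refitted best estimator is trained on the whole training set, so its training-set score is optimistic, and generalisation should be judged on held-out data. Imports assume the newer model_selection module.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          {'max_depth': [3, None]}, cv=5)
    search.fit(X_train, y_train)            # refits best_estimator_ on all of X_train
    print(search.score(X_train, y_train))   # often near-perfect for a forest; not meaningful
    print(search.score(X_test, y_test))     # honest generalisation estimate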

Re: [Scikit-learn-general] Suggestions for the model selection module

2016-05-07 Thread Joel Nothman
On 7 May 2016 at 19:12, Matthias Feurer wrote: > 1. Return the fit and predict time in `grid_scores_` > This has been proposed for many years as part of an overhaul of grid_scores_. The latest attempt is currently underway at

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
'll attempt a more rigorous test later this week and report > back. Thanks! > > Juan. > > On Wed, Apr 13, 2016 at 10:21 AM, Joel Nothman <joel.noth...@gmail.com> > wrote: > >> It's hard to believe this is a software problem rather than a data >> problem. If y

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
It's hard to believe this is a software problem rather than a data problem. If your data was accidentally a duplicate of the dataset, you could certainly get 100%. On 13 April 2016 at 10:10, Juan Nunez-Iglesias wrote: > Hallelujah! I'd given up on this thread. Thanks for

Re: [Scikit-learn-general] [scikit-learn-general] Why sklearn RandomForest model take a lot of disk space after save?

2016-04-11 Thread Joel Nothman
Yes, there are no doubt more efficient ways to store forests, but it seems unlikely to be a worthwhile investment. I think this is a documentation rather than an engineering issue. We frequently get issues raised that relate to "size": runtime, memory consumption, model size on disk,

Re: [Scikit-learn-general] weighted kernel density estimation

2016-04-10 Thread Joel Nothman
I think you should submit these changes as a pull request. Thanks, Jared. On 8 April 2016 at 21:17, Jared Gabor wrote: > I recently modified the kernel density estimation routines in > sklearn/neighbors to include optional weighting of the training samples (to > make

Re: [Scikit-learn-general] Binary Classifier Evaluation Metrics

2016-03-26 Thread Joel Nothman
ifiers and I'm taking into account only classifiers > that are returning 'Yes'. So I could make multilabelled classification with > my own dataset. > > I can evaluate precision, recall and f-measure values for each classifier(for > each category) but how can I test my all dataset(all cl

Re: [Scikit-learn-general] Binary Classifier Evaluation Metrics

2016-03-24 Thread Joel Nothman
OneVsRestClassifier already implements Binary Relevance. What is unclear about our documentation on model evaluation and metrics? On 25 March 2016 at 00:13, Enise Basaran wrote: > Hi everyone, > > I want to learn binary classifier evaluation metrics please. I implemented
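A small, hedged sketch of binary relevance via OneVsRestClassifier with per-label metrics; the synthetic data and base estimator here are placeholders, not the original poster's setup.

    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.multiclass import OneVsRestClassifier

    X, Y = make_multilabel_classification(n_samples=100, n_classes=3, random_state=0)
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)   # one binary classifier per label
    print(classification_report(Y, clf.predict(X)))             # precision/recall/F1 for each label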

Re: [Scikit-learn-general] Scikit-learn standards for serializing/saving objects

2016-03-23 Thread Joel Nothman
I think all the scikit-learn devs know that the serialisation available in scikit-learn is inadequate, and recommend storing training data and model parameters. Designing a serialisation format that is robust to future changes is a huge engineering effort, and is likely to result in one of: (a) a

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
And I lied that none of the scikit-learn estimators define their own get_params. Of course the following do: VotingClassifier, Kernel (and subclasses), Pipeline and FeatureUnion On 23 March 2016 at 15:04, Joel Nothman <joel.noth...@gmail.com> wrote: > something like the following ma

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
something like the following may suffice: def get_params(self, deep=True): out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep) out['w2v_clusters'] = self.w2v_clusters return out On 23 March 2016 at 15:01, Joel Nothman <joel.noth...@gmail.com> wrote: > Hi Fre
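For readability, the suggested override laid out as a block, wrapped in a minimal stand-in class (the real WordCooccurrenceVectorizer and w2v_clusters come from the poster's code; here the attribute is assigned after construction purely for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    class WordCooccurrenceVectorizer(CountVectorizer):
        # Stand-in subclass: w2v_clusters is not in __init__'s signature, so the
        # default get_params() would not report it without this override.
        w2v_clusters = None

        def get_params(self, deep=True):
            out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
            out['w2v_clusters'] = self.w2v_clusters
            return out

    vec = WordCooccurrenceVectorizer()
    vec.w2v_clusters = 'clusters.txt'       # hypothetical value, for illustration
    assert 'w2v_clusters' in vec.get_params()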

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
Hi Fred, We use the __init__ signature to get the list of parameters that (a) can be set by grid search; (b) need to be copied to a cloned instance of the estimator (with any fitted model discarded) in constructing ensembles, cross validation, etc. While none of the scikit-learn library of

Re: [Scikit-learn-general] Feature selection != feature elimination?

2016-03-14 Thread Joel Nothman
Currently there is no automatic mechanism for eliminating the generation of features that are not selected downstream. It needs to be achieved manually. On 15 March 2016 at 08:05, Philip Tully wrote: > Hi, > > I'm trying to optimize the time it takes to make a prediction with

Re: [Scikit-learn-general] Restrictions on feature names when drawing decision tree

2016-03-13 Thread Joel Nothman
We should probably be escaping feature names internally. It's easy to forget that graphviz supports HTML-like markup. On 14 March 2016 at 08:00, Andreas Mueller wrote: > Try escaping the &. > > On 03/12/2016 02:57 PM, Raphael C wrote: > > The code snippet should have been > >

Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
86 > > (I hope I got it right this time!) > > In any case, I am not finding any literature describing this, and I am > also not proposing to add it to sickit-learn, just wanted to get some info > whether this is implemented or not. Thanks! :) > > > > > On Mar 8, 2016

Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
is is actually very similar to the F1 > score. But instead of computing the harmonic mean between “precision and > the true positive rate), we compute the harmonic mean between "precision > and true negative rate" > > > On Mar 8, 2016, at 6:40 PM, Joel Nothman <joel.noth...

Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
(Although multiloutput accuracy is reasonable to support.) On 9 March 2016 at 12:29, Joel Nothman <joel.noth...@gmail.com> wrote: > Firstly, balanced accuracy is a different thing, and yes, it should be > supported. > > Secondly, I am correct in thinking you're talkin

Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
I've not seen this metric used (references?). Am I right in thinking that in the binary case, this is identical to accuracy? If I predict all elements to be the majority class, then adding more minority classes into the problem increases my score. I'm not sure what this metric is getting at. On 8

Re: [Scikit-learn-general] Problem with parallel processing in randomSearch

2016-02-23 Thread Joel Nothman
What estimator(s) are you searching over? How big is your data? On 24 February 2016 at 06:15, Stylianos Kampakis < stylianos.kampa...@gmail.com> wrote: > Hi everyone, > > Sometimes, when I am using random search with n_jobs>1 the processing > stops. I am on a Mac. I went through some discussions

Re: [Scikit-learn-general] reproducible error : memory Error in scikit learn's dbscan

2016-02-18 Thread Joel Nothman
If not stack overflow, the appropriate venue for such questions is the scikit-learn-general mailing list. The current dbscan implementation is by default not memory efficient, constructing a full pairwise similarity matrix in the case where kd/ball-trees cannot be used (e.g. with sparse
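A hedged workaround sketch (not from the original thread, and relying on recent scikit-learn versions accepting sparse precomputed distances): precompute a sparse radius-neighborhood graph so DBSCAN does not need a dense pairwise matrix.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    X = np.random.RandomState(0).rand(1000, 10)
    eps = 0.5
    # Sparse graph of distances to neighbors within eps, computed once:
    D = NearestNeighbors(radius=eps).fit(X).radius_neighbors_graph(X, mode='distance')
    labels = DBSCAN(eps=eps, metric='precomputed').fit_predict(D)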

Re: [Scikit-learn-general] BIRCH: merge subclusters

2016-02-07 Thread Joel Nothman
It's not clear *why* you're doing this. The model will automatically recluster the subclusters after identifying them, as long as you specify either a number of clusters or a clustering model to the n_clusters parameter. Can you fit this post-processing into that "final clustering" framework? On

Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Joel Nothman
How many distinct words are in your dataset? On 27 January 2016 at 00:21, Rockenkamm, Christian < c.rockenk...@stud.uni-goettingen.de> wrote: > Hello, > > > I have a question concerning the Latent Dirichlet Allocation. The results I > get from using it are a bit confusing. > > At first I use about


Re: [Scikit-learn-general] Use of safe functions such as safe_sqr

2016-01-13 Thread Joel Nothman
safe_sqr applies when its operand may be a sparse matrix. In theory this could be true of coef_, but I don't think this is tested as often as it might be. But, in general, you should not take what is done in any particular piece of code to be indicative of best practice. There are often multiple

Re: [Scikit-learn-general] figuring out the steps needed to achieve a result

2016-01-10 Thread Joel Nothman
I think you've misunderstood this one, Sören. This sounds like it is a structured learning problem, where the steps are the "target" of the learning task, and the result is the input example. Take, for instance, the natural language processing task of dependency parsing. The "result" of some

Re: [Scikit-learn-general] Dropping Python 2.6 compatibility

2016-01-04 Thread Joel Nothman
I have many times committed code and had to fix it for Python 2.6. FWIW: features that I have had to remove include format strings with implicit arg numbers, set literals, dict comprehensions, perhaps ordered dicts / counters. We are already clandestinely using argparse in benchmark code. Most of

Re: [Scikit-learn-general] Import error for Robust scaler

2015-12-01 Thread Joel Nothman
But check that the version you are using in the appropriate Python instance is correct. For example: python -c 'import sklearn; print(sklearn.__version__)' On 2 December 2015 at 16:24, Sumedh Arani wrote: > Greetings!! > > I've used pip install --upgrade

Re: [Scikit-learn-general] "Need Review" tag

2015-12-01 Thread Joel Nothman
Labels weren't available for PRs until relatively recently. I think the status and its meaning would be clearer with such tags. On 2 December 2015 at 15:16, Andreas Mueller wrote: > Yeah that was the intention of [MRG]. Though it might be easier to > filter by tag. > No strong

Re: [Scikit-learn-general] classification metrics understanding

2015-11-28 Thread Joel Nothman
If you are treating your Logistic Regression output as binary (i.e. not using predict_proba or decision_function), could you please provide the confusion matrix? On 26 November 2015 at 05:06, Herbert Schulz wrote: > Hi, i think i have some "missunderstanding" due to the

Re: [Scikit-learn-general] Nesting of stratified crossvalidation

2015-10-28 Thread Joel Nothman
Changes to support this case have recently been merged into master, and an example is on its way: https://github.com/scikit-learn/scikit-learn/issues/5589 I think you should be able to run your code by importing GridSearchCV, cross_val_score and StratifiedShuffleSplit from the new
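A hedged sketch of nested cross-validation with the (then-new) model_selection module; the estimator and parameter grid are placeholders.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit,
                                         cross_val_score)
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    inner_cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    outer_cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=1)
    search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=inner_cv)   # inner loop tunes C
    scores = cross_val_score(search, X, y, cv=outer_cv)              # outer loop estimates performance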

Re: [Scikit-learn-general] BIRCH algorithm global step

2015-10-14 Thread Joel Nothman
Yes, simply set n_clusters=KMeans(). In fact, it's a pity we don't have an example of this feature in the examples gallery and contributions are welcome! On 14 October 2015 at 23:27, Dženan Softić wrote: > Hi, > > I would like to change the global step of BIRCH algorithm to
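A sketch based on the reply above: pass a clustering estimator as Birch's n_clusters to replace the global (final) clustering step; the data and parameter values are illustrative.

    from sklearn.cluster import Birch, KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
    brc = Birch(threshold=0.5, n_clusters=KMeans(n_clusters=5, random_state=0))
    labels = brc.fit_predict(X)   # subclusters are reclustered by the supplied KMeans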

Re: [Scikit-learn-general] How to optimize a random forest for out of sample prediction

2015-10-07 Thread Joel Nothman
RFECV will select features based on scores on a number of validation sets, as selected by its cv parameter. As opposed to that StackOverflow query, RFECV should now support RandomForest and its feature_importances_ attribute. On 7 October 2015 at 18:16, Raphael C wrote: > I
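A hedged sketch of the combination described above: RFECV ranks features by the forest's feature_importances_ and keeps the subset with the best cross-validated score. The data and forest settings are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV

    X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                               random_state=0)
    selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                     step=1, cv=5)
    selector.fit(X, y)
    print(selector.n_features_)   # number of features retained
    print(selector.support_)      # boolean mask of selected features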

Re: [Scikit-learn-general] Comparing multiple Machine Learning models by ROC

2015-10-06 Thread Joel Nothman
See http://scikit-learn.org/stable/auto_examples/plot_roc.html On 6 October 2015 at 17:56, aravind ramesh wrote: > Dear All, > > I want to compare my new svm model generated with already published model. > > I generated required features and got the prediction labels for

Re: [Scikit-learn-general] [New feature] sklearn to PMML

2015-10-01 Thread Joel Nothman
Hi Mira, I think the community is very interested in this work, but you might consider collaborating with https://github.com/alex-pirozhenko/sklearn-pmml. Its support for models is limited to trees and their ensembles, but it also includes a test harness (

Re: [Scikit-learn-general] GridSearchCV using too many cores?

2015-09-24 Thread Joel Nothman
In terms of memory: I gather joblib.parallel is meant to automatically memmap large arrays (>100MB). However, then each subprocess will extract a non-contiguous set of samples from the data for training under a cross-validation regime. Would I be right in thinking that's where the memory blowout

Re: [Scikit-learn-general] Preparing the 0.17 release

2015-09-21 Thread Joel Nothman
And anyone looking for a small contribution to make could take on https://github.com/scikit-learn/scikit-learn/issues/5281 On 22 September 2015 at 10:24, Andreas Mueller wrote: > The list is currently pretty long: > >

Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Joel Nothman
A reflective response without a clear opinion: I'll admit to rarely-if-ever using function versions, and suspect they frequently have limited utility over the estimator interface. Occasionally they even wrap the estimator interface, so they're not going to provide the efficiency advantages Gaël

Re: [Scikit-learn-general] Turning on sample weights for linear_model.LogisticRegression

2015-08-29 Thread Joel Nothman
:33 PM, Joel Nothman joel.noth...@gmail.com wrote: +1 On 28 August 2015 at 04:23, Andreas Mueller t3k...@gmail.com wrote: I think it would be fine to enable it now without support in all solvers. On 8/27/2015 11:29 AM, Valentin Stolbunov wrote: Joel, I see you've done some work in that PR

Re: [Scikit-learn-general] RFCC: duecredit citations for sklearn (and anything else you like ; ) )

2015-08-29 Thread Joel Nothman
A "Cite me with duecredit" sash on the opposite corner to "Fork me on github"? ;) On 30 August 2015 at 14:36, Mathieu Blondel math...@mblondel.org wrote: On Sun, Aug 30, 2015 at 7:27 AM, Yaroslav Halchenko s...@onerussian.com wrote: As long as installation is straightforward, I think it

Re: [Scikit-learn-general] issue with pipeline always giving same results

2015-08-27 Thread Joel Nothman
The randomisation only changes the order of the data, not the set of data points. On 27 August 2015 at 22:44, Andrew Howe ahow...@gmail.com wrote: I'm working through the tutorial, and also experimenting kind of on my own. I'm on the text analysis example, and am curious about the relative

Re: [Scikit-learn-general] Turning on sample weights for linear_model.LogisticRegression

2015-08-27 Thread Joel Nothman
them in the other two solvers via the rough steps I outlined earlier? On Wed, Aug 26, 2015 at 9:59 PM, Andy t3k...@gmail.com wrote: On 08/26/2015 09:29 PM, Joel Nothman wrote: I agree. I suspect this was an unintentional omission, in fact. Apart from which, sample_weight support

Re: [Scikit-learn-general] Turning on sample weights for linear_model.LogisticRegression

2015-08-26 Thread Joel Nothman
I agree. I suspect this was an unintentional omission, in fact. Apart from which, sample_weight support in liblinear could be merged from https://github.com/scikit-learn/scikit-learn/pull/2784 which is dormant, and merely needs some core contributors to show interest in merging it... On 27

Re: [Scikit-learn-general] Persisting models

2015-08-20 Thread Joel Nothman
I suspect supporting PMML import is a separate and low-priority project. Higher priority is support for transformers (in pipelines / feature unions), other predictors, and tests that verify the model against an existing PMML predictor. On 21 August 2015 at 01:37, Dale Smith dsm...@nexidia.com

Re: [Scikit-learn-general] Persisting models

2015-08-19 Thread Joel Nothman
Frequently the suggestion of supporting PMML or similar is raised, but it's not clear whether such models would be importable in to scikit-learn, or how to translate scikit-learn transformation pipelines into its notation without going mad, etc. Still, even a library of exporters for individual

Re: [Scikit-learn-general] Persisting models

2015-08-19 Thread Joel Nothman
See https://github.com/scikit-learn/scikit-learn/issues/1596 On 19 August 2015 at 16:35, Joel Nothman joel.noth...@gmail.com wrote: Frequently the suggestion of supporting PMML or similar is raised, but it's not clear whether such models would be importable in to scikit-learn, or how

Re: [Scikit-learn-general] positive / nonnegative least angle regression estimators

2015-08-17 Thread Joel Nothman
Please make a pull request. This looks like a small and useful change, consistent with Lasso's support of non-negativity. On 18 August 2015 at 14:30, Michael Graber michigra...@gmail.com wrote: Dear all, I extended the lars_path, Lars and LarsLasso estimators in the scikit-learn
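For reference, a tiny sketch of the existing non-negativity option in Lasso that the reply points to as precedent; the Lars-side equivalent is what the proposed patch adds (later releases expose a similar positive option on the Lars estimators, if I recall correctly).

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=100, n_features=10, random_state=0)
    coef = Lasso(alpha=0.1, positive=True).fit(X, y).coef_   # coefficients constrained to >= 0
    assert (coef >= 0).all()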

Re: [Scikit-learn-general] Gridsearch pickle error with scipy distributions

2015-08-15 Thread Joel Nothman
This is a known scipy deficiency. See https://github.com/scipy/scipy/pull/4821 and related issues. On 15 August 2015 at 05:37, Jason Sanchez jason.sanchez.m...@statefarm.com wrote: This code raises a PicklingError: from sklearn.datasets import load_boston from sklearn.pipeline import

Re: [Scikit-learn-general] DecisionTreeClassifier refusing to split

2015-08-15 Thread Joel Nothman
While it's not bad to have more people know the internals of the tree code, ideally people shouldn't *have* to. Do you have any hints for how documentation could better serve users to not land in whatever trap you did? On 15 August 2015 at 16:03, Simon Burton si...@arrowtheory.com wrote: My

Re: [Scikit-learn-general] scikit-learn Truck Factor

2015-08-12 Thread Joel Nothman
I find that list somewhat obscure, and reading your section on Code Authorship gives me some sense of why. All of those people have been very important contributors to the project, and I'd think the absence of Gaël, Andreas and Olivier alone would be very damaging, if only because of their

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-06 Thread Joel Nothman
calls during training. But that may most probably be compensated as the number of queries grow since 2**b * n_estimators is a constant time. I'll send a PR with proper refactoring. On Sun, Aug 2, 2015 at 6:41 PM, Joel Nothman joel.noth...@gmail.com wrote: Thanks, I look forward to this being

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-02 Thread Joel Nothman
on this but I think I'll need your or some other contributers' reviewing as well . I'll do this if it's possible. On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com wrote: @Maheshakya, will you be able to do work in the near future on speeding up the ascending phase instead? Or should

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Joel Nothman
the most fundamental component of LSHForest. On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote: (sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets) On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote: One

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
What makes you think this is the main bottleneck? While it is not an insignificant consumer of time, I really doubt this is what's making scikit-learn's LSH implementation severely underperform with respect to other implementations. We need to profile. In order to do that, we need some sensible

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
, and makes the searchsorted calls run in log(n / (2 ** b)) time rather than log(n). It is also much more like traditional LSH. However, it complexifies the code, as we now have to consider two strategies for descent/ascent. On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote: What makes

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
(sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets) On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote: One approach to fixing the ascending phase would ensure that _find_matching_indices is only searching over parts

Re: [Scikit-learn-general] [scikit-learn-general] Possible bug in RFECV.fit?

2015-07-22 Thread Joel Nothman
This isn't directly a problem with RFECV, it's a problem with what you provided as an argument to `scoring`. I suspect you provided a function with signature fn(y_true, y_pred) -> score, where what is required is a function fn(estimator, X, y_true) -> score. See
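A sketch of the distinction described above: wrapping a metric with make_scorer yields the required fn(estimator, X, y_true) -> score callable. The estimator and data are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, make_scorer

    X, y = make_classification(random_state=0)
    scorer = make_scorer(f1_score)            # callable(estimator, X, y_true) -> score
    rfecv = RFECV(LogisticRegression(), scoring=scorer, cv=3).fit(X, y)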

Re: [Scikit-learn-general] Speed up transformation step with multiple 1 vs rest binary text classifiers.

2015-07-02 Thread Joel Nothman
TfidfVectorizer is just CountVectorizer followed by a TfidfTransformer. The Tfidf transformation tends to be cheap relative to tokenization which is independent of what corpus you want to calculate TF.IDF over. If I understand correctly, you can perform CountVectorizer on all of your documents,
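A hedged sketch of that suggestion: run the expensive tokenization/counting once over all documents, then fit the cheap TF-IDF reweighting as many times as needed on the counts. The toy documents are placeholders.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    docs = ["the cat sat", "the dog sat", "dogs and cats"]
    counts = CountVectorizer().fit_transform(docs)       # expensive tokenization, done once
    tfidf_a = TfidfTransformer().fit_transform(counts)   # cheap reweighting
    tfidf_b = TfidfTransformer(sublinear_tf=True).fit_transform(counts)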

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Joel Nothman
oh, I missed that one from Omer Levy's debunking word2vec series. Nice! On 1 July 2015 at 23:52, Mathieu Blondel math...@mblondel.org wrote: On Wed, Jul 1, 2015 at 8:43 PM, Dale Smith dsm...@nexidia.com wrote: Apparently so; here is a python/cython implementation.

Re: [Scikit-learn-general] RandomizedSearchCV error

2015-06-25 Thread Joel Nothman
It's a problem of excessive memory consumption due to a O(# possible parameter settings) approach to sampling from discrete parameter grids without replacement. The fix was merged into master only hours ago. Please feel free to work with master, or to cherry-pick febefb0 On 25 June 2015 at

Re: [Scikit-learn-general] Passing kwargs to pipeline predict

2015-06-25 Thread Joel Nothman
:47 PM, Joel Nothman joel.noth...@gmail.com wrote: What estimators have predict with multiple args? Without support for same in cross validation routines and scorers, isn't it easier to write this functionality in custom code as you need it, leaving the predictor off the Pipeline? On 25 June

Re: [Scikit-learn-general] What do SGDClassifier weights do mathematically?

2015-06-25 Thread Joel Nothman
Across models, weights should be implemented such that duplicating samples would give identical results to corresponding integer weights. That is true here, to my understanding, if we remove the stochasticity such that all identical samples have their update occur at once. On 25 June 2015 at

Re: [Scikit-learn-general] Passing kwargs to pipeline predict

2015-06-24 Thread Joel Nothman
What estimators have predict with multiple args? Without support for same in cross validation routines and scorers, isn't it easier to write this functionality in custom code as you need it, leaving the predictor off the Pipeline? On 25 June 2015 at 06:06, Michael Kneier michael.kne...@gmail.com

Re: [Scikit-learn-general] differences between metrics.classification_report and own function

2015-06-17 Thread Joel Nothman
To me, those numbers appear identical at 2 decimal places. On 17 June 2015 at 23:04, Herbert Schulz hrbrt@gmail.com wrote: Hello everyone, I wrote a function to calculate the sensitivity, specificity, balanced accuracy and accuracy from a confusion matrix. Now I have a problem, I'm

Re: [Scikit-learn-general] differences between metrics.classification_report and own function

2015-06-17 Thread Joel Nothman
, or is the precision in this case the sensitivity? On 17 June 2015 at 15:29, Andreas Mueller t3k...@gmail.com wrote: Yeah that is the rounding of using %2f in the classification report. On 06/17/2015 09:20 AM, Joel Nothman wrote: To me, those numbers appear identical at 2 decimal places

Re: [Scikit-learn-general] Incrementally Printing GridSearch Results

2015-06-15 Thread Joel Nothman
I think it gets a bit noisier when using n_jobs != 1, as verbose is passed to joblib.Parallel. I agree that it's not a very controllable or well-documented setting. On 16 June 2015 at 13:24, Adam Goodkind a.goodk...@gmail.com wrote: Right. Thank you. I guess I was just overwhelmed by the amount

Re: [Scikit-learn-general] silhouette_score and silhouette_samples

2015-06-15 Thread Joel Nothman
See the sample_size parameter: silhouette score can be calculated on a random subset of the data, presumably for efficiency. Feel free to submit a PR improving the docstring. On 16 June 2015 at 13:54, Sebastian Raschka se.rasc...@gmail.com wrote: Hi, all, I am a little bit confused about the

Re: [Scikit-learn-general] Sample weighting in RandomizedSearchCV

2015-06-09 Thread Joel Nothman
Until sample_weight is directly supported in Pipeline, you need to prefix `sample_weight` by the step name with '__'. So for Pipeline([('a', A()), ('b', B())] use fit_params={'a__sample_weight': sample_weight, 'b__sample_weight': sample_weight} or similar. HTH On 10 June 2015 at 03:57, José
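A laid-out version of the workaround above, shown on Pipeline.fit so it is self-contained; the step names 'a'/'b', the estimators and the data are placeholders. In a search, the same '__'-prefixed mapping is supplied as fit parameters (the exact API location varies by scikit-learn version).

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = rng.rand(50, 4)
    y = np.tile([0, 1], 25)
    sample_weight = rng.rand(50)

    pipe = Pipeline([('a', StandardScaler()), ('b', SGDClassifier())])
    # '__'-prefixed fit parameters are routed to the named step:
    pipe.fit(X, y, b__sample_weight=sample_weight)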

[Scikit-learn-general] my silence

2015-05-31 Thread Joel Nothman
Just a quick note that I've been silent lately because I've been Busy With Life, but also because github was notifying an email address hosted at my previous employer, which was deactivated a fortnight ago. If there were issues that sought my particular attention, please let me know.

Re: [Scikit-learn-general] how to know which feature is informative or redundant in make_classification()?

2015-05-28 Thread Joel Nothman
As at http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html Prior to shuffling, `X` stacks a number of these primary informative features, redundant linear combinations of these, repeated duplicates of sampled features, and arbitrary noise for and
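A sketch of the point above: with shuffle=False the column order is deterministic, so you can tell which features are informative, redundant, repeated or noise.

    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=100, n_features=10, n_informative=3,
                               n_redundant=2, n_repeated=1, shuffle=False,
                               random_state=0)
    # Columns 0-2: informative; 3-4: redundant combinations; 5: repeated; 6-9: noise.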

Re: [Scikit-learn-general] how to know which feature is informative or redundant in make_classification()?

2015-05-28 Thread Joel Nothman
noise in flip_y) across classes with respect to the informative features. On 28 May 2015 at 19:57, Joel Nothman joel.noth...@gmail.com wrote: As at http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html Prior to shuffling, `X` stacks a number

Re: [Scikit-learn-general] Grid search error

2015-05-17 Thread Joel Nothman
Sorry, I meant https://github.com/scikit-learn/scikit-learn/issues/4301 On 18 May 2015 at 12:10, Joel Nothman joel.noth...@gmail.com wrote: Sorry, grid search (and similar) does not support clusterers. This probably should be formally tracked as an issue. https://github.com/scikit-learn

Re: [Scikit-learn-general] Grid search error

2015-05-17 Thread Joel Nothman
Sorry, grid search (and similar) does not support clusterers. This probably should be formally tracked as an issue. https://github.com/scikit-learn/scikit-learn/issues/4040 might be helpful to you. On 18 May 2015 at 11:56, Jitesh Khandelwal jk231...@gmail.com wrote: I have recently been using

Re: [Scikit-learn-general] Divisive Hierarchical Clustering

2015-05-17 Thread Joel Nothman
Hi Sam, I think this could be interesting. You could allow for learning parameters on each sub-cluster by accepting a transformer as a parameter, then using sample = sklearn.base.clone(transformer).fit_transform(sample). I suspect bisecting k-means is notable enough and different enough for

Re: [Scikit-learn-general] Why don't we support Neural Network Algorithms?

2015-05-06 Thread Joel Nothman
What Sebastian and Ronnie said. Plus: there are multiple off-the-shelf neural net pull requests in the process of review, notably those by Issam Laradji for GSoC 2014. Extreme Learning Machines and Multilayer Perceptrons should be merged Real Soon Now. On 7 May 2015 at 14:58, Ronnie Ghose

Re: [Scikit-learn-general] clustering on unordered set

2015-04-30 Thread Joel Nothman
The algorithm isn't the issue so much as defining a metric that measures the distance or affinity between items, or else finding a way to reduce your data to a more standard metric space. I have for instance clustered sets of objects by first minhashing them (an approximate dim reduction for

Re: [Scikit-learn-general] Topic extraction

2015-04-29 Thread Joel Nothman
Yes, this is not a probabilistic method. On 29 April 2015 at 14:56, C K Kashyap ckkash...@gmail.com wrote: Works like a charm. Just noticed though that the max value is sometimes more than 1.0 is that okay? Regards, Kashyap On Wed, Apr 29, 2015 at 10:12 AM, Joel Nothman joel.noth

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
elaborate on the code please? What would be dataset.target_names and dataset.target in my case - http://lpaste.net/131649 Regards, Kashyap On Wed, Apr 29, 2015 at 3:08 AM, Joel Nothman joel.noth...@gmail.com wrote: This shows the newsgroup name and highest scoring topic for each doc. zip

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
On Wed, Apr 29, 2015 at 9:45 AM, Joel Nothman joel.noth...@gmail.com wrote: Highest ranking topic for each doc is just np.argmax(nmf.transform(tfidf), axis=1). This is because nmf.transform http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
This shows the newsgroup name and highest scoring topic for each doc. zip(np.take(dataset.target_names, dataset.target), np.argmax(nmf.transform(tfidf), axis=1)) I think something based on this should be added to the example. On 29 April 2015 at 07:01, Andreas Mueller t3k...@gmail.com wrote:
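A hedged, self-contained sketch of that snippet, built around the 20 newsgroups data the topic-extraction example uses (vectorizer and NMF settings here are illustrative; fetching the data requires a download):

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    dataset = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
    tfidf = TfidfVectorizer(max_features=1000, stop_words='english').fit_transform(dataset.data)
    nmf = NMF(n_components=10, random_state=0).fit(tfidf)
    doc_topics = np.argmax(nmf.transform(tfidf), axis=1)   # highest-scoring topic per document
    pairs = list(zip(np.take(dataset.target_names, dataset.target), doc_topics))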

Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Joel Nothman
I assume you have checked that combine_train_test_dataset produces data of the correct dimensions in both X and y. I would be very surprised if the problem were not in PAA, so check it again: make sure that you test that PAA().fit(X1).transform(X2) gives the transformation of X2. The error seems

Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Joel Nothman
I suspect this method is underreported by any particular name, as it's a straightforward greedy search. It is also very close to what I think many researchers do in system development or report in system analysis, albeit with more automation. In the case of KNN, I would think metric learning

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-19 Thread Joel Nothman
On 17 April 2015 at 13:52, Daniel Vainsencher daniel.vainsenc...@gmail.com wrote: On 04/16/2015 05:49 PM, Joel Nothman wrote: I more or less agree. Certainly we only need to do one searchsorted per query per tree, and then do linear scans. There is a question of how close we stay

Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-16 Thread Joel Nothman
Although I note that I've got LaTeX compilation errors, so I'm not sure how Andy compiles this. On 16 April 2015 at 20:25, Joel Nothman joel.noth...@gmail.com wrote: I've proposed a better chapter ordering at https://github.com/scikit-learn/scikit-learn/pull/4602... On 16 April 2015 at 03:48

Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-16 Thread Joel Nothman
I've proposed a better chapter ordering at https://github.com/scikit-learn/scikit-learn/pull/4602... On 16 April 2015 at 03:48, Andreas Mueller t3k...@gmail.com wrote: Hi. Yes, run make latexpdf in the doc folder. Best, Andy On 04/15/2015 01:11 PM, Tim wrote: Thanks, Andy! How do

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Joel Nothman
for the n_candidates with the lowest hamming distances. This should achieve a pretty good sweet spot of performance, with just a bit of Cython. Daniel On 04/16/2015 12:18 AM, Joel Nothman wrote: Once we're dealing with large enough index and n_candidates, most time is spent in searchsorted

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
I agree this is disappointing, and we need to work on making LSHForest faster. Portions should probably be coded in Cython, for instance, as the current implementation is a bit circuitous in order to work in numpy. PRs are welcome. LSHForest could use parallelism to be faster, but so can (and

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604... On 16 April 2015 at 14:24, Joel Nothman joel.noth...@gmail.com wrote: Except apparently that commit breaks the code... Maybe I've misunderstood something :( On 16 April 2015 at 14:18, Joel Nothman joel.noth

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
. Try d500, n_points10, I don't remember the switchover point. The documentation should make this clear, but unfortunately I don't see that it does. On Apr 15, 2015 7:08 PM, Joel Nothman joel.noth...@gmail.com wrote: I agree this is disappointing, and we need to work on making LSHForest

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Except apparently that commit breaks the code... Maybe I've misunderstood something :( On 16 April 2015 at 14:18, Joel Nothman joel.noth...@gmail.com wrote: ball tree is not vectorized in the sense of SIMD, but there is Python/numpy overhead in LSHForest that is not present in ball tree. I

Re: [Scikit-learn-general] reconstruct image after preprocessing

2015-04-14 Thread Joel Nothman
Use preprocessing.StandardScaler()'s transform and inverse_transform methods. HTH! On 14 April 2015 at 19:06, Souad Chaabouni chaabouni_so...@yahoo.fr wrote: Hello, I'm a beginner. I have an image on which I did some preprocessing with sklearn: img_scaled = preprocessing.scale(img). My question
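A sketch of the suggestion: unlike the stateless preprocessing.scale function, StandardScaler keeps the per-column statistics needed to undo the scaling. The random image is a placeholder.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    img = np.random.RandomState(0).rand(32, 32)
    scaler = StandardScaler()
    img_scaled = scaler.fit_transform(img)            # scales each column to zero mean, unit variance
    img_restored = scaler.inverse_transform(img_scaled)
    assert np.allclose(img, img_restored)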

Re: [Scikit-learn-general] Help: Getting ValueError @precision_recall_fscore_support

2015-04-13 Thread Joel Nothman
Ignoring the class label 'O' from evaluation will be possible with #4287 https://github.com/scikit-learn/scikit-learn/pull/4287 merged On 14 April 2015 at 11:43, namma igloo nammaig...@outlook.com wrote: I was removing the class 'O' (other) from labels as given in the python-crfsuite example

Re: [Scikit-learn-general] Micro and Macro F-measure for text classification

2015-04-11 Thread Joel Nothman
Or report macro and micro in classification_report. Micro is equivalent to accuracy for multiclass without #4287 https://github.com/scikit-learn/scikit-learn/pull/4287. On 10 April 2015 at 01:00, Andreas Mueller t3k...@gmail.com wrote: Hi Jack. You mean in the classification report? That
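A small sketch of the micro/macro distinction mentioned above, with illustrative labels; for multiclass problems where all labels are included, micro-averaged F1 equals accuracy.

    from sklearn.metrics import accuracy_score, f1_score

    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 2, 2, 2, 1, 1]
    print(f1_score(y_true, y_pred, average='macro'))
    print(f1_score(y_true, y_pred, average='micro'))   # equals accuracy here
    print(accuracy_score(y_true, y_pred))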

Re: [Scikit-learn-general] Artificial Neural Networks

2015-04-07 Thread Joel Nothman
Issam Laradji implemented a multilayer perceptron and extreme learning machines for last year's GSoC. Both are awaiting final reviews before being merged. They should be functional and can be found in the Issue Tracker. On 7 April 2015 at 21:09, Vlad Ionescu ionescu.vl...@gmail.com wrote:

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-24 Thread Joel Nothman
On 25 March 2015 at 00:01, Gael Varoquaux gael.varoqu...@normalesup.org wrote: To make this more concrete, the MetricLearner().metric_ estimator would require specialised set_params or clone behaviour, I assume. I.e. it involves hacking API fundamentals. It's more a general principle of

Re: [Scikit-learn-general] GSoC 2015 Proposal: Multiple Metric Learning

2015-03-24 Thread Joel Nothman
- https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements . Possible mentors : Andreas Mueller (amueller) and Joel Nothman (jnothman) Any feedback/suggestions/additions/deletions would be awesome

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-24 Thread Joel Nothman
On 24 March 2015 at 23:56, Gael Varoquaux gael.varoqu...@normalesup.org wrote: So I just thought: what if metric learners will have an attribute `metric` Before adding features and API entries, I'd really like to focus on having a 1.0 release, with a fixed API that really solves the

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-24 Thread Joel Nothman
Hi Artem, I've taken a look at your proposal. I think this is an interesting contribution, but I suspect your proposal is far too ambitious: - The proposal doesn't well account for the need to receive reviews and alter the PR in accordance. This is especially so because you are

Re: [Scikit-learn-general] Student looking to contribute to scikit-learn

2015-03-21 Thread Joel Nothman
GSOC isn't the best way to get started. We recommend you get to know the code structure, API and development process by starting with issues labelled "Easy" (https://github.com/scikit-learn/scikit-learn/labels/Easy). In general, look through the Issue Tracker and find something of interest, or which has

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-21 Thread Joel Nothman
Are there any objections on Joel's variant of y? It serves my needs, but is quite different from what one can usually find in scikit-learn. FWIW It'll require some changes to cross-validation routines. On 22 March 2015 at 11:54, Artem barmaley@gmail.com wrote: Are there any objections

Re: [Scikit-learn-general] GSoC2015 Hyperparameter Optimization topic

2015-03-19 Thread Joel Nothman
This is off-topic, but I should note that there is a patch at https://github.com/scikit-learn/scikit-learn/pull/2784 awaiting review for a while now... On 20 March 2015 at 08:16, Charles Martin charlesmarti...@gmail.com wrote: I would like to propose extending the linearSVC package by

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-18 Thread Joel Nothman
I don't know a lot about metric learning either, but it sounded like from your initial statement that fit(X, D) where D is the target/known distance between each point in X might be appropriate. I have no idea if this is how it is formulated in the literature (your mention of asymmetric metrics
