Re: [Scikit-learn-general] ELM and Deep Learning

2015-03-18 Thread Joel Nothman
There are other more specialised projects that facilitate modular neural networks. The idea in scikit-learn is to provide useful out-of-the-box components for well-established solutions to certain types of tasks that fit a simple interface. This often means limiting their flexible use from the

Re: [Scikit-learn-general] Multilayer perceptron module

2015-03-15 Thread Joel Nothman
I think #3306 (Extreme Learning Machines) needs review, and after that's merged, focus should return to the MLP PR. I've not been following either of those PRs extremely closely, but I gather that both are quite mature, though neither is a small item to review. On 16 March 2015 at 07:53, Michael Eickenberg

Re: [Scikit-learn-general] [ANN] scikit-learn 0.16b1 is out!

2015-03-09 Thread Joel Nothman
Congratulations! This has been a long time coming, and if only for the swathe of features it'll be great to see the documentation improvements appearing on stable soon! My thoughts on development priorities for the next release (and ideally to focus on before GSoC eats everyone's brains): We

Re: [Scikit-learn-general] Problem to reproduce the analysis of events co-attended by woman via manifold MDS

2015-03-03 Thread Joel Nothman
I think DSW_jaccard_matrix is a matrix of similarity (which is what Jaccard usually means), not of dissimilarity. Try negating it before MDS. On 3 March 2015 at 20:07, Jean-Baptiste Pressac jean-baptiste.pres...@univ-brest.fr wrote: Hello, I tried to reproduce the analysis of events
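
A minimal sketch of that conversion, assuming DSW_jaccard_matrix is the square similarity matrix from the original post:

    import numpy as np
    from sklearn.manifold import MDS

    # MDS with dissimilarity='precomputed' expects distances, not similarities;
    # for Jaccard, 1 - similarity is the usual dissimilarity (negating the matrix,
    # as suggested above, is another quick check).
    dissim = 1 - DSW_jaccard_matrix
    coords = MDS(n_components=2, dissimilarity='precomputed',
                 random_state=0).fit_transform(dissim)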

Re: [Scikit-learn-general] how does sklearn apply pipelines

2015-02-26 Thread Joel Nothman
And when some function f (such as predict) other than fit is called on the pipeline, it invokes transform on all the steps but the last, and on the last step calls f with the transformed data. On 27 February 2015 at 13:31, Sebastian Raschka se.rasc...@gmail.com wrote: It's actually quite
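
A rough sketch of that behaviour (not the actual implementation), where steps is a list of (name, estimator) pairs like Pipeline.steps:

    def pipeline_call(steps, method_name, X):
        # transform through every step except the last...
        Xt = X
        for name, transformer in steps[:-1]:
            Xt = transformer.transform(Xt)
        # ...then call the requested method (e.g. 'predict') on the final estimator
        return getattr(steps[-1][1], method_name)(Xt)

    # pipeline_call(pipe.steps, 'predict', X) behaves like pipe.predict(X)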

Re: [Scikit-learn-general] CV scores vs scores on a manual split

2015-02-21 Thread Joel Nothman
One way to encourage people to use the scorer API more would be to add a more direct interface like: def score(scoring, estimator, X, y=None, **kwargs): return get_scorer(scoring)(estimator, X, y, **kwargs) On 20 February 2015 at 20:58, Mathieu Blondel math...@mblondel.org wrote: On
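
The same helper, formatted, with a hedged usage line (get_scorer is importable from sklearn.metrics; it lives in sklearn.metrics.scorer in older releases):

    from sklearn.metrics import get_scorer

    def score(scoring, estimator, X, y=None, **kwargs):
        return get_scorer(scoring)(estimator, X, y, **kwargs)

    # e.g. score('roc_auc', fitted_clf, X_test, y_test), where fitted_clf is any
    # fitted classifier exposing decision_function or predict_proba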

Re: [Scikit-learn-general] set_params and get_params (and 1.0 API)

2015-02-19 Thread Joel Nothman
Almost as great an evil, but a possible solution, is to allow those step estimators to be retrievable by name through Pipeline.__getitem__... Only less evil than __getattr__ because the name conflict issues go away. On 20 February 2015 at 07:58, Gael Varoquaux gael.varoqu...@normalesup.org wrote:

Re: [Scikit-learn-general] set_params and get_params (and 1.0 API)

2015-02-19 Thread Joel Nothman
It only works because Pipeline overloads get_params. On 20 February 2015 at 09:17, Andy t3k...@gmail.com wrote: On 02/19/2015 12:58 PM, Gael Varoquaux wrote: The question is: can we do this without breaking our pipeline delegation mechanism that we use to set parameters during

Re: [Scikit-learn-general] custom scorer with parameters

2015-02-19 Thread Joel Nothman
Ties within a confidence interval happen in practice and it could be nice to have grid search use a model complexity criterion to select between insignificantly different top performers. But I think this is separate to the notion of scorer. It relies on custom logic beyond argmax to select the

Re: [Scikit-learn-general] set_params and get_params (and 1.0 API)

2015-02-18 Thread Joel Nothman
The overloading of get_params and set_params becomes more complex in #1769. I have also found cases (of helper meta-estimators / wrappers) that require the overloading of clone behaviour, though this is not yet supported. On 18 February 2015 at 18:14, Gael Varoquaux gael.varoqu...@normalesup.org

Re: [Scikit-learn-general] Feature selection and cross validation; and identifying chosen features

2015-02-11 Thread Joel Nothman
You could use grid2.best_estimator_.named_steps['feature_selection'].get_support(), or .transform(feature_names) instead of .get_support(). Note for instance that if you have a pipeline of multiple feature selectors, for some reason, .transform(feature_names) remains useful while .get_support()
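
A small sketch of the first suggestion, assuming grid2 is a fitted GridSearchCV over a Pipeline with a step named 'feature_selection' and feature_names lists the input feature names:

    import numpy as np

    selector = grid2.best_estimator_.named_steps['feature_selection']
    mask = selector.get_support()              # boolean mask over the input features
    selected_names = np.asarray(feature_names)[mask]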

Re: [Scikit-learn-general] GSoC2015 topics

2015-02-05 Thread Joel Nothman
I think adding partial_fit functions in general to as many algorithms as possible would be nice Which could be a project in itself, for someone open to breadth rather than depth. On 6 February 2015 at 06:43, Kyle Kastner kastnerk...@gmail.com wrote: IncrementalPCA is done (have to add

Re: [Scikit-learn-general] Calculating standard deviation for k-fold cross validation estimate

2015-02-05 Thread Joel Nothman
With cv=5, only the training sets should overlap. Is this adjustment still appropriate? On 6 February 2015 at 06:44, Michael Eickenberg michael.eickenb...@gmail.com wrote: this is most probably due to the fact that 2 = sqrt(5 - 1), a correction to variance reduction incurred by the

Re: [Scikit-learn-general] Sharing objects between Python 2 and 3

2015-01-24 Thread Joel Nothman
, but would like to avoid that if possible! (these test RFs are in my repo.) I'm on a different computer right now so will submit pickle traceback later... But hoping there's a good joblib-based solution! =) Juan. On Fri, Jan 23, 2015 at 1:38 PM, Joel Nothman joel.noth...@gmail.com wrote

Re: [Scikit-learn-general] Sharing objects between Python 2 and 3

2015-01-22 Thread Joel Nothman
Could you provide the traceback when using pickle? The joblib error is about zipping, which should not be applicable there... On 23 January 2015 at 13:30, Juan Nunez-Iglesias jni.s...@gmail.com wrote: Nope, the Py2 RF was saved with joblib! The SO response might work for standard pickling

Re: [Scikit-learn-general] Elementary GridSearchCV question

2015-01-22 Thread Joel Nothman
That's not the learnt estimator. You're looking at the initial input (i.e. the parameters that are or are not changed during the search). The learnt estimators are cloned from that one, and the best is stored at clf.best_estimator_ (if refit=True). Cheers, Joel On 23 January 2015 at 12:20,

Re: [Scikit-learn-general] Different results using cross_val_score and StratifiedKFold

2015-01-19 Thread Joel Nothman
ROC AUC doesn't use binary predictions as its input; it uses the measure of confidence (or decision function) that each sample should be assigned 1. cross_val_score is correctly using decision_function to get these continuous values, and you should find its results replicated by using
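
For example, with clf, X_test and y_test taken as given from the original setup:

    from sklearn.metrics import roc_auc_score

    # ROC AUC needs continuous scores, not hard 0/1 predictions
    scores = clf.decision_function(X_test)     # or clf.predict_proba(X_test)[:, 1]
    print(roc_auc_score(y_test, scores))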

Re: [Scikit-learn-general] Majority rule Ensemble classifier

2015-01-13 Thread Joel Nothman
I wonder if these ensembles, while common, are too non-standard. Are there well-analysed variants of these models in the literature, or standard ways to configure them? If not, perhaps this is best presented as an example rather than made available in the library... On 14 January 2015 at 13:21, Andy

Re: [Scikit-learn-general] Thanks for help

2015-01-10 Thread Joel Nothman
Hi Timothy, You are not setting random_state for train_test_split. Please check if this fixes the problem. - Joel On 10 January 2015 at 01:57, Timothy Vivian-Griffiths vivian-griffith...@cardiff.ac.uk wrote: Ok, well once again, thank you for your reply. I will provide some of my code here
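
For instance, with X and y as in the original script (the import path moved to sklearn.model_selection in later releases):

    from sklearn.cross_validation import train_test_split

    # a fixed random_state makes the split, and hence the results, reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)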

Re: [Scikit-learn-general] Predicting with Pipeline

2015-01-08 Thread Joel Nothman
cross_val_score has created three different models for cross-validation. Which did you want to use to impute? After cross-validation you can fit the model on the whole dataset, although this may be bad practice depending on how you want to use the model. GridSearchCV is the common way to use

Re: [Scikit-learn-general] Dimension Requirements on train_test_split and GridSearchCV

2014-12-18 Thread Joel Nothman
I'm +1 for adding tests to ensure grid search meets usages that fall outside of the strict domains of scikit-learn's estimators. If users that apply it to problems of other shape (additional args, etc.) can write tests, or state their requirements, I think that would be valuable in ensuring

Re: [Scikit-learn-general] updating a model

2014-12-14 Thread Joel Nothman
If the estimator supports `partial_fit`, you can use that, repeatedly, instead of `fit`. See documentation: http://scikit-learn.org/stable/modules/scaling_strategies.html http://scikit-learn.org/stable/auto_examples/cluster/plot_dict_face_patches.html On 15 December 2014 at 14:55, Ady Wahyudi
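
A minimal sketch with SGDClassifier, one of the estimators supporting partial_fit; batches stands in for however you stream your (X, y) chunks:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    classes = np.unique(y)             # the full set of classes must be declared up front
    for X_batch, y_batch in batches:   # any iterable of (X, y) chunks
        clf.partial_fit(X_batch, y_batch, classes=classes)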

Re: [Scikit-learn-general] Exclusivity of scikit-learn

2014-12-03 Thread Joel Nothman
I agree. We should amend this sentence to say that if the paper is a clear-cut improvement on top of a widely used method, it should be examined. Done: http://scikit-learn.org/dev/faq.html. On 3 December 2014 at 20:07, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Wed, Dec 03,

Re: [Scikit-learn-general] Exclusivity of scikit-learn

2014-12-03 Thread Joel Nothman
While anything is better than publishing an extended fork of the main repository, I would like to see someone cite an instance where an open-slather contrib repository has been particularly successful (especially one where diverse contributions are assured). In line with Gaël's experience of

Re: [Scikit-learn-general] Exclusivity of scikit-learn

2014-12-03 Thread Joel Nothman
expensive. if i could do this in an easier manner, i wouldn't really ask for a common bleeding repo. cheers, satra On Wed, Dec 3, 2014 at 6:55 PM, Joel Nothman joel.noth...@gmail.com wrote: While anything is better than publishing an extended fork of the main repository, I would like to see

Re: [Scikit-learn-general] Exclusivity of scikit-learn

2014-12-02 Thread Joel Nothman
Hi Tom, Anyone is welcome to publish their implementations in a format compatible with scikit-learn's estimators. However, the centralised project already takes a vast amount of work (almost all of it unpaid) to maintain, even while adopting a very restrictive scope. Incorporating

Re: [Scikit-learn-general] design of scorer interface

2014-11-29 Thread Joel Nothman
So far I only have a strong opinion on not relying on the presence of decision_function or predict_proba to identify a classifier. Also, is the distinction we seek between classifiers and regressors, precisely, or between categorical and continuous predictors? (i.e. do we care that clusterers and

Re: [Scikit-learn-general] random forest prediction performance

2014-11-18 Thread Joel Nothman
This is generally the nature of working in numpy: operations are cheaper when they're done in bulk. On 18 November 2014 21:44, Lars Buitinck larsm...@gmail.com wrote: 2014-11-18 11:07 GMT+01:00 Nicola Sambin sam...@spaziodati.eu: - when I computed: for vector in vectors:
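
A concrete illustration, assuming clf is a fitted forest and X holds all the query vectors:

    # one call per sample pays the Python and validation overhead every time
    slow = [clf.predict(v.reshape(1, -1))[0] for v in X]

    # a single vectorised call amortises that overhead over the whole batch
    fast = clf.predict(X)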

Re: [Scikit-learn-general] issues/nice to have in sklearn

2014-11-11 Thread Joel Nothman
Is https://github.com/scikit-learn/scikit-learn/issues?q=is%3Aopen+is%3Aissue+label%3AEnhancement or https://github.com/scikit-learn/scikit-learn/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+Feature%22 what you're looking for? On 12 November 2014 15:07, Pagliari, Roberto rpagli...@appcomsci.com

Re: [Scikit-learn-general] Fast Johnson-Lindenstrauss Transform

2014-10-29 Thread Joel Nothman
It would be nice to have it implemented in a sklearn.random_projections-compatible form, but is there reason to believe it is stable/popular enough for inclusion in the repo? On 30 October 2014 00:24, Michal Romaniuk michal.romaniu...@imperial.ac.uk wrote: Hi everyone, I'm thinking of adding

Re: [Scikit-learn-general] feature selection

2014-10-20 Thread Joel Nothman
*Roberto On 21 October 2014 13:14, Joel Nothman joel.noth...@gmail.com wrote: I assume Robert's query is about RFECV. On 21 October 2014 07:35, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Hi, No expert here, either but there are also feature selection classes which compute

Re: [Scikit-learn-general] feature selection

2014-10-20 Thread Joel Nothman
I assume Robert's query is about RFECV. On 21 October 2014 07:35, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Hi, No expert here, either but there are also feature selection classes which compute the score per feature. A simple example would be the f_classif, which in a very broad

Re: [Scikit-learn-general] Access data arriving at leaf nodes

2014-10-14 Thread Joel Nothman
What do you mean by all the values that make up a leaf node? If you mean all the samples, isn't apply sufficient? On 15 October 2014 06:20, M Asad masad@gmail.com wrote: Hi, I am kind of new to scikit, however I have learned a lot of things now. I am using

Re: [Scikit-learn-general] Suggestion: break up the metrics module

2014-10-14 Thread Joel Nothman
We had a plan to move out the model selection stuff. Presently that talked about moving scorers, but not necessarily the metrics underlying them On 15 October 2014 07:16, Lars Buitinck larsm...@gmail.com wrote: 2014-10-14 21:53 GMT+02:00 Robert Layton robertlay...@gmail.com: Currently the

Re: [Scikit-learn-general] Access data arriving at leaf nodes

2014-10-14 Thread Joel Nothman
in range(0, index.shape[1]): leafVals[j,i] = forestClf.estimators_[i].tree_.value[index[j,i]] Many thanks in advance Muhammad Date: Wed, 15 Oct 2014 07:59:09 +1100 From: Joel Nothman joel.noth...@gmail.com Subject: Re: [Scikit-learn-general] Access data arriving at leaf nodes

Re: [Scikit-learn-general] feature union

2014-10-07 Thread Joel Nothman
I don't think it should be fit. You can create a PR to remove it, afaik. On 8 October 2014 04:48, Pagliari, Roberto rpagli...@appcomsci.com wrote: I read this page on the documentation http://scikit-learn.org/stable/auto_examples/feature_stacker.html why is svm.fit needed before

Re: [Scikit-learn-general] feature union

2014-10-07 Thread Joel Nothman
You can even just edit the file directly at https://github.com/scikit-learn/scikit-learn/blob/master/examples/feature_stacker.py On 8 October 2014 08:16, Lars Buitinck larsm...@gmail.com wrote: 2014-10-07 23:03 GMT+02:00 Pagliari, Roberto rpagli...@appcomsci.com: Do I just use the bug

Re: [Scikit-learn-general] train_test_split return values

2014-09-20 Thread Joel Nothman
Or rather, it is a shallow copy. On 20 September 2014 03:09, Andy t3k...@gmail.com wrote: On 09/18/2014 10:34 PM, Joel Nothman wrote: A copy If you use a list as input it is not a copy. -- Slashdot TV. Video

Re: [Scikit-learn-general] train_test_split return values

2014-09-18 Thread Joel Nothman
A copy On 19 September 2014 06:32, Pagliari, Roberto rpagli...@appcomsci.com wrote: When using train_test_split, is the output a reference to the input data, or a deep copy? -- Slashdot TV. Video for Nerds. Stuff

Re: [Scikit-learn-general] binarizer with more levels

2014-09-13 Thread Joel Nothman
? Thanks, *From:* Joel Nothman [mailto:joel.noth...@gmail.com] *Sent:* Thursday, September 11, 2014 9:37 PM *To:* scikit-learn-general *Subject:* Re: [Scikit-learn-general] binarizer with more levels Good point. It should be straightforward in any case, something like: class

Re: [Scikit-learn-general] binarizer with more levels

2014-09-11 Thread Joel Nothman
For quantizing or binning? Not currently. On 12 September 2014 06:31, Pagliari, Roberto rpagli...@appcomsci.com wrote: Is there something like the binarizer with more levels (thresholds provided with input) Thanks

Re: [Scikit-learn-general] binarizer with more levels

2014-09-11 Thread Joel Nothman
get_params_ missing etc… I guess I need to derive my own binarizer from some other classes. Is there a way to simplify the process? Essentially, what I need is the binarizer, with more levels (and thresholds provided to the constructors). Thank you *From:* Joel Nothman [mailto:joel.noth

Re: [Scikit-learn-general] binarizer with more levels

2014-09-11 Thread Joel Nothman
September 2014 11:20, Pagliari, Roberto rpagli...@appcomsci.com wrote: In my case I would like to do it right after scaling, while doing grid search. This would be different to quantize the entire training set at the beginning. Thank you, *From:* Joel Nothman [mailto:joel.noth

Re: [Scikit-learn-general] ValueError: The number of classes has to be greater than one.

2014-09-11 Thread Joel Nothman
Use StratifiedKFold On 12 September 2014 13:03, Pagliari, Roberto rpagli...@appcomsci.com wrote: When using SVM or linearSVC, is it possible to force cross_validation.KFold to generate subsets with both classes (in the case of a two-class problem)?
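
For example, with clf, X and y as in the question (these classes lived in sklearn.cross_validation at the time; sklearn.model_selection in later releases):

    from sklearn.cross_validation import StratifiedKFold, cross_val_score

    # each fold keeps roughly the same class proportions as y, so no fold
    # ends up containing a single class
    cv = StratifiedKFold(y, n_folds=5)
    scores = cross_val_score(clf, X, y, cv=cv)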

Re: [Scikit-learn-general] on scaling and grid search

2014-09-06 Thread Joel Nothman
Scaling (or the same scaling procedure) is not always beneficial, but you can certainly do exactly what you are saying by making a pipeline of a StandardScaler and your estimator. See the documentation for Pipeline at http://scikit-learn.org/dev/modules/pipeline.html and
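
A sketch of that pipeline inside a grid search; SVC and the parameter grid are just placeholders:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later releases

    pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC())])
    # the scaler is re-fit on each training fold, so the test fold never leaks
    # into the scaling statistics
    grid = GridSearchCV(pipe, param_grid={'svm__C': [0.1, 1, 10]}, cv=5)
    grid.fit(X, y)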

Re: [Scikit-learn-general] sparse datasets loading

2014-08-31 Thread Joel Nothman
We should not encourage users to store sparse data in CSV format. +1 the technique shown by Lars could be applied to any row-oriented format, be it text or data read from the network. Perhaps, but then they can construct a sparse format, such as a dict that is passed to DictVectorizer. On

Re: [Scikit-learn-general] Problem with stacking text and binary features in FeatureUnion

2014-08-30 Thread Joel Nothman
I cannot immediately tell why this doesn't work. Firstly, I assume (and hope) it has nothing to do with transformer_weights. Check that removing this still results in the error. The error implies that the transformers (pipelines) are producing data of different shape. Perhaps adding another

Re: [Scikit-learn-general] Problem with stacking text and binary features in FeatureUnion

2014-08-30 Thread Joel Nothman
-08-30 18:07 GMT+08:00 Joel Nothman joel.noth...@gmail.com: I cannot immediately tell why this doesn't work. Firstly, I assume (and hope) it has nothing to do with transformer_weights. Check that removing this still results in the error. The error implies that the transformers (pipelines

Re: [Scikit-learn-general] Problem with stacking text and binary features in FeatureUnion

2014-08-30 Thread Joel Nothman
On the other hand I can't seem to replicate your error. On 30 August 2014 21:56, Joel Nothman joel.noth...@gmail.com wrote: That's not a solution I'm happy with :s On 30 August 2014 21:35, Lakomkin Egor egor.lakom...@gmail.com wrote: Joel, Thank you for your reply. I fixed the problem

Re: [Scikit-learn-general] TdidfTransformer when applied to test dataset

2014-08-24 Thread Joel Nothman
dataset. Overfitting? Thanks! From: Joel Nothman joel.noth...@gmail.com Reply-To: scikit-learn-general@lists.sourceforge.net scikit-learn-general@lists.sourceforge.net Date: Tuesday, 19 August 2014 00:44 To: scikit-learn-general scikit-learn-general@lists.sourceforge.net Subject

Re: [Scikit-learn-general] delta idf and bm25

2014-08-23 Thread Joel Nothman
I agree with Vlad that delta-IDF is interesting; but it is not well supported by the community, and I'm not sure it is worth including ... yet. As Lars points out (and as you suggest), there are other ways to supervise feature weighting. I agree this has to be a separate transformer

Re: [Scikit-learn-general] optimal n_jobs in GridSearchCV

2014-08-21 Thread Joel Nothman
On 21 August 2014 21:46, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Thu, Aug 21, 2014 at 09:44:37PM +1000, Joel Nothman wrote: I think RandomForestClassifier, using multithreading in version 0.15, should work nested in multiprocessing. Good point, as it uses threading. Thus

Re: [Scikit-learn-general] Custom Scoring Functions for Grid Search

2014-08-20 Thread Joel Nothman
It's actually simpler than that issue, Michael. GridSearchCV (and RandomizedSearchCV) has a score method that is unintuitive. It will generally not use the metric passed to `scoring`. But yes, in `fit`, it has used the correct scoring metric. IMO, it should be changed. But it's been this way

Re: [Scikit-learn-general] Custom Scoring Functions for Grid Search

2014-08-20 Thread Joel Nothman
On 20 August 2014 21:41, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Wed, Aug 20, 2014 at 01:37:36PM +0200, federico vaggi wrote: Are there any reasons at all for keeping score function in its current form? No. I think that it is a bug. I'd like it changed, but we need to agree

Re: [Scikit-learn-general] Custom Scoring Functions for Grid Search

2014-08-20 Thread Joel Nothman
I was all too glad to put together a patch: https://github.com/scikit-learn/scikit-learn/pull/3580 On 21 August 2014 01:34, Vlad Niculae zephy...@gmail.com wrote: It has confused me as well, +1. It's counterintuitive and broken, in my opinion. Vlad On Wed, Aug 20, 2014 at 2:31 PM, Gael

Re: [Scikit-learn-general] Large dataset causing Array can't be memory-mapped. Python objects in dtype.

2014-08-19 Thread Joel Nothman
I suspect this is a bug in joblib, and that you won't get it with n_jobs=1. Joblib employs memmap for inter-process communication if the array is larger than a fixed size: https://github.com/joblib/joblib/blob/master/joblib/pool.py#L203. It seems it needs another criterion to ensure that the

Re: [Scikit-learn-general] Large dataset causing Array can't be memory-mapped. Python objects in dtype.

2014-08-19 Thread Joel Nothman
: # joblib.Parallel functools.partial(class 'sklearn.externals.joblib.parallel.Parallel', max_nbytes=None) I still get the same error though. On Tue, Aug 19, 2014 at 8:19 AM, Joel Nothman joel.noth...@gmail.com wrote: I suspect this is a bug in joblib, and that you won't get

Re: [Scikit-learn-general] Large dataset causing Array can't be memory-mapped. Python objects in dtype.

2014-08-19 Thread Joel Nothman
You can also modify that line in sklearn/externals/joblib/pool.py in your local copy of scikit-learn to include an additional condition: and a.dtype.kind != 'O' On 19 August 2014 16:55, Joel Nothman joel.noth...@gmail.com wrote: Oh well. I'm not a very experienced monkey-patcher. There may

Re: [Scikit-learn-general] Large dataset causing Array can't be memory-mapped. Python objects in dtype.

2014-08-19 Thread Joel Nothman
(or better, a.dtype.hasobject) On 19 August 2014 16:59, Joel Nothman joel.noth...@gmail.com wrote: You can also modify that line in sklearn/externals/joblib/pool.py in your local copy of scikit-learn to include an additional condition: and a.dtype.kind != 'O' On 19 August 2014 16:55, Joel
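
A quick way to check whether an array would hit that code path, with my_data standing in for whatever is being passed to the parallel call (no patching required):

    import numpy as np

    a = np.asarray(my_data)
    print(a.dtype, a.dtype.hasobject)   # object-dtype arrays cannot be memory-mapped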

Re: [Scikit-learn-general] make_multilabel_classification n_labels

2014-08-19 Thread Joel Nothman
Hi Krishna, I have no problem seeing the difference between n_labels=2 and n_labels=10. However the number of labels per sample can never exceed n_classes, so it is not really the mean number of labels per sample, but the expected value of the Poisson distribution from which the number of labels

Re: [Scikit-learn-general] TdidfTransformer when applied to test dataset

2014-08-18 Thread Joel Nothman
If I understand your question correctly, the answer is yes! If you want a clearer response, you might clarify what the alternative hypothesis is to your suggestion. On 19 August 2014 03:13, ZORAIDA HIDALGO SANCHEZ zoraida.hidalgosanc...@telefonica.com wrote: I am using TdidfTransformer on

Re: [Scikit-learn-general] normalizing values (preprocessing)

2014-08-16 Thread Joel Nothman
As with any other estimators in the scikit-learn API, these model parameters are stored in attributes of the estimator object after fit() is called. See the Attributes section of the class documentation. On 17 August 2014 11:39, Pagliari, Roberto rpagli...@appcomsci.com wrote: It does not

Re: [Scikit-learn-general] penalty l1, loss l2

2014-08-14 Thread Joel Nothman
We are searching for the model that minimises a loss (the norm of the vector of differences between predictions and true targets) with a penalty/regularization term (the norm of the vector of weights). l1 and l2 are types of vector norm: l1 refers to the sum of the absolute values of a vector; l2
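
In numpy terms:

    import numpy as np

    w = np.array([1.0, -2.0, 3.0])
    l1 = np.sum(np.abs(w))          # 6.0    : sum of absolute values
    l2 = np.sqrt(np.sum(w ** 2))    # ~3.742 : Euclidean length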

Re: [Scikit-learn-general] split function with non repeated sets

2014-08-14 Thread Joel Nothman
I suggested KFold because it guarantees that each test set has no overlap with any other, and that all test sets are together a complete partition of the data. On 15 August 2014 04:30, Michael Eickenberg michael.eickenb...@gmail.com wrote: not even kfold does that. the train sets overlap. what

Re: [Scikit-learn-general] FeatureUnion: different transformers to different data

2014-08-13 Thread Joel Nothman
Hi Zoraida, FeatureUnion, together with Pipeline, can already be used for this purpose, although we would benefit from an illustrative example. https://github.com/scikit-learn/scikit-learn/issues/2034 suggests providing a simpler API for this common use-case, but it is hard to come up with an

Re: [Scikit-learn-general] split function with non repeated sets

2014-08-13 Thread Joel Nothman
Could you be more specific, perhaps with an example? Do you mean something like KFold? On 14 August 2014 14:15, Pagliari, Roberto rpagli...@appcomsci.com wrote: Is there a function similar to split function, which does not generate repeated train/test sets?

Re: [Scikit-learn-general] train_test_split consumes too much memory

2014-08-07 Thread Joel Nothman
Are you sure it is train_test_split itself that is taking a long time? What are the dimensions of your data? Are they stored in memory as a numpy array when you call train_test_split? On my MacBook with 16GB RAM I have no problem train_test_splitting np.empty((100, 500),dtype=np.float64),

Re: [Scikit-learn-general] train_test_split consumes too much memory

2014-08-07 Thread Joel Nothman
Try 0.15.1 On 8 August 2014 00:22, ZORAIDA HIDALGO SANCHEZ zoraida.hidalgosanc...@telefonica.com wrote: Andy, I am using version 0.14.1. My data are a Python list of strings :_| From: Andreas Mueller t3k...@gmail.com Reply-To: scikit-learn-general@lists.sourceforge.net

Re: [Scikit-learn-general] GridSearch comparing two preprocessors (or graph paths)

2014-08-07 Thread Joel Nothman
This is possible with https://github.com/scikit-learn/scikit-learn/pull/1769, which includes an example of something quite similar. Reviews would be greatly appreciated! On 8 August 2014 07:32, Ronnie Ghose ronnie.gh...@gmail.com wrote: No afaik but it's easy enough to build in :) On Aug 7,

Re: [Scikit-learn-general] Using LSH Forest approximate neibghbor search in DBSCAN[GSoC]

2014-08-06 Thread Joel Nothman
that it can be used with pipelines. How would we deal with this? On Wed, Aug 6, 2014 at 11:22 AM, Joel Nothman joel.noth...@gmail.com wrote: It seems to me that the LSH forest is substituting for the `algorithm` parameter, which selects between ball_tree, kd_tree and brute search for nearest

Re: [Scikit-learn-general] Using LSH Forest approximate neibghbor search in DBSCAN[GSoC]

2014-08-05 Thread Joel Nothman
It seems to me that the LSH forest is substituting for the `algorithm` parameter, which selects between ball_tree, kd_tree and brute search for nearest neighbour search. These are designed not to take additional parameters. So you need to accept additional parameters. You could indeed create

Re: [Scikit-learn-general] Sharing my quick hacks for constructing pipelines/gridsearches

2014-08-02 Thread Joel Nothman
You might enjoy `make_union` and `make_pipeline` in the 0.15 release. On 3 August 2014 01:09, Anders Aagaard aagaa...@gmail.com wrote: Hi I found myself constructing custom BaseEstimators very often to do neat stuff with pipelines. And I almost always use pandas dataframe for easy
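
For example (the particular steps here are placeholders):

    from sklearn.pipeline import make_pipeline, make_union
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest
    from sklearn.svm import SVC

    # step names are derived from the class names, so no boilerplate tuples
    model = make_pipeline(StandardScaler(),
                          make_union(PCA(n_components=10), SelectKBest(k=5)),
                          SVC())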

Re: [Scikit-learn-general] How to implement cross_val_score scoring function with a weights array?

2014-08-02 Thread Joel Nothman
You could use the implementation of sample_weight support in cross-validation from https://github.com/scikit-learn/scikit-learn/pull/1574, which should work but doesn't have much in the way of tests. It may be superseded by https://github.com/scikit-learn/scikit-learn/pull/3524 On 3 August 2014

Re: [Scikit-learn-general] [GSoC] - Logistic Regression CV (Manoj Kumar)

2014-07-29 Thread Joel Nothman
makes it difficult, to compare the effects of really low tolerances with different solvers. @Joel and other core devs Sorry for the dumb question but what is the status on modifying the the liblinear source files? On Tue, Jul 29, 2014 at 2:44 AM, Joel Nothman joel.noth...@gmail.com

Re: [Scikit-learn-general] calculate the posterior probability

2014-07-29 Thread Joel Nothman
Here: https://github.com/scikit-learn/scikit-learn/pull/1176 On 29 July 2014 21:59, Lars Buitinck larsm...@gmail.com wrote: 2014-07-28 23:46 GMT+02:00 Mario Michael Krell kr...@uni-bremen.de: I have to somehow contradict. In fact it would be possible to get a probability but it requires

Re: [Scikit-learn-general] sparse datasets loading

2014-07-29 Thread Joel Nothman
I think the scipy folks intend that numpy-like setting operations should suffice for many cases (although be a bit slower than the technique you've illustrated). E.g. you can use: X[i, nonzero] = data[nonzero] to replace some lines of Lars' code. One disadvantage of this approach is needing to
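
A sketch of that approach with a LIL matrix; rows, n_rows and n_cols stand in for whatever row-oriented source is being read:

    import numpy as np
    import scipy.sparse as sp

    X = sp.lil_matrix((n_rows, n_cols))   # LIL supports cheap incremental assignment
    for i, row in enumerate(rows):        # each row is a dense 1-d array
        nonzero = np.nonzero(row)[0]
        X[i, nonzero] = row[nonzero]
    X = X.tocsr()                         # convert once for fast downstream use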

Re: [Scikit-learn-general] gridSearchCV best_estimator_ best_score_

2014-07-29 Thread Joel Nothman
the whole training set (with C found earlier), or are they the averaged over the k folds? This is not explicitly mentioned in the documentation. I’m trying to understand what the text highlighted above means. Thank you, Roberto *From:* Joel Nothman [mailto:joel.noth...@gmail.com] *Sent

Re: [Scikit-learn-general] gridSearchCV best_estimator_ best_score_

2014-07-28 Thread Joel Nothman
I do think you're right to attempt to improve it! Please submit a PR! On 29 July 2014 00:05, Pagliari, Roberto rpagli...@appcomsci.com wrote: You are right. I guess only C (in the case of linear SVM) is the best averaged over the fold. And once C is found, the weights over the whole

Re: [Scikit-learn-general] gridSearchCV best_estimator_ best_score_

2014-07-28 Thread Joel Nothman
for the clarification, *From:* Joel Nothman [mailto:joel.noth...@gmail.com] *Sent:* Monday, July 28, 2014 10:32 AM *To:* scikit-learn-general *Subject:* Re: [Scikit-learn-general] gridSearchCV best_estimator_ best_score_ I do think you're right to attempt to improve it! Please

Re: [Scikit-learn-general] RBK Kernel - Query

2014-07-28 Thread Joel Nothman
You can find the answer by googling scikit-learn-general and umang patel: https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg10981.html As it does not pertain directly to scikit-learn, this is also a question that you might get a more thorough answer for in a forum like

Re: [Scikit-learn-general] [GSoC] - Logistic Regression CV (Manoj Kumar)

2014-07-28 Thread Joel Nothman
There is actually an open PR to import the sample_weight changes into the scikit-learn copy of liblinear: https://github.com/scikit-learn/scikit-learn/pull/2784. It would appreciate some love, or someone to executively decide that it's not worth including. On 29 July 2014 10:36, Sean Violante

Re: [Scikit-learn-general] gridSearchCV best_estimator_ best_score_

2014-07-26 Thread Joel Nothman
I think best_estimator_ could also be clarified a bit more to say that it is refit on all training data (and only available if refit=True) On 26 July 2014 18:42, Andy t3k...@gmail.com wrote: On 07/25/2014 10:30 PM, Pagliari, Roberto wrote: Hi Andy, Maybe it’s just me, but the ”left out

Re: [Scikit-learn-general] Evaluation measure for imbalanced data

2014-07-23 Thread Joel Nothman
CORRELATION http://dspace2.flinders.edu.au/xmlui/bitstream/handle/2328/27165/Powers%20Evaluation.pdf I warmly recommend MCC, though lots of people still use ROC On Wed, Jul 23, 2014 at 6:09 AM, Joel Nothman joel.noth...@gmail.com wrote: Precision, Recall and F-measure are often contrasted

Re: [Scikit-learn-general] 'GridSearchCV' object has no attribute 'best_estimator_'

2014-07-23 Thread Joel Nothman
Please make sure you call fit() first, as in http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html On 24 July 2014 02:07, Pagliari, Roberto rpagli...@appcomsci.com wrote: I’m getting this error when trying to predict using the result of grid search with

Re: [Scikit-learn-general] Evaluation measure for imbalanced data

2014-07-22 Thread Joel Nothman
Precision, Recall and F-measure are often contrasted with Accuracy in terms of their handling imbalance. I'm sure I could find a textbook citation, but for an online example Chris Manning thus introduces P/R/F in the imbalanced spam classification problem on coursera:

Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Joel Nothman
cf. https://github.com/scikit-learn/scikit-learn/pull/3243 On 17 July 2014 08:59, Christian Jauvin cjau...@gmail.com wrote: I can open an issue, but on the other hand, you could argue that the new behaviour is now at least consistent with the other encoder types, e.g.: le = LabelEncoder()

Re: [Scikit-learn-general] scikit-learn 0.15.0 is out \o/

2014-07-15 Thread Joel Nothman
Yay! Thanks Olivier for getting this out the door! On 15 July 2014 21:37, Valerio Maggio valerio.mag...@gmail.com wrote: On 15 Jul 2014, at 13:13, Olivier Grisel olivier.gri...@ensta.org wrote: http://scikit-learn.org/stable/whats_new.html Plenty of wheel packages on PyPI and people

Re: [Scikit-learn-general] Sample weighting in RandomizedSearchCV

2014-07-08 Thread Joel Nothman
This shouldn't be the case, though it's not altogether well-documented. According to https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1225, if the fit_params value has the same length as the samples, it should be similarly indexed. So this would be a bug ...

Re: [Scikit-learn-general] An advise about suitable datasets

2014-07-07 Thread Joel Nothman
But corpora-list http://mailman.uib.no/listinfo/corpora might be a better place to ask. On 7 July 2014 13:43, Maheshakya Wijewardena pmaheshak...@gmail.com wrote: Thank you Kyle. I have a look in these. Maheshakya On Mon, Jul 7, 2014 at 10:41 PM, Kyle Kastner kastnerk...@gmail.com

Re: [Scikit-learn-general] Concatenating scikit.sparse matrix and numpy arrays

2014-07-03 Thread Joel Nothman
response Joel, I may be wrong but FeatureUnion is for the same X and I have several X (one for each source), isn’t it? Thanks. From: Joel Nothman joel.noth...@gmail.com Reply-To: scikit-learn-general@lists.sourceforge.net scikit-learn-general@lists.sourceforge.net Date: Thursday, 3 July

Re: [Scikit-learn-general] Extending TfIdf Vectorizer to use given idf set

2014-07-01 Thread Joel Nothman
Pulling the IDF out of Lucene is a little bit trickier, but otherwise DictVectorizer pipelined with TfidfTransformer should be able to do this. On 1 July 2014 16:40, Lars Buitinck larsm...@gmail.com wrote: 2014-07-01 21:03 GMT+02:00 Geetu Ambwani geet...@gmail.com: I imagine this transformer
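
A minimal sketch of that pipeline, with toy term-count dicts standing in for the data pulled from Lucene; note this computes IDF from the given documents, and injecting an externally supplied IDF set is the trickier part mentioned above:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer

    docs = [{'cat': 2, 'dog': 1}, {'dog': 3, 'fish': 1}]   # {term: raw count} per document
    tfidf = make_pipeline(DictVectorizer(), TfidfTransformer())
    X = tfidf.fit_transform(docs)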

Re: [Scikit-learn-general] Extending TfIdf Vectorizer to use given idf set

2014-07-01 Thread Joel Nothman
} } } } } So we get individual term frequency and document frequency per field. We need some combination of the DictVectorizer pipelined with a kind of TfIdfTransformer that can compute tf/idf from the json data given. On Tue, Jul 1, 2014 at 5:30 PM, Joel Nothman joel.noth

Re: [Scikit-learn-general] Strings as features

2014-06-30 Thread Joel Nothman
are easy, I guess. The former concatenates the features obtained from each of the individual estimators given as input, whereas the latter applies the estimators to the result obtained from the previous estimator in a chained fashion. On Mon, Jun 23, 2014 at 1:06 AM, Joel Nothman joel.noth...@gmail.com wrote

Re: [Scikit-learn-general] Clustering using TfidfVectorizer

2014-06-30 Thread Joel Nothman
It may be beneficial to use some kind of query expansion or unsupervised dimensionality reduction, as the vectors from a bag of words encoding will probably be very sparse. Does that help? On 30 June 2014 03:03, Abijith Kp abijith@gmail.com wrote: Hi, Is it possible to use

Re: [Scikit-learn-general] Explicitly mention loadings in PCA documentation/examples

2014-06-27 Thread Joel Nothman
I have been hoping at some point to extend the document generation such that it automatically inserts Example links (with thumbnail icons) from reference API pages (e.g. http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA) to examples where that

Re: [Scikit-learn-general] Force DictVectorizer to 1-of-N encode ALL features

2014-06-26 Thread Joel Nothman
Turning them into strings first is far and away the easiest solution! Alternatively, look up the feature names in the dict_vectorizer.feature_names_ attribute, then follow the DictVectorizer with a OneHotEncoder where the categorical_features parameter is set. HTH, Joel On 26 June 2014 17:54,
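
The string route, sketched with toy records:

    from sklearn.feature_extraction import DictVectorizer

    data = [{'city': 'NYC', 'floor': 3}, {'city': 'SF', 'floor': 5}]
    # stringify every value so DictVectorizer one-hot encodes the numeric fields too
    data = [{k: str(v) for k, v in d.items()} for d in data]
    X = DictVectorizer().fit_transform(data)
    # feature names now look like 'city=NYC', 'floor=3', ...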

Re: [Scikit-learn-general] Help for learning/contributing

2014-06-25 Thread Joel Nothman
Hi Ignacio, A good starting place is often working on the documentation. For example, https://github.com/scikit-learn/scikit-learn/pull/3084 is an attempt at filling in a gap in the documentation, but it doesn't look like Raul is going to complete the work any time soon. If you want to pull his

Re: [Scikit-learn-general] Getting decision tree regressor to predict using median not mean, of final subset

2014-06-23 Thread Joel Nothman
I think that should be Tree.apply, not apply_Tree. I.e. I guess you want to use something like (unverified): for leaf_ind, pairs in groupby(sorted(zip(regressor.tree_.apply(X_train), y_train)), operator.itemgetter(0)): regressor.tree_.value[leaf_ind, ...] = np.median([y for _, y in pairs]) On 23
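
A fuller, still-unverified sketch with the needed imports, assuming regressor is a fitted DecisionTreeRegressor and X_train, y_train are its training data; whether writing into tree_.value persists depends on the scikit-learn version:

    import numpy as np
    from itertools import groupby
    from operator import itemgetter

    # Tree.apply expects float32 input
    leaves = regressor.tree_.apply(np.asarray(X_train, dtype=np.float32))
    for leaf, group in groupby(sorted(zip(leaves, y_train)), key=itemgetter(0)):
        # value has shape (n_nodes, n_outputs, 1) for a regressor
        regressor.tree_.value[leaf, 0, 0] = np.median([y for _, y in group])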

Re: [Scikit-learn-general] OneVsRestClassifier : No attribute predict_proba

2014-06-23 Thread Joel Nothman
It seems that there is a class label present in all training instances... On 23 June 2014 10:20, abhishek abhish...@gmail.com wrote: Hi all, Ive been getting this very weird error when using OneVsRestClassifier.

Re: [Scikit-learn-general] OneVsRestClassifier : No attribute predict_proba

2014-06-23 Thread Joel Nothman
Not that this error is correct behaviour, but that you might not be aware that there is a likely problem with your data. On 23 June 2014 10:30, Joel Nothman joel.noth...@gmail.com wrote: It seems that there is a class label present in all training instances... On 23 June 2014 10:20
