Re: [Scikit-learn-general] Finding a corresponding leaf node for each data point in a decision tree

2015-05-26 Thread Andreas Mueller
Actually with "the newest version" Gilles meant the "dev" version 0.17-dev that is not released yet. So with 0.16.1, your way (manual conversion) is the right way, and with using the dev version or after the release, you can just do tree.apply(X). On 05/25/2015 12:33 PM, Kittipat Kampa wrote:
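
A minimal sketch of the `tree.apply(X)` route, assuming scikit-learn 0.17 or later. The iris data and `max_depth=3` are illustrative choices and do not come from the thread:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and settings (assumptions, not from the thread).
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# apply() returns, for every sample, the index of the leaf node it ends up in.
leaf_ids = clf.apply(X)

# In the fitted tree_ structure, leaves are the nodes with no left child (-1).
is_leaf = clf.tree_.children_left == -1
print(leaf_ids[:5], is_leaf[leaf_ids].all())
```

Every index returned by `apply` should point at a leaf, which is what the last line checks.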

Re: [Scikit-learn-general] Feature selection

2015-05-28 Thread Andreas Mueller
Hi Herbert. 1) Often reducing the features space does not help with accuracy, and using a regularized classifier leads to better results. 2) To do feature selection, you need two methods: one to reduce the set of features, another that does the actual supervised task (classification here). Ha
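
The two-step setup from point 2 could be sketched as a pipeline. `SelectKBest`, `f_classif`, `k=5` and the synthetic data are assumptions picked for illustration; the thread does not say which selector or classifier was meant:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data, purely illustrative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Step 1 reduces the feature set, step 2 does the supervised task.
pipe = make_pipeline(SelectKBest(f_classif, k=5),
                     LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# Number of features the selector kept.
n_selected = int(pipe.named_steps["selectkbest"].get_support().sum())
print(n_selected)
```

Per point 1, it is worth comparing this against the regularized classifier on all features before settling on the selection step.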

Re: [Scikit-learn-general] [GSoC2015 metric learning]

2015-05-28 Thread Andreas Mueller
Hi Artem. Thanks for sharing the post. For 1.: Always start without Cython. Then do profiling to identify the bottlenecks. Then we can talk about if the bottleneck can be helped with Cython. For this kind of algorithm, I'd expect the bottleneck to be in some matrix multiplication (though I don'

Re: [Scikit-learn-general] [GSoC2015 Improve GMM module]

2015-05-28 Thread Andreas Mueller
en 4 pdf pages full of equations. I think there will be 10 pages for all four kinds of covariance matrix. Upon I finish that, I will upload it to my blog. Thanks, Wei Xue On Tue, May 19, 2015 at 11:07 AM, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: Hey Wei Xue. Thanks

Re: [Scikit-learn-general] how to know which feature is informative or redundant in make_classification()?

2015-05-28 Thread Andreas Mueller
How large is your noise and what are the other arguments to the function? Use the source, Luke: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/samples_generator.py The data is generated the way Joel said. On 05/28/2015 12:13 PM, Daniel Homola wrote: Hi Joel, I mig
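
One way to see which columns are which, as a sketch: with `shuffle=False`, the `make_classification` documentation describes the column order as informative features first, then redundant ones, then repeated ones, then noise. The sizes below are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant = 3, 2

# shuffle=False keeps the documented column order:
# informative, redundant, repeated, then noise.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=n_informative,
                           n_redundant=n_redundant, n_repeated=0,
                           shuffle=False, random_state=0)

informative = X[:, :n_informative]
redundant = X[:, n_informative:n_informative + n_redundant]

# Redundant columns are linear combinations of the informative ones,
# so stacking them does not increase the rank.
rank = np.linalg.matrix_rank(np.hstack([informative, redundant]))
print(rank)
```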

Re: [Scikit-learn-general] [GSoC2015 metric learning]

2015-05-28 Thread Andreas Mueller
I talked to my office mate Brian McFee for some feedback. Apparently there are some memory and computation saving methods, but unfortunately none of them are well published. I think you should try to create some benchmarks for your current implementation, using classification with 1 nearest neighbo

Re: [Scikit-learn-general] [GSoC2015 metric learning]

2015-05-28 Thread Andreas Mueller
On 05/28/2015 05:11 PM, Michael Eickenberg wrote: > > Code-wise, I would attack the problem as a function first. Write a > function that takes X and y (plus maybe some options) and gives back > L. You can put a skeleton of a sklearn estimator around it by calling > this function from fit. > Pl

Re: [Scikit-learn-general] my silence

2015-06-01 Thread Andreas Mueller
Thanks for letting us know. I'll discuss some API stuff with Gael today, afterwards there might be some interesting issues ;) On 05/31/2015 08:28 PM, Joel Nothman wrote: Just a quick note that I've been silent lately because I've been Busy With Life, but also because github was notifying an e

Re: [Scikit-learn-general] Classifiers that do not require feature scaling

2015-06-04 Thread Andreas Mueller
Tree-based methods are the only ones that are invariant to feature scaling, so DecisionTree*, RandomForest*, ExtraTrees*, Bagging* (with trees), GradientBoosting* (with trees). For all other algorithms, the outcome will be different whether you scale your data or not. For algorithms like
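
A small check of the invariance claim for a single decision tree, as a sketch. The data, the depth and the scale factors are assumptions; powers of two are used only so that the rescaling is exact in floating point:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Per-feature positive scale factors (illustrative). Powers of two keep
# the rescaling exact, so only the scale changes, not the ordering.
scale = np.array([1.0, 8.0, 64.0, 1024.0])

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X * scale, y)

# Splits depend only on the ordering of values within each feature,
# so both trees should label every sample the same way.
same_predictions = np.array_equal(tree_raw.predict(X),
                                  tree_scaled.predict(X * scale))
print(same_predictions)
```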

Re: [Scikit-learn-general] Classifiers that do not require feature scaling

2015-06-04 Thread Andreas Mueller
On 06/04/2015 02:04 PM, Sturla Molden wrote: > On 04/06/15 17:15, Andreas Mueller wrote: > >> Tree-based methods are the only ones that are invariant towards feature >> scaling, do DecisionTree*, RandomForest*, ExtraTrees*, Bagging* (with >> trees), GradientBoosting* (wi

Re: [Scikit-learn-general] Classifiers that do not require feature scaling

2015-06-05 Thread Andreas Mueller
The results on scaled and non-scaled data will be different because the regularization will have a different effect. On 06/05/2015 03:10 AM, Yury Zhauniarovich wrote: Thank you all! However, what Sturla wrote is now out of my understanding. One more question. It seems also to me that Naive Bayes

Re: [Scikit-learn-general] Classifiers that do not require feature scaling

2015-06-05 Thread Andreas Mueller
that there is no technical advantage of feature scaling, however, the results will be different with and without scaling. On Jun 5, 2015, at 1:03 PM, Andreas Mueller mailto:t3k...@gmail.com>> wrote: The result of scaled an non-scaled data will be different because the reg

Re: [Scikit-learn-general] Estimator Overview / Summary

2015-06-08 Thread Andreas Mueller
Ah, I thought it was you, but I didn't think you closed it. Thanks! On 06/07/2015 02:47 AM, Mathieu Blondel wrote: https://github.com/scikit-learn/scikit-learn/pull/804 Thanks for working on this! Mathieu On Sun, Jun 7, 2015 at 6:11 AM, Andy > wrote: Hi all.

Re: [Scikit-learn-general] SciPy 2015 Birds-of-a-Feather Submission

2015-06-08 Thread Andreas Mueller
Sounds like a good idea. How about an open discussion BoF? Anyone thinks that would be good? On 06/06/2015 10:36 AM, Kyle Mandli wrote: > Members of the scikit-learn Community, > > As one of the co-chairs in charge of organizing the birds-of-a-feather > sessions at SciPy this year I wanted to rea

Re: [Scikit-learn-general] Model evaluation on multiclass with ROC curves

2015-06-09 Thread Andreas Mueller
There is currently no such function unfortunately. Is there a standard definition of multi-class roc-curves? Per-class? On 06/09/2015 05:43 AM, Herbert Schulz wrote: Hello everyone, is there a way in scikit-learn to evaluate my prediction ( multiclass ) with a ROC curve ? For example: One vs A

Re: [Scikit-learn-general] Incrementally Printing GridSearch Results

2015-06-10 Thread Andreas Mueller
Yes, set verbose to a nonzero value. On 06/10/2015 03:25 PM, Adam Goodkind wrote: Is it possible to print the results of a grid search as each iteration is completed? Thanks, Adam -- *Adam Goodkind * adamgoodkind.com @adamgreatkind
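
A sketch of the `verbose` setting. Note the assumptions: the import path `sklearn.model_selection` and the `cv_results_` attribute are from releases later than this thread (in 2015 the class lived in `sklearn.grid_search` and exposed `grid_scores_`), and the estimator and grid are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# verbose > 0 prints progress while the candidates are being fitted.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3, verbose=2)
search.fit(X, y)

# After fitting, a compact view of just parameters and mean scores.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```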

Re: [Scikit-learn-general] Incrementally Printing GridSearch Results

2015-06-14 Thread Andreas Mueller
it prints out a lot. Is there a way to refine the output to just the parameters and scores? Thanks, Adam On Wed, Jun 10, 2015 at 3:41 PM, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: Yes, set verbose to a nonzero value. On 06/10/2015 03:25 PM, Adam Goodkind wrote:

Re: [Scikit-learn-general] specificity calculation

2015-06-15 Thread Andreas Mueller
No, there is not. PR welcome, I think. On 06/15/2015 07:35 AM, Herbert Schulz wrote: Hello, is there a function to calculate the specificity? i know there is the classification_report function, but there is no specificity. best regards, Herb --
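
Until such a function exists, specificity for a binary problem can be read off the confusion matrix. A minimal sketch; `specificity_score` is a hypothetical helper name, not a scikit-learn function, and labels are assumed to be 0 (negative) and 1 (positive):

```python
from sklearn.metrics import confusion_matrix


def specificity_score(y_true, y_pred):
    """Hypothetical helper: true negative rate, TN / (TN + FP)."""
    # With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp)


y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]
spec = specificity_score(y_true, y_pred)
print(spec)
```

Here two of the three negatives are predicted correctly, so the value is 2/3.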

Re: [Scikit-learn-general] normalize with nan values

2015-06-15 Thread Andreas Mueller
Hey. Not with scikit-learn but it should be about three lines in numpy to do it yourself. I would replace them with 0 for computing the norm, that is all there is, right? Andy On 06/15/2015 10:43 AM, William Correa beltran wrote: Hello, I would like to know if there is a way to normalize a
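
The "three lines in numpy" could look roughly like this. It is a sketch that assumes row-wise L2 normalization is wanted; `normalize_rows_ignoring_nan` is a made-up name:

```python
import numpy as np


def normalize_rows_ignoring_nan(X):
    """Hypothetical helper: L2-normalize rows, treating NaN as 0 in the norm."""
    X = np.asarray(X, dtype=float)
    # nansum skips NaN entries, which is the same as replacing them with 0.
    norms = np.sqrt(np.nansum(X ** 2, axis=1, keepdims=True))
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return X / norms         # NaN entries stay NaN in the output


X = np.array([[3.0, np.nan, 4.0],
              [0.0, 0.0, 0.0]])
out = normalize_rows_ignoring_nan(X)
print(out)
```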

Re: [Scikit-learn-general] differences between metrics.classification_report and "own" function

2015-06-17 Thread Andreas Mueller
Yeah, that is the rounding from using %.2f in the classification report. On 06/17/2015 09:20 AM, Joel Nothman wrote: To me, those numbers appear identical at 2 decimal places. On 17 June 2015 at 23:04, Herbert Schulz wrote: Hello everyone, i wrote a function

Re: [Scikit-learn-general] differences between metrics.classification_report and "own" function

2015-06-17 Thread Andreas Mueller
ecall/sensitivity. recall == sensitivity!? But in this matrix, the precision is my calculated sensitivity, or is the precision in this case the sensitivity? On 17 June 2015 at 15:29, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: Yeah that is the rounding of using %2f in the

Re: [Scikit-learn-general] Logistic Regression: how to set up a model for a combination of numeric and binary predictors?

2015-06-18 Thread Andreas Mueller
Hi Felix. You should be fine the way you do it. You might want to rescale the continuous values though, possibly to lie within 0 to 1, using MinMaxScaler. Cheers, Andy On 06/18/2015 05:05 AM, Felix Dreher wrote: > Dear all, > > I have a general question about running logistic regression with a >
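
A sketch of the suggested rescaling, applied to the continuous columns only. The synthetic data and the column layout are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X_cont = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # continuous predictors
X_bin = rng.randint(0, 2, size=(100, 1))                   # binary predictor
y = (X_cont[:, 0] + 20 * X_bin[:, 0] > 60).astype(int)     # made-up target

# Rescale only the continuous columns to [0, 1]; the binary column
# already lives on that range.
X_cont_scaled = MinMaxScaler().fit_transform(X_cont)
X = np.hstack([X_cont_scaled, X_bin])

clf = LogisticRegression().fit(X, y)
print(X.min(axis=0), X.max(axis=0))
```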

Re: [Scikit-learn-general] clustering large data sets

2015-06-18 Thread Andreas Mueller
On 06/18/2015 09:38 AM, Kyle Kastner wrote: > Minibatch K-means should work just fine. Alternatively there are > hebbian K-means approaches which are quite easy to implement and > should be fast (though I think it basically boils down to minibatch > K-means, I haven't looked at details of mini

Re: [Scikit-learn-general] clustering large data sets

2015-06-18 Thread Andreas Mueller
On 06/18/2015 09:48 AM, Kyle Kastner wrote: > This link should work http://www.cs.toronto.edu/~rfm/code.html > Is that faster / better than minibatch k-means? Is there a paper? -

Re: [Scikit-learn-general] clustering large data sets

2015-06-18 Thread Andreas Mueller
there is a paper ref - though it might be "too easy" to have a real paper. On Thu, Jun 18, 2015 at 9:58 AM, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: On 06/18/2015 09:48 AM, Kyle Kastner wrote: > This link should work http://www.cs.toronto.edu/~rfm/code

Re: [Scikit-learn-general] 2nd try: Re: suggestions for unequal group training

2015-06-23 Thread Andreas Mueller
Also, you should think about what your performance measure should be, and if it should be accuracy (usually it is not). AUC is often good, but you need to choose an operating point in the end. On 06/23/2015 10:58 AM, Trevor Stephens wrote: Many of the scikit-learn classifiers are equipped with a

[Scikit-learn-general] GSoC midterms

2015-06-30 Thread Andreas Mueller
Hey All. This is a friendly reminder to all the mentors and students that mid-terms are coming up this Friday. Mentors should if possible at all submit their reviews before that. It would be great to have at least parts of the projects merged by then. I was out for the last week and I'm a bit beh

Re: [Scikit-learn-general] GSoC midterms NOW!

2015-06-30 Thread Andreas Mueller
Sorry, late to my emails. Terri actually wants the mid-terms done TODAY! On 06/30/2015 02:25 PM, Andreas Mueller wrote: > Hey All. > This is a friendly reminder to all the mentors and students that > mid-terms are coming up this Friday. > Mentors should if possible at all submit t

Re: [Scikit-learn-general] Warm_start on Random Forest Classifiers

2015-06-30 Thread Andreas Mueller
It does the second. You need to feed it the same data. On 06/25/2015 11:59 AM, Rafael Calsaverini wrote: I saw the new parameter warm_start on the RandomForestClassifier class and was curious about what is its most common use. I can see two uses for it: (1) instead of fitting a huge forest in
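
The quoted question is cut off in this listing, so the wording of option (2) is not visible here. What `warm_start=True` does according to the scikit-learn documentation is keep the trees that are already fitted and add new ones on the next `fit` call. A sketch with arbitrary sizes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

forest = RandomForestClassifier(n_estimators=10, warm_start=True,
                                random_state=0)
forest.fit(X, y)
n_before = len(forest.estimators_)

# Raise n_estimators and fit again on the same data: the existing trees
# are kept and only the additional ones are grown.
forest.set_params(n_estimators=25)
forest.fit(X, y)
n_after = len(forest.estimators_)
print(n_before, n_after)
```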

Re: [Scikit-learn-general] RandomForestClassifier with warm_start and n_jobs

2015-06-30 Thread Andreas Mueller
Unless I am misremembering how warm starts are implemented (tree growers around?) the comment seems badly phrased. I think what it means to say is that warm-starting repeatedly with the number of trees increasing by increments of 1 will make the fitting serial (you only build a single tree a

Re: [Scikit-learn-general] Passing kwargs to pipeline predict

2015-06-30 Thread Andreas Mueller
Can you illustrate your use-case Michael? On 06/25/2015 02:14 PM, Joel Nothman wrote: As much as possible, parameters to a model should be specified to the class constructor, not methods, even if application is there. This has been the scikit-learn design for a while in order to enable things

Re: [Scikit-learn-general] Library of pre-trained models

2015-06-30 Thread Andreas Mueller
For most applications, this will not work, as the training data needs to come from the same distribution as your test data. Language identification is pretty simple, and training a linear classifier on n-grams should get you quite a bit. On 06/28/2015 09:58 AM, Erez Segal wrote: I was searching

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Andreas Mueller
On 06/30/2015 10:26 PM, Mathieu Blondel wrote: > Regardless, a major issue is that we still haven't figured out how to > robustly solve model persistence. > Theano uses __setstate__ and __getstate__ and they seem to be happy with that. We could add a library of "previously pickled" models to t

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Andreas Mueller
On 07/01/2015 10:27 AM, Fred Mailhot wrote: > 1) The upshot seems to be that it's a defensive patent, and in any > case the code was released under Apache 2.0, so it's fine to use. > https://code.google.com/p/word2vec/ > https://groups.google.com/forum/#!topic/word2vec-toolkit/1hID9F74_Ho >

Re: [Scikit-learn-general] Is it possible to specify the order of spliting in decision tree with scikit-learn?

2015-07-01 Thread Andreas Mueller
Not really, as that kind of defeats the purpose of learning the tree. You could build a series of stumps that first only get feature a, then feature b and then feature c. On 06/30/2015 11:37 PM, Rex wrote: Given three columns, ["A", "B", "C"], can we specify the order of splitting, so that it

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Andreas Mueller
On 07/01/2015 02:42 PM, Lars Buitinck wrote: > 2015-07-01 16:27 GMT+02:00 Fred Mailhot : >> 2) The gensim implementation predates the patenting > Does that matter? > no

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Andreas Mueller
On 07/01/2015 02:49 PM, Gael Varoquaux wrote: > On Wed, Jul 01, 2015 at 11:04:30AM -0400, Andreas Mueller wrote: >> Theano uses __setstate__ and __getstate__ and they seem to be happy with >> that. > As long as we don't change the data model that works easily, but then so

Re: [Scikit-learn-general] Planning on implementing sample_weight option for PLSRegression.fit()

2015-07-10 Thread Andreas Mueller
I think there are also some concerns about the actual implementation of the algorithm, and whether the NIPALS implementation is fast and stable enough, and the appropriate choice in all cases. On 07/07/2015 04:51 PM, Deepak Subburam wrote: > Cool. I spent some time inspecting the pls_.py source

Re: [Scikit-learn-general] Linking error with intel 13.1

2015-07-11 Thread Andreas Mueller
Hi Ben. How did you install? I'm not sure if we can get the linker flags from numpy. Is your numpy also installed with the Intel compiler? Thanks, Andy On 07/11/2015 11:45 AM, Fulton, Ben wrote: > Hi, > > I tried to install sklearn on our cluster. It installed successfully, but > when running w

Re: [Scikit-learn-general] PDF User's Guide for 0.16.2

2015-07-11 Thread Andreas Mueller
There is no 0.16.2. The current version is 0.16.1 On 07/09/2015 08:57 AM, Dale Smith wrote: Hello, when can we expect a PDF version of the User’s Guide for 0.16.2? https://sourceforge.net/projects/scikit-learn/files/documentation/ Thanks very much. *Dale Smith, Ph.D.* Data Scientist ​ http:

Re: [Scikit-learn-general] Linking error with intel 13.1

2015-07-11 Thread Andreas Mueller
n the intel/13.1 libraries. > > -- > Ben Fulton > Research Technologies > Scientific Applications and Performance Tuning > Indiana University > E-Mail: beful...@iu.edu > > -Original Message- > From: Andreas Mueller [mailto:t3k...@gmail.com] > Sent: Saturday,

Re: [Scikit-learn-general] Decsion tree regression -- mean squared error or variance reduction

2015-07-11 Thread Andreas Mueller
Maybe we should add a note that this is equivalent to variance reduction? Still have not found the time for your thesis :-/ On 07/09/2015 03:30 PM, Sebastian Raschka wrote: > Thanks, I think my confusion came from the fact that they use x_i as target > variable, and I was thinking of feature/at

Re: [Scikit-learn-general] Fwd: Anomaly detection twitter algorithm

2015-07-11 Thread Andreas Mueller
This looks like it is specific to sequence data, right? We don't really deal with sequence data in sklearn. On 07/06/2015 01:10 AM, Arie Agranonik wrote: Hi all, I was wondering if anyone has implemented the twitter anomaly detection algorithm in scikit-learn? From what I see it's currently i

Re: [Scikit-learn-general] Estimators of RAKEL and (Ensemble) Classifier Chain for multilabel proposal

2015-07-11 Thread Andreas Mueller
I think classifier chains would be a nice addition. I am not very familiar with the different variants you implemented, though. On 07/10/2015 09:32 AM, Al wrote: > I should probably have put a link to the current implementation as it is > now. The link to this project (purged, i removed the vario

Re: [Scikit-learn-general] Linking error with intel 13.1

2015-07-11 Thread Andreas Mueller
pthread'] > library_dirs = > ['/N/soft/rhel6/intel/13.1.2/composer_xe_2013.4.183/mkl/lib/intel64/'] > define_macros = [('SCIPY_MKL_H', None)] > include_dirs = > ['/N/soft/rhel6/intel/13.1.2/composer_xe_2013.4.183/mkl/include'] > None

Re: [Scikit-learn-general] About contributing code

2015-07-28 Thread Andreas Mueller
Hi Gryllos. Before contributing a new feature (which is usually a major undertaking) it is usually a good idea to get started by working on known issues; have a look at the issue tracker. I'm not familiar with the feature line approach. Can you elaborate and provide a reference? Please see the

Re: [Scikit-learn-general] About contributing code

2015-07-28 Thread Andreas Mueller
PM, Andreas Mueller wrote: Hi Gryllos. Before contributing a new feature (which is usually a major undertaking) it is usually a good idea to get started by working on known issues; have a look at the issue tracker. I'm not fam

Re: [Scikit-learn-general] does sklearn rbm scale well with sparse high dimensional features

2015-07-28 Thread Andreas Mueller
Have a look at Russ Salakhutdinov's thesis for work on density modelling. The problem is that it is impossible to compute the partition function, and therefore you can only get unnormalized densities. On 07/27/2015 12:49 PM, Mika S wrote: Thanks, this is helpful. I have seen RBMs only in pret

Re: [Scikit-learn-general] Contribution

2015-07-28 Thread Andreas Mueller
What do you mean by "you need topics which are to be implemented from scratch"? Need for what? Do you want to help the project or do you want to implement an algorithm? If you like you could try to improve the speed of the gradient boosting trees. I think this is a worthwhile but non-trivial u

Re: [Scikit-learn-general] Code contribution: Supervised PCA

2015-07-28 Thread Andreas Mueller
Hi Stylianos. Can you give a bit more background on the model? It seems fairly well-cited but I haven't really seen it in practice. Is it still state of the art? The main purpose seems to be a particular type of regularization, right, not supervised dimensionality reduction? How does this compar

Re: [Scikit-learn-general] Problem with user-defined kernel in SVM

2015-07-28 Thread Andreas Mueller
For reference, it was answered there: http://stackoverflow.com/questions/31599624/user-defined-svm-kernel-with-scikit-learn On 07/23/2015 07:33 PM, Vincent Leclère wrote: Hello everybody, I'm encountering a trouble fact dealing with sklearn.svm user defined kernel. Here is a minimal example

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-28 Thread Andreas Mueller
I'd be happy with adding Poisson loss to more models, though I think it would be more natural to first add it to GLM before GBM ;) If the addition is straightforward, I think it would be a nice contribution nevertheless. 1) for the user to do np.exp(gbmpoisson.predict(X)) is not acceptable. Th

Re: [Scikit-learn-general] Big Data Mining

2015-07-28 Thread Andreas Mueller
My personal recommendation is to consider other options if your data is >1TB, but it highly depends on your application. Gael and Olivier, you also use it for larger data, right? On 07/24/2015 03:25 AM, Gael Varoquaux wrote: > Because this is a question that comes up often, I have tried to give

Re: [Scikit-learn-general] Contribution

2015-07-28 Thread Andreas Mueller
Have a look at https://github.com/scikit-learn/scikit-learn/pull/5041 btw. On 07/28/2015 01:36 PM, Sreenivas Raghavan wrote: Thank you for the idea. i will start right away. On Tue, Jul 28, 2015 at 11:41 PM, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: What do you mean by

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-28 Thread Andreas Mueller
isson model. But maybe "Poisson loss" in machine learning is unrelated to the Poisson distribution or a Poisson model with E(y| x) = exp(x beta). ? Josef On Tue, Jul 28, 2015 at 2:46 PM, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: I'd be happy with addin

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-29 Thread Andreas Mueller
bout predict_proba(X, at_y=some_integer)? >> >> However, this is also mean that we can't use predict_proba to detect >> classifiers anymore... >> Another solution would be to introduce a new method >> predict_proba_at(X, y=some_integer)... >> >> Ma

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-29 Thread Andreas Mueller
conditional probability and not a > conditional likelihood (the quantities on the right-hand side of > conditioning are fixed and integrating over all y would be 1). > > On 29.07.2015 16:16, Andreas Mueller wrote: >> Shouldn't that be "score_samples"? >> Well, it

Re: [Scikit-learn-general] Code contribution: Supervised PCA

2015-07-29 Thread Andreas Mueller
nce of lots of variables. Best regards, Stelios 2015-07-28 19:16 GMT+01:00 Andreas Mueller <mailto:t3k...@gmail.com>>: Hi Stylianos. Can you give a bit more background on the model? It seems fairly well-cited but I haven't really seen it in practice. Is it still s

Re: [Scikit-learn-general] KMedoids algorithm in Scikit-Learn

2015-07-30 Thread Andreas Mueller
I think KMedoids has come up before. One issue is that it doesn't really scale to large n_samples, right? There is an implementation mentioned here: https://github.com/scikit-learn/scikit-learn/issues/3799 Do you use it because you have a custom distance matrix? On 07/30/2015 02:27 PM, Sebastia

Re: [Scikit-learn-general] Implementation of DBCLASD for clustering

2015-07-31 Thread Andreas Mueller
Hi Sebastian. Have you seen this used much recently? How does it compare against DBSCAN, BIRCH, OPTICS or just KMeans? Cheers, Andy On 07/31/2015 10:28 AM, Sebastián Palacio wrote: Hello all, I've been investigating clustering algorithms with special interest in non-parametric methods and,

Re: [Scikit-learn-general] KMedoids algorithm in Scikit-Learn

2015-07-31 Thread Andreas Mueller
Cool. Including the code in scikit-learn is often a bit of a process but it might indeed be interesting. You could just start with a pull request - or publish a gist if you don't think you'll have time to work on the inclusion and leave that part to someone else. Cheers, Andy On 07/31/2015 0

Re: [Scikit-learn-general] KMedoids algorithm in Scikit-Learn

2015-07-31 Thread Andreas Mueller
<https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Partitioning_Around_Medoids_%28PAM%29> - CLARA (Clustering for Large Applications): https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA Best, Sebastian On Jul 31, 2015, at

Re: [Scikit-learn-general] Implementation of DBCLASD for clustering

2015-08-03 Thread Andreas Mueller
AN and OPTICS: in terms of size of the dataset; it copes with noise (as oppose to DBSCAN, BIRCH and K-Means) and it has a complexity of O(3n^2) which compares with DBSCAN's O(n^2) Regards, Sebastian On 31 July 2015 at 18:43, Andreas Mueller <mailto:t3k...@gmail.com>> wrote:

Re: [Scikit-learn-general] About contributing code

2015-08-03 Thread Andreas Mueller
Thank you for your time. I will continue watching the issue page and maybe help with something. Best Regards, Prokopis On Tue, Jul 28, 2015 at 8:43 PM, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: Hi Gryllos. Before contributing a new feature (which is usually a major

Re: [Scikit-learn-general] Weird memory error

2015-08-04 Thread Andreas Mueller
Just to make sure, you are actually loading different files, not the same file over and over again, right? It seems an odd place for a memory error. Which version of scikit-learn are you using? What is ``len(j_indices)``? On 08/04/2015 10:18 AM, Maria Gorinova wrote: Hello, (I think I might

Re: [Scikit-learn-general] boosting: false-positives versus false-negatives

2015-08-04 Thread Andreas Mueller
Hi Simon. In general in scikit-learn you could use class weights to make one class more important than the other. Unfortunately that is not implemented for AdaBoost yet. You can however use the sample_weight parameter of the fit method, and create sample weights either by hand based on the class

Re: [Scikit-learn-general] AUC realy low

2015-08-04 Thread Andreas Mueller
You should select the other column from predict_proba for auc. On 08/04/2015 10:54 AM, Herbert Schulz wrote: Thanks for the answer! hmm its possible, I just make a little example: auc is [0.952710670069, 0.01890450385597026, 0.0059624156214325846, 0.05391726570661811] expected is [0.0,
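
To make the column choice concrete, a sketch on synthetic data: the columns of `predict_proba` follow `clf.classes_`, and the AUC should be computed from the column of the positive class. The classifier and data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)

# Columns are ordered like clf.classes_; pick the one for the positive label.
pos_col = list(clf.classes_).index(1)
auc_pos = roc_auc_score(y, proba[:, pos_col])

# Using the other column ranks the samples the wrong way round.
auc_wrong = roc_auc_score(y, proba[:, 1 - pos_col])
print(auc_pos, auc_wrong)
```

An AUC far below 0.5 is usually a sign that the wrong column was used.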

Re: [Scikit-learn-general] Weird memory error

2015-08-04 Thread Andreas Mueller
\feature_extraction\text.py as shown in the exception trace. The version I'm using is 0.15.2 (I think...) Best, Maria On 4 August 2015 at 16:30, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: Just to make sure, you are actually loading different files, not the same file

Re: [Scikit-learn-general] Multiple normal scenario for one-class SVM

2015-08-04 Thread Andreas Mueller
Hi Ady. Are you selecting parameters separately for the two models in the separate case? Btw, if you are modelling a single normal, maybe EllipticEnvelope would work better. Best, Andy On 08/04/2015 01:07 PM, Ady Wahyudi Paundu wrote: > Hi all, > > How am I supposed to work with multiple set of

Re: [Scikit-learn-general] Weird memory error

2015-08-04 Thread Andreas Mueller
lp. Best, Maria On 4 August 2015 at 17:24, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: Thanks Maria. What I was asking was that you could use the debugger to see what len(j_indices) is when it crashes. I'm not sure if there were improvements to this code

Re: [Scikit-learn-general] Multiple normal scenario for one-class SVM

2015-08-04 Thread Andreas Mueller
Thank you for your suggestion, I will look into it. > > Regards, > Ady > > On 8/5/15, Andreas Mueller wrote: >> Hi Ady. >> Are you selecting parameters separately for the two models in the >> separate case? >> Btw, if you are modelling a single normal, maybe E

Re: [Scikit-learn-general] Weird memory error

2015-08-04 Thread Andreas Mueller
at 18:26, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: That array would take about 700mb of ram. Do you have that much available? Btw, you could work around this issue probably by using HashingVectorizer instead of CountVectorizer. Yes, I've got plenty of memory, e
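
A sketch of the suggested swap. `HashingVectorizer` is stateless and keeps no vocabulary in memory, at the price of not being able to map columns back to terms. The documents and `n_features` are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog", "the dog barks"]

# CountVectorizer has to build and store a vocabulary during fit.
counts = CountVectorizer().fit_transform(docs)

# HashingVectorizer maps tokens to a fixed number of columns by hashing,
# so there is nothing to fit and the output width is known up front.
hashed = HashingVectorizer(n_features=2 ** 18).transform(docs)
print(counts.shape, hashed.shape)
```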

Re: [Scikit-learn-general] boosting: false-positives versus false-negatives

2015-08-05 Thread Andreas Mueller
s with class weights. Am I > missing something ? > > Perhaps my approach is completely wrong and I should > be doing something else like regression or something. > > Many thanks, > > Simon. > > > > On Tue, 04 Aug 2015 11:36:31 -0400 > Andreas Mueller wrote: > &g

Re: [Scikit-learn-general] contributing

2015-08-06 Thread Andreas Mueller
Hey Jaret. It is usually easier to discuss these things on the github issue tracker. Which is your pull request? Just ask there. For the doctests you can do "make test-doc" that will run nosetests with the appropriate options. For the whitespace, there is an option to ignore whitespace changes.

Re: [Scikit-learn-general] contributing

2015-08-06 Thread Andreas Mueller
like to make sure I get off on the right foot. Thanks so much for your help and responses. -Jaret On Thu, Aug 6, 2015 at 9:49 AM, Andreas Mueller <mailto:t3k...@gmail.com>> wrote: Hey Jaret. It is usually easier to discuss these things on the github issue tracker. W

Re: [Scikit-learn-general] scikit-learn Truck Factor

2015-08-12 Thread Andreas Mueller
https://peerj.com/preprints/1233 > > We calculated the TF for scikit-learn and obtained a value of 7. > > The developers responsible for this TF are: > > Fabian Pedregosa - author of 22% of the files > Gael varoquaux -

Re: [Scikit-learn-general] scikit-learn Truck Factor

2015-08-12 Thread Andreas Mueller
d a value of 7. The developers responsible for this TF are: Fabian Pedregosa - author of 22% of the files Gael varoquaux - author of 13% of the files Andreas Mueller - author of 12% of the files Olivier Grisel - author of 10% of the files Lars Buitinck - author of 10% o

Re: [Scikit-learn-general] scikit-learn Truck Factor

2015-08-12 Thread Andreas Mueller
scikit-learn and obtained a value of 7. The developers responsible for this TF are: Fabian Pedregosa - author of 22% of the files Gael varoquaux - author of 13% of the files Andreas Mueller - author of 12% of the files Olivier Grisel - author of 10% of the

Re: [Scikit-learn-general] scikit-learn-0.17.dev0-py3.4: self-tests output

2015-08-12 Thread Andreas Mueller
It's a bit strange. Maybe a consequence of refactoring of common tests. Can you check which module they were in? On 08/11/2015 03:28 PM, Sergio Rojas wrote: > This time, the only strange thing is the difference in the number of > tests performed before and after installing sklearn: > > 1.- sel

Re: [Scikit-learn-general] Question on the code for Decision Trees

2015-08-13 Thread Andreas Mueller
For C you should definitely check out this: https://github.com/ajtulloch/sklearn-compiledtrees/ It's linked here btw ;) http://scikit-learn.org/dev/related_projects.html On 08/13/2015 01:04 PM, Simon Burton wrote: > Surprisingly, I am working on a similar code generation project, > with the targe

Re: [Scikit-learn-general] Question on the code for Decision Trees

2015-08-13 Thread Andreas Mueller
't > have time to get it to work. > > Dale Smith, Ph.D. > > -Original M

Re: [Scikit-learn-general] Encoding a categorical variable that appears in multiple features

2015-08-14 Thread Andreas Mueller
Why do you think one-hot will be an "explosion"? In your example, the vector would be length 8 (if there are values from a to f, that is, you gave the largest possible sets). On 08/14/2015 09:01 AM, federico vaggi wrote: Hi, Simple example: Let's say that I have a binary classification task

Re: [Scikit-learn-general] Encoding a categorical variable that appears in multiple features

2015-08-14 Thread Andreas Mueller
case, 8*8. I just realized however, that since the order does not matter, and I just want to indicate the presence or absence of a categorical feature in a set, I can simply use two vectors (stacked together) of length n_categories (or 2*8). On Fri, 14 Aug 2015 at 16:04 Andreas Mueller <ma

Re: [Scikit-learn-general] Hello message

2015-08-17 Thread Andreas Mueller
arious guidelines. Once you are ready to start making some changes, simply visit the repo's Issues page <https://github.com/scikit-learn/scikit-learn/issues> to look for things to do. As Andreas Mueller told me, a good place to start is with issues labeled "bug"/"need contri

Re: [Scikit-learn-general] Suggestion for Multiclass.py !

2015-08-18 Thread Andreas Mueller
Did you get an error when inputting a 1d X? Which version of scikit-learn are you on? X should really always be 2d. Unfortunately that is currently inconsistent, and will be fixed soon. On 08/18/2015 10:43 AM, Mathieu Blondel wrote: Hi Othman, Please send such comments to the mailing-list.
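
For predicting on one sample, reshaping it to a 2d array with a single row sidesteps the inconsistency. A sketch; the wrapped estimator and the data are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

single = X[0]                      # 1d, shape (n_features,)
single_2d = single.reshape(1, -1)  # 2d, shape (1, n_features)

pred = clf.predict(single_2d)
print(single.shape, single_2d.shape, pred)
```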

Re: [Scikit-learn-general] How to extract the decision tree rule of each leaf node into Pandas Dataframe query?

2015-08-18 Thread Andreas Mueller
I'm not aware of any ready-made code. But you can just get the boolean matrix by using ``apply`` and a one-hot encoder. Why are you interested in a single leaf? The query seems to be able to return "only" a single boolean. It is probably more efficient to traverse the full tree for each data po
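A minimal sketch of the ``apply`` plus one-hot encoder suggestion, assuming a recent scikit-learn release where ``OneHotEncoder`` accepts arbitrary integer codes and exposes ``categories_``. The dataset and tree depth are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: a small tree on iris, only to have something to inspect.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# apply() returns, for every sample, the index of the leaf it ends up in.
leaf_ids = tree.apply(X)

# One-hot encode the leaf indices: one column per leaf, one True per row.
encoder = OneHotEncoder()
membership = encoder.fit_transform(leaf_ids.reshape(-1, 1)).toarray().astype(bool)

# Column j corresponds to the leaf with node id encoder.categories_[0][j].
leaf_node_ids = encoder.categories_[0]
print(membership.shape, leaf_node_ids)
```

Each column of ``membership`` can then be used as a boolean mask over the rows, for instance to index a DataFrame built from the same data.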

Re: [Scikit-learn-general] Suggestion to Have multiclass.py allow prediction over one sample only !

2015-08-18 Thread Andreas Mueller
Hi. I just replied to the thread above, maybe you weren't subscribed to the ml yet. Did you get an error when inputting a 1d X? Which version of scikit-learn are you on? X should really always be 2d. Unfortunately that is currently inconsistent, and will be fixed soon. So yes, that will be

Re: [Scikit-learn-general] Suggestion to Have multiclass.py allow prediction over one sample only !

2015-08-18 Thread Andreas Mueller
Out[2]: array([0]) # Proper output Regards, Othman Soufan PhD Candidate Mathematical and Computer Sciences and Engineering King Abdullah University of Science and Technology Thuwal 23955-6900 KAUST Mail Box # 2620 Kingdom of Saudi Arabia Tel.: (+966) 506134003 On Tue, Aug 18, 2015 at 6:54 PM, An

Re: [Scikit-learn-general] Inspection of Classifications

2015-08-24 Thread Andreas Mueller
I think these are really easy to write for a single use-case, and hard to be generally useful. Why do you think pipelines make it hard? You know you can extract the estimators from the steps, right? def feature_importances_pipeline(pipe): extractor = pipe.steps[0][1] linear_model = pipe
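The snippet above is cut off, so the following is not the original function but an independent sketch of the same idea: pull the fitted steps out of a pipeline and pair feature names with linear coefficients. The function name `linear_weights_from_pipeline`, the toy documents, and the choice of ``CountVectorizer`` with ``LogisticRegression`` are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def linear_weights_from_pipeline(pipe):
    """Map each extracted feature name to its coefficient (binary problem assumed)."""
    extractor = pipe.steps[0][1]      # first step: the fitted feature extractor
    linear_model = pipe.steps[-1][1]  # last step: the fitted linear model
    # vocabulary_ maps term -> column index; sort the terms by column index.
    names = sorted(extractor.vocabulary_, key=extractor.vocabulary_.get)
    return dict(zip(names, linear_model.coef_.ravel()))


# Toy data, purely illustrative.
docs = ["good movie", "bad movie", "good plot", "bad acting"]
labels = [1, 0, 1, 0]

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(docs, labels)

weights = linear_weights_from_pipeline(pipe)
for term, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(term, round(w, 3))
```

This works for one specific pipeline layout (extractor first, linear model last), which is in line with the point that such helpers are easy to write per use case and hard to generalize.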

Re: [Scikit-learn-general] help installing scikit-learn (and scipy) on Cygwin

2015-08-24 Thread Andreas Mueller
On 08/20/2015 08:42 PM, Sebastian Raschka wrote: > 99% of the people who are doing any kind of research with Python are > using it I'm curious to see the stats on that ;) Also, if you're happy with the distribution versions of packages, linux is usually really simple. It's only these two odd o

Re: [Scikit-learn-general] zeroed sample_weights vector for SVC?

2015-08-24 Thread Andreas Mueller
Zeroing all weights doesn't really make sense; zeroing some weights should result in ignoring these samples. If it doesn't, it's a bug. Please report. On 08/22/2015 07:41 AM, olologin wrote: > Hello folks, I just want to ask, is it correct to provide partially > (entirely) zeroed sample_weights

Re: [Scikit-learn-general] About C50

2015-08-24 Thread Andreas Mueller
Is there any reason you want C5.0 instead of CART? On 08/22/2015 04:25 PM, Omar Andrés Zapata Mesa wrote: Thank you Artem. Best Regards, Omar. On Sat, Aug 22, 2015 at 6:29 AM, Artem wrote: Do you mean C5.0 which is further development of C4.5 tree algorith

Re: [Scikit-learn-general] Persisting models

2015-08-24 Thread Andreas Mueller
On 08/19/2015 12:37 AM, Sebastian Raschka wrote: >> if the unpickling failed, >> >what would you do? > One lesson “scientific research” taught me is to store the code and dataset > along with a “make” file under version control (git):). I would just run my > make file to re-construct the mode

Re: [Scikit-learn-general] Persisting models

2015-08-24 Thread Andreas Mueller
I think the real solution is to provide backward-compatible ``__getattr__`` and ``__setattr__``. Theano seems able to do that (at least that is what I was told). It is unclear whether we want to do this. If we want to do this, we probably only want it post 1.0. On 08/19/2015 02:35 AM, Joel Nothm

Re: [Scikit-learn-general] Persisting models

2015-08-24 Thread Andreas Mueller
Agreed—this is exactly the type of use case I want to support. Pickling won't work here, but using HDF5 like MNE does would probably be close to ideal (thanks to Chris Holdgraf for the heads-up): I'm not sure how this solves the issue, can you elaborate? You still need to map the old data structu

[Scikit-learn-general] Tests against reference implementations, speed regression tests

2015-08-25 Thread Andreas Mueller
Hey all. I will soon have some student dev resources and I'm pondering how to best use them. Apart from the hundreds of issues, one thing I was thinking about adding is more tests against reference implementations, and having speed regression tests. For the reference implementations, we could h

Re: [Scikit-learn-general] Turning on sample weights for linear_model.LogisticRegression

2015-08-27 Thread Andreas Mueller
I think it would be fine to enable it now without support in all solvers. On 8/27/2015 11:29 AM, Valentin Stolbunov wrote: Joel, I see you've done some work in that PR. Is an additional review all that's needed there? Looks like changes in Logistic Regression CV broke the original contribution

Re: [Scikit-learn-general] How to do tree Pruning with scikit-learn?

2015-08-31 Thread Andreas Mueller
You will not get results close to ensembles with pruning (unless your dataset is very specific). You can probably do your node filtering on ensembles, too. On 08/30/2015 03:44 PM, Rex X wrote: Jacob, I agree with both of your points about the ensemble methods. They can give quite good predicti

Re: [Scikit-learn-general] scikit-learn, would like to contribute

2015-08-31 Thread Andreas Mueller
Hi Pieter. Please keep this kind of discussion on the mailing list. Any single contributor might be busy. All the easy and "need contributor" issues are great places to start to get you familiar with the library. Is there something particular that interests you? Best, Andy On 08/31/2015 02:48

Re: [Scikit-learn-general] introducing myself: starting to contribute to scikit-learn

2015-08-31 Thread Andreas Mueller
Hi Pieter. Welcome to the project. There are many open issues. Check the "easy" and "needs contributor" tags. Also, it might be possible to pick up old pull requests that got stalled. Best, Andy On 08/29/2015 11:47 PM, Pieter de Jong wrote: Hi all, I hope to contribute to scikit-learn in the
