Re: [scikit-learn] is Scikit-Learn the right choice for my project

2022-10-08 Thread Brown J.B. via scikit-learn
Dear Mike, Just my two cents about your inquiry, as I am strictly a user of scikit-learn and have been for many years. From your description of the application context, I would say that scikit-learn is perfectly fine. However, I would suggest being aware that a monolithic model incorporating all data (as is

Re: [scikit-learn] [ANNOUNCEMENT] scikit-learn 1.0 release

2021-09-26 Thread Brown J.B. via scikit-learn
Congratulations to all of those who volunteered so much effort over so many years to achieve a 1.0 release. In my experience in research academia and now in industry, scikit-learn is such a workhorse relied on by many individuals and companies, and the many who donated their efforts have made it possible

Re: [scikit-learn] random forests and multi-class probability

2021-07-27 Thread Brown J.B. via scikit-learn
On Tue, Jul 27, 2021 at 12:03 Guillaume Lemaître wrote: > As far as I remember, `precision_recall_curve` and `roc_curve` do not > support multi-class. They are designed to work only with binary > classification. > Correct, the TPR-FPR curve (ROC) was originally intended for tuning a free parameter, in signal
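As noted in the thread, `roc_curve` accepts only binary labels. One common workaround for a multi-class problem, sketched below on made-up data, is to binarize the labels and compute one curve per class (one-vs-rest):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

# Hypothetical 3-class labels and per-class scores.
y = np.array([0, 1, 2, 1, 0, 2, 1, 0])
scores = np.random.RandomState(0).rand(8, 3)

# roc_curve itself is binary-only, so binarize the labels
# and compute one ROC curve per class (one-vs-rest).
y_bin = label_binarize(y, classes=[0, 1, 2])
for k in range(3):
    fpr, tpr, thresholds = roc_curve(y_bin[:, k], scores[:, k])
    print(f"class {k}: AUC = {auc(fpr, tpr):.3f}")
```

Averaging these per-class curves across folds is then a separate design decision (macro vs. micro averaging), which the thread leaves open.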

Re: [scikit-learn] Drawing contours in KMeans

2020-12-09 Thread Brown J.B. via scikit-learn
Dear Mahmood, Andrew's solution with a circle will guarantee you render an image in which every point is covered by some circle. However, if the data contains outliers or artifacts, you might get circles which are excessively large and distort the image you want. For example, imagine if there

Re: [scikit-learn] Presented scikit-learn to the French President

2020-12-06 Thread Brown J.B. via scikit-learn
Congratulations to all developers and contributors to scikit-learn, from core-devs to webmasters, documentation checkers and commenters, and other facilitators! Keeping a project alive takes a substantial amount of vision and hard work, and scikit-learn is a mature ecosystem because of the vision

Re: [scikit-learn] Opinion on reference mentioning that RF uses weak learners

2020-08-16 Thread Brown J.B. via scikit-learn
> As previously mentioned, a "weak learner" is just a learner that barely performs better than random. To follow up on what the definition of a random learner refers to, does it mean one of the following contexts? (1) Classification: a learner which uniformly samples from one of the N endpoints in the

Re: [scikit-learn] Understanding max_features parameter in RandomForestClassifier

2020-03-10 Thread Brown J.B. via scikit-learn
Regardless of the number of features, each DT estimator is given only a subset of the data. Each DT estimator then uses the features to derive decision rules for the samples it was given. With more trees and fewer examples, you might get similar or identical trees, but that is not the norm. Pardon
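The mechanism described above, in which each tree sees a bootstrap sample of the rows while `max_features` limits the columns examined per split, can be illustrated on a toy forest (all sizes here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# bootstrap=True: each tree fits on a resampled subset of the rows.
# max_features="sqrt": only about sqrt(20) ~ 4 features are
# considered as split candidates at each node.
clf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                             bootstrap=True, random_state=0).fit(X, y)
print(len(clf.estimators_))  # 50 independently grown trees
```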

Re: [scikit-learn] Why ridge regression can solve multicollinearity?

2020-01-08 Thread Brown J.B. via scikit-learn
Just for convenience: Marquardt, Donald W., and Ronald D. Snee. "Ridge regression in practice." *The American Statistician* 29, no. 1 (1975): 3-20. https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1975.10479105

Re: [scikit-learn] SVM-RFE

2019-12-04 Thread Brown J.B. via scikit-learn
After showing interest, let's see if I can't actually succeed for once. On Thu, Dec 5, 2019 at 1:14 Andreas Mueller wrote: > PR welcome ;) > > > On 12/3/19 11:02 PM, Brown J.B. via scikit-learn wrote: > > On Tue, Dec 3, 2019 at 5:36 Andreas Mueller wrote: > >> It does provide the ranking of feature

Re: [scikit-learn] SVM-RFE

2019-12-03 Thread Brown J.B. via scikit-learn
On Tue, Dec 3, 2019 at 5:36 Andreas Mueller wrote: > It does provide the ranking of features in the ranking_ attribute and it > provides the cross-validation accuracies for all subsets in grid_scores_. > It doesn't provide the feature weights for all subsets, but that's > something that would be easy to add if

Re: [scikit-learn] SVM-RFE

2019-11-25 Thread Brown J.B. via scikit-learn
On Sat, Nov 23, 2019 at 2:12 Andreas Mueller wrote: > I think you can also use RFECV directly without doing any wrapping. > > Your request to do performance checking of the steps of SVM-RFE is a > pretty common task. > > Yes, RFECV works well (and I should know as an appreciative long-time user ;-) ), but

Re: [scikit-learn] SVM-RFE

2019-11-19 Thread Brown J.B. via scikit-learn
Dear Malik, Your request to do performance checking of the steps of SVM-RFE is a pretty common task. Since the contributors to scikit-learn have done a great job making the interface to RFE easy to use, the only real work required from you would be to build a small wrapper function that: (a) computes
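One possible shape for the small wrapper described here (the function name, the chosen subset sizes, and the use of a linear-kernel SVC are all illustrative assumptions, not the thread's actual code):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def rfe_step_performance(X, y, sizes):
    """(a) select each feature-subset size with RFE, then
    (b) report cross-validated accuracy at that subset size."""
    scores = {}
    for n in sizes:
        selector = RFE(SVC(kernel="linear"), n_features_to_select=n)
        X_sel = selector.fit_transform(X, y)
        scores[n] = cross_val_score(SVC(kernel="linear"), X_sel, y, cv=3).mean()
    return scores

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
scores_by_size = rfe_step_performance(X, y, sizes=[2, 5, 10])
print(scores_by_size)
```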

Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 25

2019-10-13 Thread Brown J.B. via scikit-learn
Please show respect and refinement when addressing the contributors and users of scikit-learn. Gael's statement is perfect -- complexity does not imply better prediction. The choice of estimator (and algorithm) depends on the structure of the model desired for the data presented. Estimator

Re: [scikit-learn] Test Sample Size

2019-07-22 Thread Brown J.B. via scikit-learn
Dear Milton, It is just my opinion based on many experiences, but if you want to stress-test your estimator, make your test set at least as big as, if not bigger than, the training set. Sincerely, J.B. On Mon, Jul 22, 2019 at 22:18 Milton Pifano wrote: > Dear scikit-learn subscribers. > > I am working on a
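That advice translates into a one-line split: a 50/50 split makes the held-out test set exactly as large as the training set (the dataset below is synthetic, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)

# test_size=0.5 makes the held-out set as big as the training set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
print(len(X_tr), len(X_te))  # 200 200
```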

Re: [scikit-learn] Scikit Learn in a Cray computer

2019-06-28 Thread Brown J.B. via scikit-learn
> where you can see "ncpus = 1" (I still do not know why 4 lines were printed) > (total of 40 nodes) and each node has 1 CPU and 1 GPU! > #PBS -l select=1:ncpus=8:mpiprocs=8 > aprun -n 4 p.sh ./ncpus.py You can request 8 CPUs from a job scheduler, but if each node the script runs on

Re: [scikit-learn] How use get_depth

2019-06-17 Thread Brown J.B. via scikit-learn
Perhaps you mean: DecisionTreeRegressor.tree_.max_depth, where DecisionTreeRegressor.tree_ is available after calling fit()? On Mon, Jun 17, 2019 at 22:29 Wendley Silva wrote: > Hi all, > > I tried several ways to use the get_depth() method from > DecisionTreeRegressor, but I always get the same error: > >
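A quick check of that suggestion. Note that `tree_` only exists after `fit()`, and that `get_depth()` was added later (scikit-learn 0.21) and returns the same value, which likely explains the error on older installations:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
reg = DecisionTreeRegressor(random_state=0).fit(X, y)

# tree_ is set by fit(); its max_depth is the depth of the fitted tree.
print(reg.tree_.max_depth)
```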

Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-05 Thread Brown J.B. via scikit-learn
On Wed, Jun 5, 2019 at 10:43 Brown J.B. wrote: > Contrast this to Pearson Product Moment Correlation (R), where the fit of > the line has no requirement to go through the origin of the fit. > Not sure what I was thinking when I wrote that. Pardon the mistake; I'm fully aware that Pearson R

Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-04 Thread Brown J.B. via scikit-learn
Dear CW, > Linear regression is not a black-box. I view prediction accuracy as an > overkill on interpretable models. Especially when you can use R-squared, > coefficient significance, etc. > Following on my previous note about being cautious with cross-validated evaluation for classification,

Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-02 Thread Brown J.B. via scikit-learn
> > As far as I understand: Holding out a test set is recommended if you > aren't entirely sure that the assumptions of the model are held (gaussian > error on a linear fit; independent and identically distributed samples). > The model evaluation approach in predictive ML, using held-out data,

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-02 Thread Brown J.B. via scikit-learn
As a user, I feel that (2) "sklearn.plot.XXX.plot_YYY" best allows for future expansion of sub-namespaces in a tractable way that is also easy to understand during code review. For example, sklearn.plot.tree.plot_forest() or sklearn.plot.lasso.plot_* . Just my opinion. J.B. On Tue, Apr 2, 2019 at 23:40

Re: [scikit-learn] Any way to tune the parameters better than GridSearchCV?

2018-12-24 Thread Brown J.B. via scikit-learn
> Take random forest as an example: if I give estimator values from 10 to 10000 (10, > 100, 1000, 10000) in grid search, > based on the result I found estimator=100 is the best, but I don't know > whether lower or greater than 100 is better. > How should I decide? Brute force, or any tools better than
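A common answer to this question is a two-stage search: a coarse decade-scale sweep to find the right order of magnitude, then a finer sweep around the winner. A hypothetical sketch on synthetic data (the grids and sizes are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, random_state=0)

# Stage 1: coarse decade-scale sweep.
coarse = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [10, 100, 1000]}, cv=3).fit(X, y)
best = coarse.best_params_["n_estimators"]

# Stage 2: refine around the coarse winner.
fine = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [best // 2, best, best * 2]}, cv=3).fit(X, y)
print(fine.best_params_)
```

Randomized or Bayesian search (e.g., RandomizedSearchCV) is another option when the grid grows large.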

Re: [scikit-learn] Difference between linear model and tree-based regressor?

2018-12-13 Thread Brown J.B. via scikit-learn
"Elements of Statistical Learning" is on my bookshelf, but even so, that was a great summary! J.B.

Re: [scikit-learn] make all new parameters keyword-only?

2018-11-15 Thread Brown J.B. via scikit-learn
As an end-user, I would strongly support the idea of future enforcement of keyword arguments for new parameters. In my group, we hold a standard that we develop APIs where _all_ arguments must be given by keyword (a slightly pedantic style, but one that has been shown to have benefits). Initialization/call-time

Re: [scikit-learn] Can I use Sklearn Porter to Generate C++ version of Random Forest Predict function

2018-11-01 Thread Brown J.B. via scikit-learn
I, too, would be curious to know if anyone has any experience in doing this. J.B. On Thu, Nov 1, 2018 at 2:07 Chidhambaranathan R wrote: > Hi, > > I'd like to know if I can use sklearn_porter to generate the C++ version > of Random Forest Regression Predict function. If sklearn_porter doesn't > work, is there

Re: [scikit-learn] Dimension Reduction - MDS

2018-10-11 Thread Brown J.B. via scikit-learn
would be able to be processed > with 64G of RAM. Is there something to configure to allow this computation? > The typical datasets I use can have around 200-300k rows with a few columns > (usually up to 3). > Best regards,

Re: [scikit-learn] Dimension Reduction - MDS

2018-10-09 Thread Brown J.B. via scikit-learn
Hello Guillaume, You are computing a distance matrix of shape 7000x7000 to generate MDS coordinates. That is 49,000,000 entries, plus overhead for a data structure. If you try with a very small (e.g., 100 sample) data file, does your code employing MDS work? As you increase the number of
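The quadratic memory growth behind this advice is easy to see on a deliberately small subsample (sizes and data here are arbitrary):

```python
import numpy as np
from sklearn.manifold import MDS

# MDS materializes the full n x n dissimilarity matrix, so memory
# grows quadratically with sample count; start with a small subsample.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)  # 100 samples -> only a 100 x 100 matrix internally
coords = MDS(n_components=2, random_state=0).fit_transform(X)
print(coords.shape)  # (100, 2)
```

Doubling the sample count quadruples the matrix, which is why a full-size run can exhaust RAM that a subsample run does not.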

Re: [scikit-learn] Bootstrapping in sklearn

2018-09-18 Thread Brown J.B. via scikit-learn
Resampling is a very important and interesting contribution which relates very closely to my primary research in applied ML for chemical development. I'd be very interested in contributing documentation and learning new things along the way, but I potentially would be perceived as slow because of

Re: [scikit-learn] Using GPU in scikit learn

2018-08-08 Thread Brown J.B. via scikit-learn
Dear Ta Hoang, GPU processing can be done with Python libraries such as TensorFlow, Keras, or Theano. However, sklearn's implementation of RandomForestClassifier is outstandingly fast, and a previous effort to develop GPU RandomForest abandoned their efforts as a result:

Re: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem

2018-07-21 Thread Brown J.B. via scikit-learn
Hello Makis, 2018-07-20 23:44 GMT+09:00 Andreas Mueller : > There is no single roc curve for a 3 class problem. So what do you want to > plot? > > On 07/20/2018 10:40 AM, serafim loukas wrote: > > What I want to do is to plot the average(mean) ROC across Folds for a > 3-class case. > > The

Re: [scikit-learn] sample_weights in RandomForestRegressor

2018-07-16 Thread Brown J.B. via scikit-learn
Dear Thomas, Your strategy for model development is built on the assumption that the SAR (structure-activity relationship) is a continuous manifold constructed for your compound descriptors. However, SARs for many proteins in drug discovery or chemical biology are not continuous (consider kinase

Re: [scikit-learn] PyCM: Multiclass confusion matrix library in Python

2018-06-05 Thread Brown J.B. via scikit-learn
tion at runtime from the CLI, and can then tailor results to my audiences. :) I'll keep this video's explanation in mind - thanks for the reference. Cheers, J.B. > On 6/4/18 11:56 AM, Brown J.B. via scikit-learn wrote: > > Hello community, > > I wonder if there's someth

Re: [scikit-learn] PyCM: Multiclass confusion matrix library in Python

2018-06-04 Thread Brown J.B. via scikit-learn
Hello community, I wonder if there's something similar for the binary class case where, >> the prediction is a real value (activation) and from this we can also >> derive >> - CMs for all prediction cutoff (or set of cutoffs?) >> - scores over all cutoffs (AUC, AP, ...) >> > AUC and AP are by

Re: [scikit-learn] Announcing modAL: a modular active learning framework

2018-02-19 Thread Brown J.B. via scikit-learn
Dear Dr. Danka, This is a very nice generalization you have built. My group and I have published multiple papers on using active learning for drug discovery model creation, built on top of scikit-learn. (2017) Future Med Chem : https://dx.doi.org/10.4155/fmc-2016-0197 (*Most downloaded paper of

Re: [scikit-learn] A necessary feature for Decision trees

2018-01-03 Thread Brown J.B. via scikit-learn
Dear Yang Li, > Neither the classificationTree nor the regressionTree supports categorical features. That means the decision tree models can only accept continuous features. Consider either manually encoding your categories in bitstrings (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100),
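The bitstring encoding suggested above corresponds to one-hot encoding, which scikit-learn provides directly (the category names below are just the examples from the message):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One-hot ("bitstring") encoding of a categorical column.
sites = np.array([["Facebook"], ["Twitter"], ["Google"], ["Twitter"]])
enc = OneHotEncoder().fit(sites)        # categories are sorted alphabetically
X_enc = enc.transform(sites).toarray()  # dense 0/1 matrix, one column per category
print(X_enc)
```

The resulting 0/1 columns can then be fed to the tree estimators alongside any continuous features.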

Re: [scikit-learn] MLPClassifier as a feature selector

2017-12-06 Thread Brown J.B. via scikit-learn
I am also very interested in knowing if there is a sklearn cookbook solution for getting the weights of a one-hidden-layer MLPClassifier. J.B. 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis : > Greetings, > > I want to train a MLPClassifier with one hidden layer and use it as a >
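For what it's worth, the fitted weight matrices are exposed on the estimator's `coefs_` attribute after `fit()`; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000,
                    random_state=0).fit(X, y)

# coefs_[0]: input-to-hidden weights; coefs_[1]: hidden-to-output weights.
print(mlp.coefs_[0].shape)  # (10, 5)
```

How best to turn those raw weights into a feature ranking (e.g., summing absolute weights per input) is a separate, debatable step.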

Re: [scikit-learn] scikit-learn Digest, Vol 19, Issue 37

2017-10-17 Thread Brown J.B. via scikit-learn
2017-10-18 12:18 GMT+09:00 Ismael Lemhadri : > How about editing the various chunks of code concerned to add the option > to scale the parameters, and set it by default to NOT scale? This would > make what happens clear without the redundancy Andreas mentioned, and would >

Re: [scikit-learn] Remembering Raghav, our friend, and a scikit-learn contributor

2017-10-06 Thread Brown J.B. via scikit-learn
This is truly, truly sad news. Leaving the home country you grew up in to find your way in a new language and culture takes considerable effort, and to thrive at it takes even more effort. He was to be commended for that. I think many of us knew of his enthusiasm for the project and benefited

Re: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0

2017-08-25 Thread Brown J.B.
In drug discovery, if you are lucky you might get hit compounds 10% of the time. So if you do ML-based drug discovery, your datasets are strongly imbalanced. It seems the imbalanced package would be perfect for this area. J.B. 2017-08-25 10:53 GMT+02:00 Jaques Grobler :

Re: [scikit-learn] Fwd: [SciPy-User] EuroSciPy 2017 call for contributions - extension of deadline

2017-06-30 Thread Brown J.B.
Dear Communities, Would it be of interest to the audience to hear a discussion on the state of the art in computational drug discovery model development, which my team and I have done by building on top of Scikit-learn and Matplotlib? Everyday language description of the work and concept:

Re: [scikit-learn] Random Forest max_features and boostrap construction parameters interpretation

2017-06-05 Thread Brown J.B.
ss 0, 0 <= x <= 10 is class 1, > and x > 10 is class 0 again). > > Let me know if you have any other questions! > > On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B. <jbbr...@kuhp.kyoto-u.ac.jp> > wrote: > >> Dear community, >> >> This is a question

[scikit-learn] Random Forest max_features and boostrap construction parameters interpretation

2017-06-05 Thread Brown J.B.
Dear community, This is a question regarding how to interpret the documentation and semantics of the random forest constructors. In forest.py (of version 0.17 which I am still using), the documentation regarding the number of features to consider states on lines 742-745 of the source code that

Re: [scikit-learn] SVC data normalisation

2017-05-08 Thread Brown J.B.
Dear Mamun, *A.* 80% features are binary [ 0 or 1 ] > *B.* 10% are integer values representing counts / occurrences. > *C.* 10% are continuous values between different ranges. > > My prior understanding was that decision tree based algorithms work better > on mixed data types. In this particular
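For mixed binary/count/continuous features like those described, a common hedge with SVC is to standardize everything before the distance-based RBF kernel sees it, so the unscaled count and continuous columns cannot dominate the binary ones. A sketch on synthetic data (not the poster's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Standardize every feature, then fit the RBF-kernel SVC.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
print(round(model.score(X, y), 3))
```

Using a Pipeline also ensures the scaler is fit only on training folds during cross-validation, avoiding leakage.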

[scikit-learn] Note of appreciation to Scikit-learn team

2017-03-21 Thread Brown J.B.
To all organizers, developers, and maintainers involved in the Scikit-learn project, I would like to share a recent article that researchers from MIT, ETH, and Kyoto University (myself) have published about building efficient models for drug discovery and pharmaceutical data mining. In short, it

Re: [scikit-learn] random forests using grouped data

2016-12-01 Thread Brown J.B.
Hello Thomas, I don't personally know of any algorithm that works on collections of groupings, but why not first test a simple control model, meaning can you achieve a satisfactory model by simply concatenating all 48 scores per sample and building a forest the standard way? If not, what context

Re: [scikit-learn] ANN Scikit-learn 0.18 released

2016-10-02 Thread Brown J.B.
Hello community, Congratulations on the release of 0.18! While I'm merely a casual user and wish I could contribute more often, I thank everyone for their time and efforts! 2016-10-01 1:58 GMT+09:00 Andreas Mueller : > We've got a lot in the works already for 0.19.