Re: [scikit-learn] is SciKit-Learn the right choice for my project

2022-10-08 Thread Brown J.B. via scikit-learn
Dear Mike,

Just my two cents about your inquiry, as I have been strictly a user of
scikit-learn for many years.

- From your description of the application context, I would say that
scikit-learn is perfectly fine. However, be aware that a monolithic model
incorporating all data (the image that TV wrongfully projects) is not a
valid strategy. Stratifying the data into contextually correct subgroups
and then running scikit-learn, for example to estimate the extent of
predictability during development, will be helpful.
- Duplicate checking should be easy to do using standard Python objects
(sets, or counting within lists), once the context determines how the
objects are vectorized/featurized. I don't see a need to force scikit-learn
into that task (a consolidated sketch covering these points follows the list).
- Missing-data detection could be implemented by context-specific object
classes that you design, which could contain something like a __bool__()
method that tells you whether the object has all of the required data
populated and configured.
- Detection of errors in configuration could be either explicitly driven by
logic (of the context, again something that returns a bool indicating that
an object is configured correctly), or potentially statistically derived as
outliers from the given background data distribution, in which case
scikit-learn could be of help. If there are too many variates (thousands or
tens of thousands) in your data that prohibit explicit logic, then
scikit-learn's Random Forest algorithms might be perfectly fine and provide
verification through visualization of Decision Tree rules.
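
To make those three points concrete, here is a minimal consolidated sketch
(every class, field, and the simulated feature matrix are illustrative
assumptions on my part, not part of your system; IsolationForest stands in
for the generic idea of statistically derived outliers):

import numpy as np
from sklearn.ensemble import IsolationForest

class DeviceRecord:
    # Hypothetical context-specific object; field names are illustrative.
    REQUIRED = ("name", "location", "device_type", "settings")

    def __init__(self, **fields):
        self.fields = fields

    def __bool__(self):
        # True only when every required field is present and non-empty.
        return all(self.fields.get(key) for key in self.REQUIRED)

    def key(self):
        # Hashable featurization used for duplicate checking.
        return tuple(sorted(self.fields.items()))

records = [DeviceRecord(name="ap-17", location="floor-3",
                        device_type="wifi", settings="fw-1.2"),
           DeviceRecord(name="ap-17", location="floor-3",
                        device_type="wifi", settings="fw-1.2"),  # a duplicate
           DeviceRecord(name="router-2", location="rack-1",
                        device_type="router")]                   # incomplete

seen, duplicates = set(), []
for record in records:
    if record.key() in seen:
        duplicates.append(record)
    else:
        seen.add(record.key())

incomplete = [record for record in records if not record]

# Statistical route: a numeric feature matrix (one row per device) checked
# for outliers against the background distribution; with labeled
# misconfigurations, RandomForestClassifier plus sklearn.tree.plot_tree
# would expose the learned decision rules.
X = np.random.RandomState(0).normal(size=(500, 8))       # stand-in features
flags = IsolationForest(random_state=0).fit_predict(X)   # -1 marks outliers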

Hope this helps,
J.B. Brown

2022年10月8日(土) 10:59 Mike Oliver :

> Dear Sirs,
>
>
>
> I am evaluating SciKit-Learn for a new project.  I am hoping to find an AI
> Machine Learning package that can take a large dataset of objects that have
> various object types and attributes.  These objects are typically related
> to other objects, such as a server to a Wifi device, or two network routers
> to each other, etc.  When these objects are set up, data is gathered about
> where they are located, what their settings are, the device type, etc.
>
>
>
> With large organizations there can be thousands of these objects and tens
> of thousands of relationships, descriptions, settings, etc.  My hope is
> that with machine learning we can detect when an object is missing, or
> configured in error, or duplicates.
>
>
>
> The question is, will SciKit-Learn help with this problem? I understand
> that we will have to train it to identify what to look for, and then act on
> what was found and predicted to be the solution algorithm or instructions.
>
>
>
> Thanks for your help,
>
>
>
> Great looking product and already have the tutorial up and running and
> have installed it in my Django platform.
>
>
>
> Mike


Re: [scikit-learn] [ANNOUNCEMENT] scikit-learn 1.0 release

2021-09-26 Thread Brown J.B. via scikit-learn
Congratulations to all of those who volunteered so much effort over so many
years to achieve a 1.0.
In my experience in research academia and now in industry, scikit-learn is
such a workhorse relied on by many individuals and companies, and the many
who donated their efforts have made it possible for so many people
(including me!) to enhance their research or development.
With deep appreciation,
J.B. Brown
Principal Scientist, Boehringer-Ingelheim Pharma, Germany
Associate Professor, Kyoto University, Japan


2021年9月26日(日) 13:23 Joel Nothman :

> Thanks to some amazing work from the core development team, as well as our
> triagers, and other contributors. We finally got here!
>
> On Sat, 25 Sept 2021 at 03:13, Olivier Grisel 
> wrote:
>
>> Yeah!
>>
>> Thank you so much Adrin for all your efforts in getting this release out!
>>
>> Congratulations everyone, time to celebrate!
>>
>> --
>> Olivier


Re: [scikit-learn] random forests and multil-class probability

2021-07-27 Thread Brown J.B. via scikit-learn
2021年7月27日(火) 12:03 Guillaume Lemaître :

> As far as I remember, `precision_recall_curve` and `roc_curve` do not
> support multi-class. They are designed to work only with binary
> classification.
>

Correct; the TPR-FPR curve (ROC) was originally intended for tuning a free
parameter in signal detection, and is a binary-type metric.
For ML problems, it lets you tune/determine an estimator's output-value
threshold (e.g., a probability or a raw discriminant value such as in SVM)
to arrive at an optimized model that will be used to give a final,
binary-discretized answer in new prediction tasks.

Hope this helps, J.B.


Re: [scikit-learn] Drawing contours in KMeans

2020-12-09 Thread Brown J.B. via scikit-learn
Dear Mahmood,

Andrew's solution with a circle will guarantee you render an image in which
every point is covered within some circle.

However, if the data contains outliers or artifacts, you might get circles
which are excessively large and distort the image you want.
For example, imagine if there were a single red point in Andrew's image at
the coordinate (3,10); then, the resulting circle would cover all points in
the entire plot, which is unlikely to be what you want.
You could potentially generate a density estimate for each class and then
have matplotlib render the contour lines (e.g., solutions of where
estimates have a specific value), but as was said, this is not the job of
Kmeans, but rather of general data analysis.

The ellipsoid solution proposed to you is, in a sense, a middle ground
between these two solutions (the circles and the density plots).
You could adjust the (4 or 5) parameters of an ellipsoid to cover "most" of
the points for a particular class and tolerate that the ellipsoids don't
cover a few outliers or artifacts (e.g., the coordinate (3,10) I mentioned
above).
The resulting functional forms of the ellipses might be more precise than
circles and less complex than density contours, and might lead to
actionable knowledge depending on your context/domain.

Hope this helps.
J.B. Brown

2020年12月9日(水) 21:08 Mahmood Naderan :

> >Mebbe principal components analysis would suggest an
> >ellipsoid containing "most" of the points in a "cloud".
>
> Sorry I didn't understand. Can you explain more?
> Regards,
> Mahmood
>
>
>
>
> On Wed, Dec 9, 2020 at 8:55 PM The Helmbolds via scikit-learn <
> scikit-learn@python.org> wrote:
>
>> Mebbe principal components analysis would suggest an ellipsoid containing
>> "most" of the points in a "cloud".
>>
>>
>>
>>
>> "You won't find the right answers if you don't ask the right questions!"
>> (Robert Helmbold, 2013)
>>
>>
>> On Wednesday, December 9, 2020, 12:22:49 PM MST, Andrew Howe <
>> ahow...@gmail.com> wrote:
>>
>>
>> Ok, I see. Well the attached notebook demonstrates doing this by simply
>> finding the maximum distance from each centroid to its datapoints and
>> drawing a circle using that radius. It's simple, but will hopefully at
>> least point you in a useful direction.
>> [image: image.png]
>> Andrew
>>
>> <~~~>
>> J. Andrew Howe, PhD
>> LinkedIn Profile 
>> ResearchGate Profile 
>> Open Researcher and Contributor ID (ORCID)
>> 
>> Github Profile 
>> Personal Website 
>> I live to learn, so I can learn to live. - me
>> <~~~>
>>
>>
>> On Wed, Dec 9, 2020 at 12:59 PM Mahmood Naderan 
>> wrote:
>>
>> I mean a circle/contour to group the points in a cluster for better
>> representation.
>> For example, if there are six clusters, it will be more meaningful to
>> group large data points in a circle or contour.
>>
>> Regards,
>> Mahmood
>>
>>
>>
>>
>> On Wed, Dec 9, 2020 at 11:49 AM Andrew Howe  wrote:
>>
>> Contours generally indicate a third variable - often a probability
>> density. Kmeans doesn't provide density estimates, so what precisely would
>> you want the contours to represent?
>>
>> Andrew
>>
>> <~~~>
>> J. Andrew Howe, PhD
>> LinkedIn Profile 
>> ResearchGate Profile 
>> Open Researcher and Contributor ID (ORCID)
>> 
>> Github Profile 
>> Personal Website 
>> I live to learn, so I can learn to live. - me
>> <~~~>
>>
>>
>> On Wed, Dec 9, 2020 at 9:41 AM Mahmood Naderan 
>> wrote:
>>
>> Hi
>> I use the following code to highlight the cluster centers with some red
>> dots.
>>
>> kmeans = KMeans(n_clusters=6, init='k-means++', max_iter=100, n_init=10,
>> random_state=0)
>> pred_y = kmeans.fit_predict(a)
>> plt.scatter(a[:,0], a[:,1])
>> plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
>> s=100, c='red')
>> plt.show()
>>
>> I would like to know if it is possible to draw contours over the
>> clusters. Is there any way for that?
>> Please let me know if there is a function or option in KMeans.
>>
>> Regards,
>> Mahmood
>>
>>

Re: [scikit-learn] Presented scikit-learn to the French President

2020-12-06 Thread Brown J.B. via scikit-learn
Congratulations to all developers and contributors to scikit-learn, from
core-devs to webmasters, documentation checkers and commenters, and other
facilitators!
Keeping a project alive takes a substantial amount of vision and hard work,
and scikit-learn is a mature ecosystem because of the vision and hard work
of everyone.

This recognition by the French government is fantastic -- congratulations
Gael to you, your leadership, and your team!
In fact, scikit-learn is probably more ubiquitous than anyone individually
recognizes, because for all of the contributions in github and mailing
lists, there are probably many more people who are benefitting from
applying it to their individual scenarios.

I myself am a very appreciative user. :)
Sincere regards and congratulations again,
J.B. Brown



2020年12月5日(土) 17:53 Sebastian Raschka :

> This is really awesome news! Thanks a lot to everyone developing
> scikit-learn. I am just wrapping up another successful semester, teaching
> students ML basics. Most coming from an R background, they really loved
> scikit-learn and appreciated its ease of use and well-thought-out API.
>
> Best,
> Sebastian
>
> > On Dec 5, 2020, at 9:28 AM, Jitesh Khandelwal 
> wrote:
> >
> > Amazing, inspiring! Kudos to the sklearn team.
> >
> > On Sat, Dec 5, 2020, 4:30 AM Gael Varoquaux <
> gael.varoqu...@normalesup.org> wrote:
> > Hi scikit-learn community,
> >
> > Today, I presented some efforts in digital health to the French president
> > and part of the government. As these efforts were partly powered by
> > scikit-learn (and the whole pydata stack, to be fair), the team in charge
> > of the event had printed a huge scikit-learn logo behind me:
> > https://twitter.com/GaelVaroquaux/status/1334959438059462659 (terrible
> > mobile-phone picture)
> >
> > I would have liked to get a picture with the president and the logo, but
> > it seems that they are releasing only a handful of pictures :(.
> Anyhow...
> >
> >
> > Thanks to the community! This is a huge success. For health topics (we
> > are talking nationwide electronic health records) the ability to build on
> > an independent open-source stack is extremely important. We, as a wider
> > community, are building something priceless.
> >
> > Cheers,
> >
> > Gaël


Re: [scikit-learn] Opinion on reference mentioning that RF uses weak learners

2020-08-16 Thread Brown J.B. via scikit-learn
> As previously mentioned, a "weak learner" is just a learner that barely
> performs better than random.

To continue with what the definition of a random learner refers to: does it
mean one of the following contexts?
(1) Classification: a learner which uniformly samples from one of the N
endpoints in the training data (e.g., the set of unique values in the
response vector "y").
(2) Regression: a learner which uniformly samples from the range of values
in the endpoint/response vector (e.g., uniform sampling from [min(y),
max(y)]).

Should even more context be explicitly declared (e.g., not uniform sampling
but any distribution sampler)?

J.B.


Re: [scikit-learn] Understanding max_features parameter in RandomForestClassifier

2020-03-10 Thread Brown J.B. via scikit-learn
Regardless of the number of features, each DT estimator is given only a
subset of the data.
Each DT estimator then uses the features to derive decision rules for the
samples it was given.
With more trees and few examples, you might get similar or identical trees,
but that is not the norm.

Pardon brevity.
J.B.

2020年3月11日(水) 14:11 aditya aggarwal :

> For RandomForestClassifier in sklearn
>
> max_features parameter gives the max no of features for split in random
> forest which is sqrt(n_features) as default. If m is sqrt of n, then no of
> combinations for DT formation is nCm. What if nCm is less than n_estimators
> (no of decision trees in random forest)?
>
> *example:* For n = 7, max_features is 3, so nCm is 35, meaning 35 unique
> combinations of features for decision trees. Now for n_estimators = 100,
> will the remaining 65 trees have repeated combination of features? If so,
> won't trees be correlated introducing bias in the answer?
>
>
> Thanks
>
> Aditya Aggarwal


Re: [scikit-learn] Why ridge regression can solve multicollinearity?

2020-01-08 Thread Brown J.B. via scikit-learn
Just for convenience:

> Marquardt, Donald W., and Ronald D. Snee. "Ridge regression in practice."
> *The American Statistician* 29, no. 1 (1975): 3-20.
>

https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1975.10479105


Re: [scikit-learn] SVM-RFE

2019-12-04 Thread Brown J.B. via scikit-learn
I certainly am guilty of only commenting in the mailing list and not
engaging more via GitHub! :)
(Much like many of you PIs on this list, the typical
ActualWork-GrantWriting-ReportWriting-InvitedLectures-RealLifeParenting
cycle eats the day away.)

While I've failed previously to get involved after showing interest, let's
see if I can't actually succeed for once.

2019年12月5日(木) 1:14 Andreas Mueller :

> PR welcome ;)
>
>
> On 12/3/19 11:02 PM, Brown J.B. via scikit-learn wrote:
>
> 2019年12月3日(火) 5:36 Andreas Mueller :
>
>> It does provide the ranking of features in the ranking_ attribute and it
>> provides the cross-validation accuracies for all subsets in grid_scores_.
>> It doesn't provide the feature weights for all subsets, but that's
>> something that would be easy to add if it's desired.
>>
>
> I would guess that there is some population of the user base that would
> like to track the per-iteration feature weights.
> It would appear to me that a straightforward (un-optimized) implementation
> would be to place a NaN value for a feature once it is eliminated, so that a
> numpy.ndarray can be returned and immediately dumped to
> matplotlib.pcolormesh or other visualization routines in various libraries.
>
> Just an idea.
>
> J.B.
>


Re: [scikit-learn] SVM-RFE

2019-12-03 Thread Brown J.B. via scikit-learn
2019年12月3日(火) 5:36 Andreas Mueller :

> It does provide the ranking of features in the ranking_ attribute and it
> provides the cross-validation accuracies for all subsets in grid_scores_.
> It doesn't provide the feature weights for all subsets, but that's
> something that would be easy to add if it's desired.
>

I would guess that there is some population of the user base that would
like to track the per-iteration feature weights.
It would appear to me that a straightforward (un-optimized) implementation
would be to place a NaN value for a feature once it is eliminated, so that a
numpy.ndarray can be returned and immediately dumped to
matplotlib.pcolormesh or other visualization routines in various libraries.

Just an idea.

J.B.


Re: [scikit-learn] SVM-RFE

2019-11-25 Thread Brown J.B. via scikit-learn
2019年11月23日(土) 2:12 Andreas Mueller :

> I think you can also use RFECV directly without doing any wrapping.
>
> Your request to do performance checking of the steps of SVM-RFE is a
> pretty common task.
>
>
Yes, RFECV works well (and I should know as an appreciative long-time user
;-) ), but does it actually provide a mechanism (accessors) for tracing the
step-by-step feature weights and predictive ability as the features are
continually reduced?
(Or perhaps it's because I'm looking at 0.20.1 and 0.21.2 documentation...?)

J.B.


Re: [scikit-learn] SVM-RFE

2019-11-19 Thread Brown J.B. via scikit-learn
Dear Malik,

Your request to do performance checking of the steps of SVM-RFE is a pretty
common task.

Since the contributors to scikit-learn have done a great job of making the
interface to RFE easy to use, the only real work required from you would be
to build a small wrapper function that:
(a) computes the step sizes you want to output prediction performances for,
and
(b) loops over the step sizes, making each step size the n_features_to_select
parameter of RFE (and built from the remaining features), making
predictions from a SVM retrained (and possibly optimized) on the reduced
feature set, and then outputting your metric(s) appropriate to your problem.

Tracing the feature weights is then done by accessing the "coef_" attribute
of the linear SVM trained.
This can be output in loop step (b) as well.

> where each time 10% of the features are removed.
> How one can get the accuracy overall the levels of the elimination stages.
> For example, I want to get performance over 1000 features, 900 features,
> 800 features,,2 features, 1 feature.
>

Just a technicality, but by 10% reduction you would have
1000, 900, 810, 729, 656, ... .
Either way, if you allow your wrapper function to take a pre-computed list
of feature sizes, you can flexibly change between a systematic way or a
context-informed way of specifying feature sizes (and resulting weights) to
trace.
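
A minimal sketch of such a wrapper (the estimator, the metric, and the list
of feature sizes are placeholders to adapt to your problem):

from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def rfe_trace(X_train, y_train, X_test, y_test, feature_sizes):
    results = []
    for n in feature_sizes:               # e.g., [1000, 900, 810, 729, ...]
        rfe = RFE(LinearSVC(), n_features_to_select=n).fit(X_train, y_train)
        score = accuracy_score(y_test, rfe.predict(X_test))
        weights = rfe.estimator_.coef_    # SVM retrained on the reduced subset
        results.append((n, score, weights))
    return results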

Hope this helps.

J.B. Brown
Kyoto University Graduate School of Medicine


Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 25

2019-10-13 Thread Brown J.B. via scikit-learn
Please show respect and refinement when addressing the contributors and
users of scikit-learn.

Gael's statement is perfect -- complexity does not imply better prediction.
The choice of estimator (and algorithm) depends on the structure of the
model desired for the data presented.
Estimator superiority cannot be proven in a context- and/or data-agnostic
fashion.

J.B.


2019年10月13日(日) 6:13 Mike Smith :

> "Second complexity does not
> > imply better prediction. "
>
> Complexity doesn't imply prediction? Perhaps you're having a translation
> error.
>
> On Sat, Oct 12, 2019 at 2:04 PM  wrote:
>
>> Message: 1
>> Date: Sat, 12 Oct 2019 14:04:12 -0700
>> From: Mike Smith 
>> To: scikit-learn@python.org
>> Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 24
>> Message-ID:
>> > 4lry2njvjwvvr4rg...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> "...  > If I should expect good results on a pc, scikit says that needing
>> gpu power is
>> > obsolete, since certain scikit models perform better (than ml designed
>> for gpu)
>> > that are not designed for gpu, for that reason. Is this true?"
>>
>> Where do you see this written? I think that you are looking for overly
>> simple stories that are not true."
>>
>> Gael, see the below from the scikit-learn FAQ. You can also find this
>> yourself at the main FAQ:
>>
>> [image: 2019-10-12 14_00_05-Frequently Asked Questions ? scikit-learn
>> 0.21.3 documentation.png]
>>
>>
>> On Sat, Oct 12, 2019 at 9:03 AM  wrote:
>>
>> > Message: 1
>> > Date: Fri, 11 Oct 2019 13:34:33 -0400
>> > From: Gael Varoquaux 
>> > To: Scikit-learn mailing list 
>> > Subject: Re: [scikit-learn] Is scikit-learn implying neural nets are
>> > the best regressor?
>> > Message-ID: <20191011173433.bbywiqnwjjpvs...@phare.normalesup.org>
>> > Content-Type: text/plain; charset=iso-8859-1
>> >
>> > On Fri, Oct 11, 2019 at 10:10:32AM -0700, Mike Smith wrote:
>> > > In other words, according to that arrangement, is scikit-learn
>> implying
>> > that
>> > > section 1.17 is the best regressor out of the listed, 1.1 to 1.17?
>> >
>> > No.
>> >
>> > First they are not ordered in order of complexity (Naive Bayes is
>> > arguably simpler than Gaussian Processes). Second complexity does not
>> > imply better prediction.
>> >
>> > > If I should expect good results on a pc, scikit says that needing gpu
>> > power is
>> > > obsolete, since certain scikit models perform better (than ml designed
>> > for gpu)
>> > > that are not designed for gpu, for that reason. Is this true?
>> >
>> > Where do you see this written? I think that you are looking for overly
>> > simple stories that are not true.
>> >
>> > > How much hardware is a practical expectation for running the best
>> > > scikit models and getting the best results?
>> >
>> > This is too vague a question for which there is no answer.
>> >
>> > Gaël

Re: [scikit-learn] Test Sample Size

2019-07-22 Thread Brown J.B. via scikit-learn
Dear Milton,

It is just my opinion based on many experiences, but if you want to
stress-test your estimator, make your test set at least as big as, if not
bigger than, the training set.

Sincerely,
J.B.

2019年7月22日(月) 22:18 Milton Pifano :

> Dear scikit-learn subscribers.
>
> I am working on a multiclass classification project and I have found
> many resources about how to deal with an imbalanced dataset for training,
> but I have not been able to find any reference on the test dataset size.
> Can anyone send some references?
>
> Thanks,
> Milton Pifano


Re: [scikit-learn] Scikit Learn in a Cray computer

2019-06-28 Thread Brown J.B. via scikit-learn
>
> where you can see "ncpus = 1" (I still do not know why 4 lines were
> printed -
>
> (total of 40 nodes) and each node has 1 CPU and 1 GPU!
>


> #PBS -l select=1:ncpus=8:mpiprocs=8
> aprun -n 4 p.sh ./ncpus.py
>

You can request 8 CPUs from a job scheduler, but if each node the script
runs on contains only one virtual/physical core, then cpu_count() will
return 1.
If that CPU supports multi-threading, you would typically get 2.

For example, on my workstation:
`--> egrep "processor|model name|core id" /proc/cpuinfo
processor : 0
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 0
processor : 1
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 1
processor : 2
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 0
processor : 3
model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
core id : 1
`--> python3 -c "from sklearn.externals import joblib;
print(joblib.cpu_count())"
4

It seems that in this situation, if you're wanting to parallelize
*independent* sklearn calculations (e.g., changing dataset or random seed),
you'll request the MPI processes from PBS as you have, but you'll need to
place the sklearn computations in a function and then take care of
distributing that function call across the MPI processes.
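
As a hedged sketch of that function-based distribution, assuming mpi4py is
available on the cluster (the generated dataset is only a stand-in):

from mpi4py import MPI
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each MPI process works on its own seed/dataset variant.
X, y = make_classification(n_samples=500, random_state=rank)
score = cross_val_score(RandomForestClassifier(random_state=rank), X, y).mean()

scores = comm.gather(score, root=0)   # collect one result per process
if rank == 0:
    print(scores)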

Then again, if the runs are independent, it's a lot easier to write a for
loop in a shell script that changes the dataset/seed and submits it to the
job scheduler to let the job handler take care of the parallel distribution.
(I do this when performing 10+ independent runs of sklearn modeling, where
models use multiple threads during calculations; in my case, SLURM then
takes care of finding the available nodes to distribute the work to.)

Hope this helps.
J.B.


Re: [scikit-learn] How use get_depth

2019-06-17 Thread Brown J.B. via scikit-learn
Perhaps you mean DecisionTreeRegressor.tree_.max_depth, where
DecisionTreeRegressor.tree_ is available after calling fit()?
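
(If I remember correctly, get_depth() only appeared in release 0.21, so on
older versions the tree_ attribute is the way to go.) For example:

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
print(reg.tree_.max_depth)   # tree_ is populated by fit(); at most 5 here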


2019年6月17日(月) 22:29 Wendley Silva :

> Hi all,
>
> I tried several ways to use the get_depth() method from
> DecisionTreeRegression, but I always get the same error:
>
> self.clf.*get_depth()*
> AttributeError: *'DecisionTreeRegressor' object has no attribute
> 'get_depth'*
>
> I researched the internet and found no solution. Any idea how to use it
> correctly?
>
> *Description of get_depth():*
>
> https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
>
> Thanks in advance.
>
> Best,
> *Wendley S. Silva*
> Universidade Federal do Ceará - Brasil
>
>  +55 (88) 3695.4608
>  wend...@ufc.br
>  www.ec.ufc.br/wendley
 Rua Cel. Estanislau Frota, 563, Centro, Sobral-CE, Brasil - CEP 62.010-560


Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-05 Thread Brown J.B. via scikit-learn
2019年6月5日(水) 10:43 Brown J.B. :

> Contrast this to Pearson Product Moment Correlation (R), where the fit of
> the line has no requirement to go through the origin of the fit.
>

Not sure what I was thinking when I wrote that.
Pardon the mistake; I'm fully aware that Pearson R is merely a coefficient
indicating the direction of a trend.


Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-04 Thread Brown J.B. via scikit-learn
Dear CW,


> Linear regression is not a black-box. I view prediction accuracy as an
> overkill on interpretable models. Especially when you can use R-squared,
> coefficient significance, etc.
>

Following on my previous note about being cautious with cross-validated
evaluation for classification, the same applies for regression.
About 20 years ago, chemoinformatics researchers pointed out the caution
needed with using CV-based R^2 (q^2) as a measure of performance.
"Beware of q2!"  Golbraikh and Tropsha, J Mol Graph Modeling (2002) 20:269
https://www.sciencedirect.com/science/article/pii/S1093326301001231

In this article, they propose to measure correlation by using both
known-VS-predicted _and_ predicted-VS-known calculations of the correlation
coefficient, and, importantly, by requiring that the regression line fitted
in both cases goes through the origin.
The resulting coefficients are checked as a pair, and the authors argue
that only if they are both high can one say that the model is fitting the
data well.

Contrast this to Pearson Product Moment Correlation (R), where the fit of
the line has no requirement to go through the origin of the fit.

I found the paper above to be helpful in filtering for more robust
regression models, and have implemented my own version of their method,
which I use as my first evaluation metric when performing regression
modelling.
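
For reference, a minimal sketch of the two through-origin coefficients as I
implement them from my reading of the paper:

import numpy as np

def r2_through_origin(y_true, y_pred):
    # Slope of the least-squares line through the origin: y_true ~ k * y_pred
    k = np.sum(y_true * y_pred) / np.sum(y_pred ** 2)
    residual = np.sum((y_true - k * y_pred) ** 2)
    total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - residual / total

# Check both directions; only a high pair suggests a well-fitting model:
# r2_through_origin(y_known, y_predicted) and
# r2_through_origin(y_predicted, y_known)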

Hope this provides you some thought.

> Prediction accuracy also does not tell you which feature is important.
>

The contributions of the scikit-learn community have yielded a great set of
tools for performing feature weighting separate from model performance
evaluation.
All you need to do is read the documentation and try out some of the
examples, and you should be ready to adapt to your situation.

J.B.


Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-02 Thread Brown J.B. via scikit-learn
>
> As far as I understand: Holding out a test set is recommended if you
> aren't entirely sure that the assumptions of the model are held (gaussian
> error on a linear fit; independent and identically distributed samples).
> The model evaluation approach in predictive ML, using held-out data, relies
> only on the weaker assumption that the metric you have chosen, when applied
> to the test set you have held out, forms a reasonable measure of
> generalised / real-world performance. (Of course this too is often not held
> in practice, but it is the primary assumption, in my opinion, that ML
> practitioners need to be careful of.)
>

Dear CW,
As Joel as said, holding out a test set will help you evaluate the validity
of model assumptions, and his last point (reasonable measure of generalised
performance) is absolutely essential for understanding the capabilities and
limitations of ML.

To add to your checklist of interpreting ML papers properly, be cautious
when interpreting reports of high performance when using 5/10-fold or
Leave-One-Out cross-validation on large datasets, where "large" depends on
the nature of the problem setting.
Results are also highly dependent on the distributions of the underlying
independent variables (e.g., 6 datapoints all with near-identical
distributions may yield phenomenal performance in cross validation and be
almost non-predictive in truly unknown/prospective situations).
Even at 500 datapoints, if independent variable distributions look similar
(with similar endpoints), then when each model is trained on 80% of that
data, the remaining 20% will certainly be predictable, and repeating that
five times will yield statistics that seem impressive.

So, again, while problem context completely dictates ML experiment design,
metric selection, and interpretation of outcome, my personal rule of thumb
is to do no more than 2-fold cross-validation (50% train, 50% predict) when
having 100+ datapoints.
Even more extreme, try 33% for training and 67% for validation (or
even 20/80).
If your model still reports good statistics, then you can believe that the
patterns in the training data extrapolate well to the ones in the external
validation data.
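
In scikit-learn terms, the harsher split is a one-liner (the generated data
is only a stand-in):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.67, random_state=0)   # 33% train, 67% validate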

Hope this helps,
J.B.


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-02 Thread Brown J.B. via scikit-learn
As a user, I feel that (2) "sklearn.plot.XXX.plot_YYY" best allows for
future expansion of sub-namespaces in a tractable way that is also easy to
understand during code review.
For example, sklearn.plot.tree.plot_forest() or sklearn.plot.lasso.plot_* .

Just my opinion.
J.B.


2019年4月2日(火) 23:40 Hanmin Qin :

> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin


Re: [scikit-learn] Any way to tune the parameters better than GridSearchCV?

2018-12-24 Thread Brown J.B. via scikit-learn
> Take random forest as example: if I give n_estimators from 10 to 10000 (10,
> 100, 1000, 10000) in grid search.
> Based on the result, I found n_estimators=100 is the best, but I don't know
> whether lower or greater than 100 is better.
> How should I decide? Brute force, or any tools better than GridSearchCV?
>

A simple but nonetheless practical solution is to
  (1) start with an upper bound on the number of trees you are willing to
accept in the model,
  (2) obtain its performance (ACC, MCC, F1, etc) as the starting reference
point,
  (3) systematically lower the number of trees (log2 scale down, fixed size
decrement, etc)
  (4) obtain the reduced forest size performance,
  (5) Repeat (3)-(4) until [performance(reference) - performance(current
forest size)] > tolerance

You can encapsulate that in a function which then returns the final model
you obtain; a minimal sketch follows below.
From the model object, the number of trees can be obtained.
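
A minimal sketch of such a function (the metric and the halving schedule are
placeholders; adapt both to your problem):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

def shrink_forest(X_train, y_train, X_valid, y_valid, start=1000,
                  tolerance=0.02):
    def fit_and_score(n_trees):
        model = RandomForestClassifier(n_estimators=n_trees,
                                       random_state=0).fit(X_train, y_train)
        return model, matthews_corrcoef(y_valid, model.predict(X_valid))

    best, reference = fit_and_score(start)   # steps (1)-(2): bound, reference
    n_trees = start // 2
    while n_trees >= 1:                      # steps (3)-(5): shrink to tolerance
        model, score = fit_and_score(n_trees)
        if reference - score > tolerance:
            break
        best = model
        n_trees //= 2
    return best   # best.n_estimators is the final tree count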

J.B.


Re: [scikit-learn] Difference between linear model and tree-based regressor?

2018-12-13 Thread Brown J.B. via scikit-learn
"Elements of Statistical Learning" is on my bookshelf, but even so, that
was a great summary!
J.B.


Re: [scikit-learn] make all new parameters keyword-only?

2018-11-15 Thread Brown J.B. via scikit-learn
As an end-user, I would strongly support the idea of future enforcement of
keyword arguments for new parameters.
In my group, we hold a standard that we develop APIs where _all_ arguments
must be given by keyword (a slightly pedantic style, but one that has shown
its benefits).
Initialization/call-time state checks are done by a class' internal methods.

As Andy said, one could consider leaving prototypical X,y as positional,
but one benefit my group has seen with full keyword parameterization is the
ability to write code for small investigations where we are more concerned
with effects from parameters rather than the data (e.g., a fixed problem to
model, and one wants to first see on the code line what the estimators and
their parameterizations were).
If one could shift the sklearn X,y to the back of a function call, it would
enable all participants in a face-to-face code review session to quickly
see the emphasis/context of the discussion and move to the conclusion
faster.

To satisfy keyword X,y as well, I would presume that the BaseEstimator
would need to have a sanity check for error-raising default X,y values --
though does it not have many checks on X,y already?
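
For the curious, the bare "*" in a signature is all that Python needs to
enforce keyword-only arguments:

class MyEstimator:
    # Everything after the bare "*" must be passed by keyword.
    def __init__(self, *, alpha=1.0, max_iter=100):
        self.alpha = alpha
        self.max_iter = max_iter

MyEstimator(alpha=0.5, max_iter=200)   # fine
# MyEstimator(0.5, 200)                # raises TypeError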

Not sure if everyone else agrees about keyword X and y, but just a thought
for consideration.

Kind regards,
J.B.

2018年11月15日(木) 18:34 Gael Varoquaux :

> I am really in favor of the general idea: it is much better to use named
> arguments for everybody (for readability, and to be less depend on
> parameter ordering).
>
> However, I would maintain that we need to move slowly with backward
> compatibility: changing in a backward-incompatible way a library brings
> much more loss than benefit to our users.
>
> So +1 for enforcing the change on all new arguments, but -1 for changing
> orders in the existing arguments any time soon.
>
> I agree that it would be good to push this change in existing models. We
> should probably announce it strongly well in advance, make sure that all
> our examples are changed (people copy-paste), wait a lot, and find a
> moment to squeeze this in.
>
> Gaël
>
> On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote:
> > We could just announce that we will be making this a syntactic
> constraint from
> > version X and make the change wholesale then. It would be less formal
> backwards
> > compatibility than we usually hold by, but we already are loose with
> parameter
> > ordering when adding new ones.
>
> > It would be great if after this change we could then reorder parameters
> to make
> > some sense!
>
> --
> Gael Varoquaux
> Senior Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone:  ++ 33-1-69-08-79-68
> http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Can I use Sklearn Porter to Generate C++ version of Random Forest Predict function

2018-11-01 Thread Brown J.B. via scikit-learn
I, too, would be curious to know if anyone has any experience in doing this.
J.B.

2018年11月1日(木) 2:07 Chidhambaranathan R :

> Hi,
>
> I'd like to know if I can use sklearn_porter to generate the C++ version
> of Random Forest Regression Predict function. If sklearn_porter doesn't
> work, is there any possible alternatives to  generate c++ implementation of
> RF Regressor Predict function?
>
> Thanks.
>
> --
> Regards,
> Chidhambaranathan R,
> PhD Student,
> Electrical and Computer Engineering,
> Utah State University


Re: [scikit-learn] Dimension Reduction - MDS

2018-10-11 Thread Brown J.B. via scikit-learn
Hi Guillaume,

The good news is that your script works as-is on smaller datasets, and
hopefully does the logic for your task correctly.

In addition to Alex's comment about data size and MDS tractability, I would
also point out a philosophical issue -- why consider MDS for such a large
dataset?
At least in two dimensions, once MDS gets beyond 1000 samples or so, the
resulting sample coordinates and their visualization are potentially highly
dispersed (e.g., like a 2D-uniform distribution) and may not lead to
interpretability.
One can move to three-dimensional MDS, but perhaps even then a few thousand
samples gets to the limit of graphical interpretability.
It very obviously depends on the relationships in your data.

Also, as you continue your work, keep in mind that the per-sample
dimensionality (number of entries in a single sample's descriptor vector)
will not be the primary determinant of the memory consumption requirements
for the MDS algorithm, because in any case you must compute (either inline
or pre-compute) the distance matrix between each pair of samples, and that
matrix stays in memory during coordinate generation (as far as I know).
So, 10 chemical descriptors (since I noticed you mentioning Dragon) or 1000
descriptors will still result in the same memory requirement for the
distance matrix, and then scaling to hundreds of thousands of samples will
eat all of the compute node's RAM.

Since you have 200k samples, you could potentially do some type of repeated
partial clustering (e.g., on random subsamples of data) to find a
reasonable number of clusters per repetition, analyze those results to make
an estimate of a number of clusters for a global clustering, and then
select a limited number of samples per cluster to use for projection to a
coordinate space by MDS.
Or a diversity selection (either by vector distance or in your case,
differing compound scaffolds) may be a way to get a quick subset and
visualize distance relationships.
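
A hedged sketch of that cluster-then-project idea (the cluster count and
per-cluster sample size are assumptions to tune against your data):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.manifold import MDS

def cluster_then_mds(X, n_clusters=50, per_cluster=20, seed=0):
    labels = MiniBatchKMeans(n_clusters=n_clusters,
                             random_state=seed).fit_predict(X)
    rng = np.random.RandomState(seed)
    keep = np.concatenate([
        rng.choice(np.where(labels == c)[0],
                   size=min(per_cluster, np.sum(labels == c)),
                   replace=False)
        for c in range(n_clusters)])
    coords = MDS(n_components=2, random_state=seed).fit_transform(X[keep])
    return keep, coords   # indices kept, and their 2D MDS coordinates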

Hope this helps.

Sincerely,
J.B. Brown

2018年10月11日(木) 20:14 Alexandre Gramfort :

> hi Guillaume,
>
> I cannot use our MDS solver at this scale. Even if you fit it in RAM
> it will be slow.
>
> I would play with https://github.com/lmcinnes/umap unless you really
> what a classic MDS.
>
> Alex
>
> On Thu, Oct 11, 2018 at 10:31 AM Guillaume Favelier
>  wrote:
> >
> > Hello J.B,
> >
> > Thank you for your quick reply.
> >
> > > If you try with a very small (e.g., 100 sample) data file, does your
> code
> > > employing MDS work?
> > > As you increase the number of samples, does the script continue to
> work?
> > So I tried the same script while increasing the number of samples (100,
> > 1000 and 10000) and it works indeed without swapping on my workstation.
> >
> > > That is 49,000,000 entries, plus overhead for a data structure.
> > I thought that even ~4.9 billion entries of doubles would be able to be processed
> > with 64G of RAM. Is there something to configure to allow this
> computation?
> >
> > The typical datasets I use can have around 200-300k rows with a few
> columns
> > (usually up to 3).
> >
> > Best regards,
> >
> > Guillaume
> >
> > Quoting "Brown J.B. via scikit-learn" :
> >
> > > Hello Guillaume,
> > >
> > > You are computing a distance matrix of shape 70000x70000 to generate
> > > MDS coordinates.
> > > That is roughly 4,900,000,000 entries, plus overhead for a data
> > > structure.
> > >
> > > If you try with a very small (e.g., 100 sample) data file, does your
> code
> > > employing MDS work?
> > > As you increase the number of samples, does the script continue to
> work?
> > >
> > > Hope this helps you get started.
> > > J.B.
> > >
> > > 2018年10月9日(火) 18:22 Guillaume Favelier :
> > >
> > >> Hi everyone,
> > >>
> > >> I'm trying to use some dimension reduction algorithm [1] on my dataset
> > >> [2] in a
> > >> python script [3] but for some reason, Python seems to consume a lot
> of my
> > >> main memory and even swap on my configuration [4] so I don't have the
> > >> expected result
> > >> but a memory error instead.
> > >>
> > >> I have the impression that this behaviour is not intended so can you
> > >> help me know
> > >> what I did wrong or miss somewhere please?
> > >>
> > >> [1]: MDS -
> > >>
> http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
> > >> [2]: dragon.csv - 69827 rows, 3 columns (x,y,z)
> > >> [3]: dragon.py - 10 lines
> > >> [

Re: [scikit-learn] Dimension Reduction - MDS

2018-10-09 Thread Brown J.B. via scikit-learn
Hello Guillaume,

You are computing a distance matrix of shape 70000x70000 to generate MDS
coordinates.
That is roughly 4,900,000,000 entries, plus overhead for a data structure.

If you try with a very small (e.g., 100 sample) data file, does your code
employing MDS work?
As you increase the number of samples, does the script continue to work?

Hope this helps you get started.
J.B.

2018年10月9日(火) 18:22 Guillaume Favelier :

> Hi everyone,
>
> I'm trying to use some dimension reduction algorithm [1] on my dataset
> [2] in a
> python script [3] but for some reason, Python seems to consume a lot of my
> main memory and even swap on my configuration [4] so I don't have the
> expected result
> but a memory error instead.
>
> I have the impression that this behaviour is not intended so can you
> help me know
> what I did wrong or miss somewhere please?
>
> [1]: MDS -
> http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
> [2]: dragon.csv - 69827 rows, 3 columns (x,y,z)
> [3]: dragon.py - 10 lines
> [4]: dragon_swap.png - htop on my workstation
>
> TAR archive:
> https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn
>
> Best regards,
>
> Guillaume Favelier
>


Re: [scikit-learn] Bootstrapping in sklearn

2018-09-18 Thread Brown J.B. via scikit-learn
Resampling is a very important and interesting contribution, which relates
very closely to my primary research in applied ML for chemical development.
I'd be very interested in contributing documentation and learning new
things along the way, but I potentially would be perceived as slow because
of juggling many projects and responsibilities.
(I failed once before at timely reviewing of a PR for multi-metric
optimization for 0.19.)
If still acceptable, please let me know, and I'm happy to try to help.

J.B.


2018年9月18日(火) 20:37 Daniel Saxton :

> Great, I went ahead and contacted Constantine.  Documentation was actually
> the next thing that I wanted to work on, so hopefully he and I can put
> something together.
>
> Thanks for the help.
>
> On Tue, Sep 18, 2018 at 2:42 AM Olivier Grisel 
> wrote:
>
>> This looks like a very useful project.
>>
>> There is also scikits-bootstraps [1]. Personally I prefer the flat
>> package namespace of resample (I am not a fan of the 'scikits' namespace
>> package) but I still think it would be great to contact the author to know
>> if he would be interested in joining efforts.
>>
>> What currently lacks from both projects is a good sphinx-based
>> documentation that explains in a couple of paragraphs with examples what
>> are the different non-parametric inference methods, what are the pros and
>> cons for each of them (sample complexity, computation complexity, kinds of
>> inference, bias, theoretical asymptotic results, practical discrepancies
>> observed in the finite sample setting, assumptions made on the distribution
>> of the data...) and ideally the doc would have reference to examples (using
>> sphinx-gallery) that would highlight the behavior of the tools in both
>> nominal and pathological cases.
>>
>> [1] https://github.com/cgevans/scikits-bootstrap
>>
>> --
>> Olivier


Re: [scikit-learn] Using GPU in scikit learn

2018-08-08 Thread Brown J.B. via scikit-learn
Dear Ta Hoang,

GPU processing can be done with Python libraries such as TensorFlow, Keras,
or Theano.

However, sklearn's implementation of RandomForestClassifier is
outstandingly fast, and a previous effort to develop a GPU RandomForest
was abandoned as a result:
https://github.com/EasonLiao/CudaTree

If you need to speed up predictions because of a large dataset, you can
combine joblib with sklearn to parallelize the predictions of the
individual trees:

from joblib import Parallel, delayed

predictions = Parallel(n_jobs=n, backend=backend)(
    delayed(your_forest_prediction_func)(tree_group, *func_arguments)
    for tree_group in tree_groups)

where n is how many parallel computations you want to execute, and backend
is either "threading" or "multiprocessing".
Typically, your_forest_prediction_func() would iterate over the collection
of trees in tree_group, using the prediction objects given in
func_arguments, within a single thread/process.

Hope this helps you parallelize and speed-up.

Sincerely,
J.B. Brown
Kyoto University Graduate School of Medicine



2018-08-09 9:50 GMT+09:00 hoang trung Ta :

> Dear all members,
>
> I am using Random Forest for classification of satellite images. I have a
> bunch of images, thus the processing is quite slow. I searched on the
> Internet and they said that a GPU can accelerate the process.
>
> I have an NVIDIA GeForce GTX 1080 Ti GPU installed in the computer
>
> Do you know how to use GPU in Scikit learn, I mean the packages to use and
> sample code that used GPU in random forest classification?
>
> Thank you very much
>
> --
> *Ta Hoang Trung (Mr)*
>
> *Master student*
> Graduate School of Life and Environmental Sciences
> University of Tsukuba, Japan
>
> Mobile:  +81 70 3846 2993
> Email :  ta.hoang-trung...@alumni.tsukuba.ac.jp
>  tahoangtr...@gmail.com
>  s1626...@u.tsukuba.ac.jp
>
> **
> *Mapping Technician*
> Department of Surveying and Mapping Vietnam
> No 2, Dang Thuy Tram street, Hanoi, Viet Nam
>
> Mobile: +84 1255151344
> Email : tahoangtr...@gmail.com
>


Re: [scikit-learn] Plot Cross-validated ROCs for multi-class classification problem

2018-07-21 Thread Brown J.B. via scikit-learn
Hello Makis,

2018-07-20 23:44 GMT+09:00 Andreas Mueller :

> There is no single roc curve for a 3 class problem. So what do you want to
> plot?
>
> On 07/20/2018 10:40 AM, serafim loukas wrote:
>
> What I want to do is to plot the average(mean) ROC across Folds for a
> 3-class case.
>
>
The prototypical ROC curve uses True Positive Rate and False Positive Rate
for its axes, so it is for 2-class problems, and not for 3+-class problems,
as Andy mentioned.
Perhaps you are wanting the mean and confidence intervals of the n-class
Cohen Kappa metric as estimated by either many folds of cross validation,
or you want to evaluate your classifier by repeated subsampling experiments
and Kappa value distribution/histogram?
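
For example, a sketch of the cross-validated Kappa route (iris as a stand-in
3-class dataset):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
kappas = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring=make_scorer(cohen_kappa_score), cv=10)
print(kappas.mean(), kappas.std())   # mean and spread across folds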

Hope this helps,
J.B.


Re: [scikit-learn] sample_weights in RandomForestRegressor

2018-07-16 Thread Brown J.B. via scikit-learn
Dear Thomas,

Your strategy for model development is built on the assumption that the SAR
(structure-activity relationship) is a continuous manifold constructed for
your compound descriptors.
However, SARs for many proteins in drug discovery or chemical biology are
not continuous (consider kinase inhibitors).

Therefore, you must make an assessment of the training data SAR to check
for the prevalence of activity cliffs.
There are at least two ways you can go about this:
  (1) Simply compute all pairwise similarities by your choice of
descriptor+metric, then identify where there are pairs (e.g.,
MACCS-Tanimoto > 0.7) with large activity differences (e.g., a K_i or IC50
difference of more than 10/50/100-fold; again, the biology of your problem
determines the right values); a sketch of this check follows after the list.
  (2) Perform many repetitions of train-test splitting on the 709 reference
molecules, look at the distribution of your evaluation metric, and see if
there is a limit in your ability to predict. If you are hitting a wall in
terms of predictability (metric performance), it's a likely sign there is
an activity cliff, and no amount of machine learning is going to be able to
overcome this. Further, trace the predictability of individual compounds to
identify those which consistently are predicted wrong.  If you combine this
with analysis (1), you can know exactly which of your chemistries are
unmodelable.
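
As a minimal sketch of analysis (1), assuming X holds binary fingerprints
(one row per molecule) and y holds log-scale activities (e.g., pIC50, so a
1.0 difference is 10-fold):

import numpy as np
from sklearn.metrics import pairwise_distances

def find_activity_cliffs(X, y, sim_cutoff=0.7, act_cutoff=1.0):
    # Tanimoto similarity equals 1 - Jaccard distance for binary vectors.
    sim = 1.0 - pairwise_distances(X.astype(bool), metric="jaccard")
    act_diff = np.abs(y[:, None] - y[None, :])
    i, j = np.where(np.triu((sim > sim_cutoff) & (act_diff > act_cutoff), k=1))
    return list(zip(i, j))   # index pairs: similar structure, divergent activity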

If you find that there are no activity cliffs in your dataset, then your
application of the assumption that chemical similarity implies biological
endpoint similarity will hold, and your experimental design is validated
because of the presence of a continuous manifold.
However, if you do have activity cliffs, then as awesome as sklearn is, it
still cannot make the computational chemistry any better.

Hope this helps you contextualize your work. Don't hesitate to contact me
if I can be of consultation.

Sincerely,
J.B. Brown
Kyoto University Graduate School of Medicine


2018-07-16 8:51 GMT+09:00 Thomas Evangelidis :

> ​​
>
>
> I am kind of confused about the use of sample_weights parameter in the
> fit() function of RandomForestRegressor. Here is my problem:
>
> I am trying to predict the binding affinity of small molecules to a
> protein. I have a training set of 709 molecules and a blind test set of 180
> molecules. I want to find those features that are more important for the
> correct prediction of the binding affinity of those 180 molecules of my
> blind test set.  My rationale is that if I give more emphasis to the
> similar molecules in the training set, then I will get higher importances
> for those features that have higher predictive ability for this specific
> blind test set of 180 molecules. To this end, I weighted the 709 training
> set molecules by their maximum similarity to the 180 molecules, selected
> only those features with high importance and trained a new RF with all 709
> molecules. I got some results but I am not satisfied. Is this the right way
> to use sample_weights in RF. I would appreciate any advice or suggested
> work flow.
>
>
> --
>
> ==
>
> Dr Thomas Evangelidis
>
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>
>   teva...@gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>


Re: [scikit-learn] PyCM: Multiclass confusion matrix library in Python

2018-06-05 Thread Brown J.B. via scikit-learn
2018-06-05 1:06 GMT+09:00 Andreas Mueller :

> Is that Jet?!
>
> https://www.youtube.com/watch?v=xAoljeRJ3lU
>
> ;)
>

Quite an entertaining presentation, and informative to the non-expert about
color theory, though I'm not sure I'd go so far as to call jet "evil" or to
say that everyone hates it.
Actually, I didn't know that the colormap known as Jet had a name; I had
reverse-engineered it to reproduce what I saw elsewhere.
I suppose I'm glad I have already built my infrastructure's version of the
metric surface plotter to allow complete color customization at runtime
from the CLI, and can then tailor results to my audiences. :)

I'll keep this video's explanation in mind - thanks for the reference.

Cheers,
J.B.



> On 6/4/18 11:56 AM, Brown J.B. via scikit-learn wrote:
>
> Hello community,
>
>>> I wonder if there's something similar for the binary class case where
>>> the prediction is a real value (activation) and from this we can also
>>> derive
>>>   - CMs for all prediction cutoff (or set of cutoffs?)
>>>   - scores over all cutoffs (AUC, AP, ...)
>>>
>> AUC and AP are by definition over all cut-offs. And CMs for all
>> cutoffs doesn't seem a good idea, because that'll be n_samples many
>> in the general case. If you want to specify a set of cutoffs, that would
>> be pretty easy to do.
>> How do you find these cut-offs, though?
>>
>>>
>>> For me, in analyzing (binary class) performance, reporting scores for
>>> a single cutoff is less useful than seeing how the many scores (tpr,
>>> ppv, mcc, relative risk, chi^2, ...) vary at various false positive
>>> rates, or prediction quantiles.
>>>
>>
> In terms of finding cut-offs, one could use the idea of metric surfaces
> that I recently proposed
> https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127
> and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc surfaces
> to determine what conditions you are willing to accept against the
> background of your prediction problem.
>
> I use these surfaces (a) to think about the prediction problem before any
> attempt at modeling is made, and (b) to deconstruct results such as
> "Accuracy=85%" into interpretations in the context of my field and the data
> being predicted.
>
> Hope this contributes a bit of food for thought.
> J.B.
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] PyCM: Multiclass confusion matrix library in Python

2018-06-04 Thread Brown J.B. via scikit-learn
Hello community,

>> I wonder if there's something similar for the binary class case where
>> the prediction is a real value (activation) and from this we can also
>> derive
>>   - CMs for all prediction cutoff (or set of cutoffs?)
>>   - scores over all cutoffs (AUC, AP, ...)
>>
> AUC and AP are by definition over all cut-offs. And CMs for all
> cutoffs doesn't seem a good idea, because that'll be n_samples many
> in the general case. If you want to specify a set of cutoffs, that would
> be pretty easy to do.
> How do you find these cut-offs, though?
>
>>
>> For me, in analyzing (binary class) performance, reporting scores for
>> a single cutoff is less useful than seeing how the many scores (tpr,
>> ppv, mcc, relative risk, chi^2, ...) vary at various false positive
>> rates, or prediction quantiles.
>>
>
In terms of finding cut-offs, one could use the idea of metric surfaces
that I recently proposed
https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127
and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc surfaces
to determine what conditions you are willing to accept against the
background of your prediction problem.
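
As noted in the quoted exchange, scoring a user-specified set of cutoffs is
easy to do by hand. A minimal sketch with stand-in labels and activations;
the cutoff values here are illustrative:

import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=200)               # stand-in labels
y_score = 0.3 * y_true + 0.7 * rng.rand(200)       # stand-in activations

for cutoff in (0.3, 0.5, 0.7):
    y_pred = (y_score >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    print("cutoff=%.1f  TPR=%.2f  TNR=%.2f  MCC=%.2f"
          % (cutoff, tpr, tnr, matthews_corrcoef(y_true, y_pred)))

Each (TPR, TNR) pair can then be placed on the metric surfaces to read off
the corresponding PPV, MCC, and so on.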

I use these surfaces (a) to think about the prediction problem before any
attempt at modeling is made, and (b) to deconstruct results such as
"Accuracy=85%" into interpretations in the context of my field and the data
being predicted.

Hope this contributes a bit of food for thought.
J.B.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Announcing modAL: a modular active learning framework

2018-02-19 Thread Brown J.B. via scikit-learn
Dear Dr. Danka,

This is a very nice generalization you have built.

My group and I have published multiple papers on using active learning for
drug discovery model creation, built on top of scikit-learn.
(2017) Future Med Chem : https://dx.doi.org/10.4155/fmc-2016-0197 (*Most
downloaded paper of the year) (Open Access)
(2017) J Comput-Aided Chem : https://dx.doi.org/10.2751/jcac.18.124  (Open
Access)
(2018) ChemMedChem : https://dx.doi.org/10.1002/cmdc.201700677

In our work, we built a similar framework to modAL, though in our framework
the iterative model building is done on a fully labeled (Y) set of
examples, and we are more interested in knowing:
  (1) How fast learning converges within some convergence criteria (e.g.,
how many drugs must be in a model, given an evaluation metric),
  (2) Which examples are picked across repeated executions of AL (e.g.,
which drugs appear to be the most informative for model construction),
  (3) How much diversity is there in the examples picked (e.g., how
different are the drugs selected by AL - visualized in the 2017
FutureMedChem paper), and
  (4) How dependent are actively learned models on descriptors (e.g., do
different representations affect the speed of performance convergence?).

I think some, if not all, of these questions are also answerable in your
framework.
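
For readers new to the idea, the loop itself is compact. A minimal sketch
of a retrospective active-learning run on a fully labeled pool, tracking
convergence as in point (1); the data, model, and parameter choices below
are all illustrative stand-ins:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

rng = np.random.RandomState(0)
X = rng.rand(500, 64)                        # stand-in descriptors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # stand-in activity labels

# Seed with five examples per class, then grow by uncertainty sampling.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

for it in range(20):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[labeled], y[labeled])
    # Watch convergence by scoring the not-yet-selected remainder.
    mcc = matthews_corrcoef(y[pool], model.predict(X[pool]))
    print("iter %2d: %3d labeled, MCC on remainder = %.3f"
          % (it, len(labeled), mcc))
    # Pick the pool example whose predicted P(active) is closest to 0.5.
    proba = model.predict_proba(X[pool])[:, 1]
    labeled.append(pool.pop(int(np.argmin(np.abs(proba - 0.5)))))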

Also, with regards to point (1) and evaluation metrics, I recently came up
with an idea to generically analyze the nature of 2-class prediction
performance metrics independent of the model methodology used:
(2018) Molecular Informatics : https://dx.doi.org/10.1002/minf.201700127
(Open Access)
You can find the philosophy of this article embedded in the active learning
experiments performed in the 2018 ChemMedChem article.

If you or anyone else on this list is interested in active learning and
chemistry, please drop me a line.

Again - very nice job, and best wishes for continued development.

Sincerely,
J.B. Brown
Kyoto University Graduate School of Medicine


2018-02-19 16:45 GMT+09:00 Tivadar Danka :

> Dear scikit-learn community!
>
> It is my pleasure to announce modAL, a modular active learning framework
> for Python3, built on top of scikit-learn. Designed with modularity,
> flexibility and extensibility in mind, it allows the rapid development of
> active learning workflows with nearly complete freedom. It is aimed at
> researchers and practitioners for whom fast prototyping is essential for
> testing and developing active learning pipelines.
>
> modAL is quite young and under constant improvement. Any feedback, feature
> request or contribution are very welcome!
>
> The package can be installed via pip:
> pip3 install modAL
>
> The repository, tutorials and documentation are available at
>- GitHub: https://github.com/cosmic-cortex/modAL
>- Webpage: https://cosmic-cortex.github.io/modAL
>
> Cheers,
> Tivadar
>
> --
> Tivadar Danka
> postdoctoral researcher
> BIOMAG group, MTA-BRC
> http://www.tivadardanka.com
> twitter: @TivadarDanka 
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] A necessary feature for Decision trees

2018-01-03 Thread Brown J.B. via scikit-learn
Dear Yang Li,

> Neither the classificationTree nor the regressionTree supports
> categorical features. That means the decision tree models can only accept
> continuous features.

Consider either manually encoding your categories in bitstrings (e.g.,
"Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder
to do the same thing for you automatically.
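
A minimal sketch of the OneHotEncoder route, with illustrative data (recent
scikit-learn versions accept string categories directly; older versions
require integer-encoded input):

from sklearn.preprocessing import OneHotEncoder

X = [["Facebook"], ["Twitter"], ["Google"], ["Twitter"]]
enc = OneHotEncoder()
X_bits = enc.fit_transform(X).toarray()   # one indicator column per category
print(enc.categories_)   # [array(['Facebook', 'Google', 'Twitter'], ...)]
print(X_bits)            # rows like [1. 0. 0.], [0. 0. 1.], ...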

Cheers,
J.B.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] MLPClassifier as a feature selector

2017-12-06 Thread Brown J.B. via scikit-learn
I am also very interested in knowing if there is a sklearn cookbook
solution for getting the weights of a one-hidden-layer MLPClassifier.
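
In the meantime, a hedged partial answer: the fitted weights are exposed as
the coefs_ and intercepts_ attributes, so the hidden-layer activations can
be recomputed by hand. A minimal sketch with stand-in data; the
architecture here is an illustrative assumption:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                    max_iter=1000, random_state=0).fit(X, y)

# Forward pass through the single hidden layer (matching activation="relu").
hidden = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])
print(hidden.shape)   # (200, 32): usable as features for a downstream regressor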
J.B.

2017-12-07 8:49 GMT+09:00 Thomas Evangelidis :

> Greetings,
>
> I want to train a MLPClassifier with one hidden layer and use it as a
> feature selector for an MLPRegressor.
> Is it possible to get the values of the neurons from the last hidden layer
> of the MLPClassifier to pass them as input to the MLPRegressor?
>
> If it is not possible with scikit-learn, is anyone aware of any
> scikit-compatible NN library that offers this functionality? For example
> this one:
>
> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>
> I wouldn't like to do this in Tensorflow because the MLP there is much
> slower than scikit-learn's implementation.
>
>
> Thomas
>
>
> --
>
> ==
>
> Dr Thomas Evangelidis
>
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>
>   teva...@gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] scikit-learn Digest, Vol 19, Issue 37

2017-10-17 Thread Brown J.B. via scikit-learn
2017-10-18 12:18 GMT+09:00 Ismael Lemhadri :

> How about editing the various chunks of code concerned to add the option
> to scale the parameters, and set it by default to NOT scale? This would
> make what happens clear without the redundancy Andreas mentioned, and would
> add more convenience to the user shall they want to scale their data.
>

From my perspective:

That's a very nice, rational idea.
For end users, it preserves compatibility of existing codebases, while
allowing near-effortless updating of code for those who want to use
scikit-learn's scaling, as well as easy adoption for new users and tools.

One issue of caution would be where the scaling occurs, such as globally
before any cross-validation, or per-split with the transformation stored
and applied to prediction data per fold of CV.
One more keyword argument would need to be added to allow user
specification of this, and a state variable would have to be stored and
accessible from the methods of the parent estimator.
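
For reference, the per-split behavior is what a Pipeline already provides:
the scaler is fit on each training fold only and then applied to the
corresponding held-out fold. A minimal sketch (dataset and estimator
choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5))   # scaling refit within each fold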

J.B.




Re: [scikit-learn] Remembering Raghav, our friend, and a scikit-learn contributor

2017-10-06 Thread Brown J.B. via scikit-learn
This is truly, truly sad news.
Leaving the home country you grew up in to find your way in a new language
and culture takes considerable effort, and to thrive at it takes even more
effort.
He was to be commended for that.

I think many of us knew of his enthusiasm for the project and benefited
greatly from it.
May his family and friends know of his contribution, and may he rest
peacefully.

J.B. Brown

2017-10-06 21:04 GMT+09:00 Gael Varoquaux :

> Raghav was a core contributor to scikit-learn. Venkat Raghav Rajagopalan,
> or @raghavrv (as we knew him) appeared out of the blue and started
> contributing early 2015. From Chennai, he was helping us make scikit-learn
> a better library. As often in open source, he was working with people that
> he had never met in person, to improve a tool used by the whole world. He
> successfully completed a Google Summer of Code for the project that year,
> and then was hired as a full-time engineer to work on the project in Paris.
> Raghav was excited to join the scikit-learn team. When he became core
> contributor, in 2016 he said that it was a highlight of his year.
>
> In Paris, we got to know him and enjoy him. Raghav was a very enthusiastic
> and easygoing person. It was a delight to have him around. For
> scikit-learn, he was a huge driver. He tackled a large number of issues,
> including tedious and difficult ones such as revamping our cross-validation
> API, multiple-metrics support in grid search, and 32-bit support in various
> models.
>
> Raghav had left India to live an adventure in a new culture. Curious and
> goal-driven, he had found his own way. He was growing fast, moving from
> student to expert, on his way to a bright future.
>
> Raghav passed away a month ago. We have been in shock and sorrow here in
> Paris. He will be deeply missed.
>
> Gael Varoquaux and Alexandre Gramfort
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn