Re: [scikit-learn] Scikit-learn PR: "Linux pymin_conda_defaults_openblas" "Upload to Codecov" Failed

2024-04-11 Thread Gael Varoquaux
We're happy that you figured this out. Thanks for your perseverance, and sorry 
for not having suggestions.

Best,

Gaël

On Thu, Apr 11, 2024 at 07:22:06PM +0100, H C wrote:
> Just a follow-up to say that I fixed the error by contacting the Codecov staff:
> it was a Codecov server error. I did an empty commit and the checks were
> successful.
> Sorry for the inconvenience.

> H C <[1]henrique.caroc...@gmail.com> wrote (Saturday, 6/04/2024 at 18:20):


> Good afternoon, 

> When I opened a pull request to scikit-learn, the "Upload to Codecov" step of
> the "Linux pymin_conda_defaults_openblas" pipeline failed, with error:

> "[2024-04-05T12:52:37.829Z] ['error'] There was an error running the
> uploader: Error uploading to [2]https://codecov.io: Error: There was an
> error fetching the storage URL during POST: 503 - upstream connect error 
> or
> disconnect/reset before headers. reset reason: connection failure

> [2024-04-05T12:52:37.829Z] ['verbose'] The error stack is: Error: Error
> uploading to [3]https://codecov.io: Error: There was an error fetching the
> storage URL during POST: 503 - upstream connect error or disconnect/reset
> before headers. reset reason: connection failure at main (/snapshot/repo/
> dist/src/index.js) at process.processTicksAndRejections (node:internal/
> process/task_queues:95:5) [2024-04-05T12:52:37.829Z] ['verbose'] End of
> uploader: 474 milliseconds

> ##[error]Bash exited with code '255'."

> Everything else passed; any idea what could've caused the failure?

> I expected everything to pass. I strongly believe I did everything as
> stated in the Contributing section. The only thing I didn't do was update
> the changelog, but in the PR it states "Check Changelog / A reviewer will
> let you know if it is required or can be bypassed (pull_request)".

> The code tests also passed, both locally and in the pipelines (including a
> test I added).


> Error link: [4]https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=65586=logs=66042141-7fd2-581d-812e-1a1b1d5e0f0c=27a9f5da-2f14-5f39-0a4f-501f786ad84b


> Thank you,

> Henrique.


> References:

> [1] mailto:henrique.caroc...@gmail.com
> [2] https://codecov.io/
> [3] https://codecov.io/
> [4] 
> https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=65586=logs=66042141-7fd2-581d-812e-1a1b1d5e0f0c=27a9f5da-2f14-5f39-0a4f-501f786ad84b



-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Permission to print two of your images

2024-03-04 Thread Gael Varoquaux
Dear Rosita,

The first image is not by us, but by DataCamp, so I cannot comment on it.

With regards to the second image, it is covered by our BSD licence. Hence you 
have our permission to use it.

Best,

Gaël

On Mon, Mar 04, 2024 at 07:51:20AM +, DI LEO Rosita via scikit-learn wrote:
> Dear Ms./Mr.,

> My name is Rosita Di Leo; I am a working student at the University FH JOANNEUM
> in Graz, Austria.
> I am writing this email to kindly ask your permission to print and display,
> in our institution, two of the pictures your company created.

> Thank you for your time and consideration.

> Kind regards,
> Rosita Di Leo


> FH JOANNEUM Gesellschaft mbH
> Rechtsform/Legal form: GmbH
> Sitz: Graz
> Firmenbuchgericht/Court of registry: Landesgericht für ZRS Graz
> Firmenbuchnummer/Company registration: FN 125888 f
> UID-Nr.: ATU 42361001
> https://www.fh-joanneum.at/impressum



-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] [ANN] New core contributor: Yao Xiao

2024-02-19 Thread Gael Varoquaux
I'm also super happy to have you around, Yao! I've really enjoyed your work.

Gaël

On Mon, Feb 19, 2024 at 09:04:10AM +0100, Adrin wrote:
> Excited to have you on board Yao! Thanks for your contributions.

> On Mon, Feb 19, 2024, 08:37 Guillaume Lemaître <[1]g.lemaitr...@gmail.com>
> wrote:

> We are excited to welcome Yao Xiao ([2]https://github.com/Charlie-XIAO) as
> a core contributor of the scikit-learn project.

> Your past contributions are greatly appreciated, and I'm looking forward
> to working further with you.

> On behalf of the scikit-learn team.
-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] video: scikit-learn's Past, Present and Future — with scikit-learn co-founder Dr. Gaël Varoquaux

2023-12-07 Thread Gael Varoquaux
Hi Reshama,

Thanks for putting me in contact with Jon, it was a great experience. I
haven't had time to listen to the video myself. I hope that I haven't said too
much nonsense :$

G

On Thu, Dec 07, 2023 at 11:15:09AM -0500, Reshama Shaikh wrote:
> Hello,


> In this episode, Gaël details:

> • The genesis, present capabilities and fast-moving future direction of
> scikit-learn.

> • How to best apply scikit-learn to your particular ML problem.

> • How ever-larger datasets and GPU-based accelerations impact the scikit-learn
> project.

> • How (whether you write code or not!) you can get started on contributing
> to a mega-impactful open-source project like scikit-learn yourself.

> • Hugely successful social-impact data projects his Soda lab has had recently.

> • Why statistical rigor is more important than ever and how software tools
> could nudge us in the direction of making more statistically sound decisions.

> VIDEO interview: https://www.jonkrohn.com/posts/2023/12/5/scikit-learns-past-present-and-future-with-scikit-learn-co-founder-dr-gal-varoquaux

> 
> Best,
> Reshama Shaikh



-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Request / Proposal: integrating IEEE paper in scikit-learn as "feature_selection.EFS / EFSCV" and cancer_benchmark datasets

2023-09-24 Thread Gael Varoquaux
or is open to discussion)
> has some first thoughts like this:
>  
> RFE has:

> feature_selection.RFE(estimator, *[, ...])    Feature ranking with recursive feature elimination.
> feature_selection.RFECV(estimator, *[, ...])  Recursive feature elimination with cross-validation to select features.

>  The "EFS" could have:

> feature_selection.EFS(estimator, *[, ...])    Feature ranking and feature elimination with 8 different algorithms (SFE, SFE-PSO, etc.); new algorithms could be added and benchmarked (evolutionary computing, swarm, genetic, etc.).
> feature_selection.EFSCV(estimator, *[, ...])  Feature elimination with cross-validation to select features.

> Looking forward to an open discussion on whether Evolutionary Feature
> Selection (EFS) is something for the sklearn project, or maybe a separate
> pip-installable package.

> Kind regards
> Dalibor Hrg
> https://www.linkedin.com/in/daliborhrg/
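
For reference, a minimal sketch of the existing RFE/RFECV usage that the
proposal mirrors (real scikit-learn classes; the dataset below is synthetic):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5).fit(X, y)
    print(selector.support_)  # boolean mask of the selected features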

> On Fri, Sep 22, 2023 at 10:50 AM Behrooz Ahadzade wrote:



> Dear Dalibor Hrg,

> Thank you very much for your attention to the SFE algorithm, and thank
> you very much for the time you took to guide me and my colleagues.
> Following your guidance, we will add this algorithm to the
> scikit-learn library as soon as possible.

> Kind regards,
> Ahadzadeh.
> On Wednesday, September 13, 2023 at 12:22:04 AM GMT+3:30, Dalibor
> Hrg  wrote:


> Dear Authors,

> you have done some amazing work on feature selection here, published
> in IEEE: https://arxiv.org/abs/2303.10182 . I have noticed Python
> code here without a LICENSE file or any info on this:
> https://github.com/Ahadzadeh2022/SFE and in the paper some links are
> mentioned for downloading the data.

> I would be interested with you that we:

> Step 1) make and release a pip package, publish this code in JOSS
> (https://joss.readthedocs.io, e.g.
> https://joss.theoj.org/papers/10.21105/joss.04611) under a BSD-3 license,
> and replicate the IEEE paper's table results. All 8 algorithms could
> potentially be in one class "EFS", meaning "Evolutionary Feature Selection",
> selectable as 8 options, among them SFE. Or something like that.
>   
> Step 2) try to integrate it and work with the scikit-learn people. I would
> recommend integrating this under
> https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
> similarly to sklearn.feature_selection.RFE. I believe this would
> be a great contribution to the best open library for ML,
> scikit-learn.

> I am unsure what the status of the datasets and their licenses is.
> But the datasets could be fetched externally from the OpenML.org
> repository, for example via
> https://scikit-learn.org/stable/datasets/loading_other_datasets.html, or
> from CERN Zenodo, where the "benchmark datasets" could be expanded. It
> depends a bit on the dataset licenses.

> Overall, I hope this can hugely increase the visibility of your published
> work, and also let others credit you in papers in a more citable and
> replicable way. I believe your IEEE paper and work definitely deserve a
> spot in scikit-learn. There is a need for replicable code on "Evolutionary
> Methods for Feature Selection" and for such a benchmark on life-science
> datasets, and you have done some great work so far.

> Let me know what you think. 

> Best regards,
> Dalibor Hrg

> https://www.linkedin.com/in/daliborhrg/




-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Welcome to our new documentation team!

2023-08-12 Thread Gael Varoquaux
I'm so happy about this!! Thanks, Lucy and Arturo!!!

Gaël

On Sat, Aug 12, 2023 at 01:48:02PM +0200, Adrin wrote:
> Hi there,

> We're excited to announce the addition of a new documentation team to our
> governance. This team includes Lucy Liu and Arturo Amor to start with. We're
> very happy to have them on board as they've been helping us and contributing
> to the project quite a lot.

> We hope that this change better recognizes their efforts and contributions,
> and that it'll further empower them in what they do.

> Regards,
> Adrin



-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] [ANN] scikit-learn 1.3.0rc1 is online!

2023-06-16 Thread Gael Varoquaux
Yes indeed. Thank you heaps Jérémie!!

Gaël

On Fri, Jun 16, 2023 at 08:09:23AM +0200, Guillaume Lemaître wrote:
> Thank you Jeremie for taking care of this release.

> On Thu, 15 Jun 2023 at 21:33, Jeremie du Boisberranger <
> jeremie.du-boisberran...@inria.fr> wrote:

> Hi everyone,

> Please help us test the first release candidate for scikit-learn 1.3.0:

>     pip install scikit-learn==1.3.0rc1

> Changelog: https://scikit-learn.org/1.3/whats_new/v1.3.html

> In particular, if you maintain a project with a dependency on
> scikit-learn, please let us know about any regression.

> Thanks to everyone who contributed to this release!


> Jérémie,

> on behalf of the scikit-learn maintainer team.

-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] classification model that can handle missing values w/o learning from missing values

2023-03-03 Thread Gael Varoquaux
On Fri, Mar 03, 2023 at 10:22:04AM +, Martin Gütlein wrote:
> > 2. Ignores whether a value is missing or not for the inference
> What I meant is rather, that the missing value should NOT be treated as
> another possible value of the variable (this is e.g., what the
> HistGradientBoostingClassifier implementation in sk-learn does). Instead,
> multiple predictions could be done when a split-attribute is missing, and
> those can be averaged.

> This is how it is e.g. implemented in WEKA (we cannot switch to Java, though
> ;-):
> http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
> and described by the inventors of the RF:
> https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

The text that you link to describes two types of strategies: one that is
similar to what is done in HistGradientBoosting, the other one that amounts to
imputation using a forest, which can be done in scikit-learn by setting up the
IterativeImputer to use forests as a base learner (this will however be slow).
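
A minimal sketch of that second strategy, assuming scikit-learn's experimental
IterativeImputer API:

    # Forest-based round-robin imputation, then a forest classifier.
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(
        IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                         random_state=0),
        RandomForestClassifier(random_state=0),
    )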

Cheers,

Gaël


Re: [scikit-learn] classification model that can handle missing values w/o learning from missing values

2023-03-02 Thread Gael Varoquaux
Dear Martin,

From what I understand, you want a classifier that:
1. Is not based on imputation
2. Ignores whether a value is missing or not for the inference

It seems to me that those two requirements are in contradiction, and it is not 
clear to me how such a classifier would be theoretically grounded.

Best,

Gaël

On Thu, Mar 02, 2023 at 09:01:45AM +, Martin Gütlein wrote:
> It would already help us if someone could confirm that this is not possible
> in scikit-learn, because we are still not entirely sure that we have not
> missed something.

> Regards,
> Martin

> Am 21.02.2023 15:48 schrieb Martin Gütlein:
> > Hi,

> > I am looking for a classification model in python that can handle
> > missing values, without imputation and "without learning from missing
> > values", i.e. without using the fact that the information is missing
> > for the inference.

> > Explained with the help of decision trees:
> > * The algorithm should NOT learn whether missing values should go to
> > the left or right child (like the HistGradientBoostingClassifier).
> > * Instead it could build the prediction for each child node and
> > aggregate these (like some Random Forest implementations do).

> > If that is not possible in scikit-learn, maybe you have already
> > discussed this? Or do you know of a fork of scikit-learn that is able to
> > do this, or some other Python library?

> > Any help would be really appreciated, kind regards,
> > Martin


> > P.S. Here is my use-case, in case you are interested: I have a binary
> > classification problem with a positive and a negative class, and two
> > types of features A and B. In my training data, I have a lot more data
> > (90%) where B is missing. In my test data, I always have B, which is
> > good because the B features are better than the A features. In the
> > cases where B is present in the training data, the ratio of positive
> > examples is much higher than when it is missing. So what
> > HistGradientBoostingClassifier does is use the fact that B is not
> > missing in the test data, and it predicts way too many positives.
> > (Additionally, some feature values of type A are also often missing.)

-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] mutual information for continuous variables with scikit-learn

2023-02-01 Thread Gael Varoquaux
For estimating mutual information on continuous variables, have a look at the 
corresponding package
https://pypi.org/project/mutual-info/
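
Within scikit-learn itself there is also a kNN-based estimator that handles
continuous variables without binning; a minimal sketch, assuming
sklearn.feature_selection.mutual_info_regression:

    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.RandomState(0)
    x = rng.standard_normal(300)
    y = 0.6 * x + 0.8 * rng.standard_normal(300)  # correlated pair
    # kNN-based estimate; no histogram binning needed
    mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)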

G

On Wed, Feb 01, 2023 at 02:32:03PM +0100, m m wrote:
> Hello,

> I have two continuous variables (heart rate samples over a period of time), 
> and
> would like to compute mutual information between them as a measure of
> similarity.

> I've read some posts suggesting using mutual_info_score from scikit-learn,
> but will this work for continuous variables? One stackoverflow answer
> suggested converting the data into probabilities with np.histogram2d() and
> passing the contingency table to mutual_info_score.

> import numpy as np
> from sklearn.metrics import mutual_info_score

> def calc_MI(x, y, bins):
>     c_xy = np.histogram2d(x, y, bins)[0]
>     mi = mutual_info_score(None, None, contingency=c_xy)
>     return mi

> # generate data
> L = np.linalg.cholesky( [[1.0, 0.60], [0.60, 1.0]])
> uncorrelated = np.random.standard_normal((2, 300))
> correlated = np.dot(L, uncorrelated)
> A = correlated[0]
> B = correlated[1]
> x = (A - np.mean(A)) / np.std(A)
> y = (B - np.mean(B)) / np.std(B)

> # calculate MI
> mi = calc_MI(x, y, 50)

> Is calc_MI a valid approach? I'm asking because I also read that when 
> variables
> are continuous, then the sums in the formula for discrete data become
> integrals, but I'm not sure if this procedure is implemented in scikit-learn?

> Thanks!



-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] [ANN] scikit-learn 1.1 release

2022-05-12 Thread Gael Varoquaux
Wohoo!! Thank you so much. This is so exciting: all those nice
improvements reaching so many users.

Gaël

On Thu, May 12, 2022 at 05:20:24PM +0200, Jeremie du Boisberranger wrote:
> Hi everyone,

> We're happy to announce the 1.1 release which you can install via pip or
> conda:

>     pip install -U scikit-learn

> or (soon)

>     conda install -c conda-forge scikit-learn

> The wheels for arm64 are not available yet on PyPI though. We'll add them as
> soon as possible.

> You can read the release highlights under 
> https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_1_0.html
> and the long list of the changes under
> https://scikit-learn.org/stable/whats_new/v1.1.html

> This version supports Python versions 3.8 to 3.10.

> A big thanks to all contributors who helped on this release.

> Regards,
> Jérémie,
> On behalf of the scikit-learn maintainer team.

-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


[scikit-learn] Scikit-learn got a prize in France

2022-02-05 Thread Gael Varoquaux
Hi everyone,

It has just been announced that scikit-learn has received a prize for
open-source scientific software from the French government:
https://twitter.com/Osec2022/status/1489973348637585416

I knew that we were short-listed, but I only got confirmation that we
would receive it a few minutes ago.

I'm listed on the tweet, but the prize is for the software and for the
whole community, not for an individual, which I of course mentioned in
the acceptance speech. As far as I know, this is not going to make any
one of us rich, unfortunately: I think that there is no money involved.

Thanks a lot for being part of this adventure,

Gaël

-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux



[scikit-learn] Upcoming: 2nd edition of "Machine learning with scikit-learn MOOC"

2022-01-23 Thread Gael Varoquaux
Hi everyone,

The team at Inria, with the help of the Inria learning lab, will soon be 
opening the 2nd edition of the "Machine Learning with scikit-learn" MOOC:
https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/

The content of the MOOC is visible here (we are still polishing details, this 
is not final):
https://inria.github.io/scikit-learn-mooc/
As you can see, it touches all the basics of machine learning, introduced with 
scikit-learn, teaching much more than the API of the library.

We have put a lot of effort into being didactic. Anna Kondratenko, one of last
year's participants, said of last year's edition:
"I did a #ScikitLearnMooc course as part of a #100DaysOfCode challenge and I 
just loved it. Scikit-learn creators managed to make it practice-focused and 
entertaining at the same time. Also, it is perfect for beginners since it 
starts from the basics going to more advanced level."
https://twitter.com/anacoding/status/1484949583629369344
This year's edition should be significantly more didactic!

One of the benefits of participating in the MOOC, compared to just the material
that we provide on the web, is that it is full of coding exercises that are
meant to teach an understanding of machine learning, as well as coding skills.

The MOOC is absolutely free, and all the materials are open (in the spirit of 
scikit-learn).

While many people on this list may already know the contents of this MOOC
(though we have inserted many useful reflections), you might know people who
could benefit from this course to learn machine learning. Please help us spread
the word.

Pythonly yours,

Gaël

-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


[scikit-learn] DirtyData and the SuperVectorizer, for non-normalized dataframes

2021-10-13 Thread Gael Varoquaux
Dear scikit-learn community,

I would like to announce a new release of dirty-cat, which strives to
facilitate machine learning on non-curated categories: it is robust to
morphological variants, such as typos.

The new big feature, which I think is of interest to many, is the
"SuperVectorizer", which strives to readily vectorize a pandas dataframe:
https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#example-super-vectorizer
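
A minimal sketch of its use, assuming the dirty_cat package as released then
(the column names below are made up for illustration):

    import pandas as pd
    from dirty_cat import SuperVectorizer

    df = pd.DataFrame({
        "employee_position": ["Senior Engineer", "Snior Engineer", "Manager"],
        "age": [35, 42, 51],
    })
    # Heuristically picks an encoder for each column (typo-robust for text).
    X = SuperVectorizer().fit_transform(df)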

Of course, such an object is full of heuristics. We have tuned them
empirically, but we expect more progress in the long term, as we build a
bigger database of dataframes that are difficult to vectorize. We'd love
people to join the adventure; it's been fun so far.

Cheers,

Gaël

-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] [TC Vote] Technical Committee vote: line length

2021-10-05 Thread Gael Varoquaux
Hi everyone,

I left for vacations and forgot this (and did not express my vote).

The TC has had plenty of time to vote; my own vote is in favor of the
consensus among the very active developers.

My count of the expressed vote is the following:

- Keep current 88 characters:

Olivier Grisel
Joel Nothman
Gaël Varoquaux

- Revert to 79 characters:

   Alex Gramfort
   Adrin Jalali

- Answer with no preference expressed:

   Roman Yurchak

So the decision is to use 88 chars, which means no action is needed.

Thank you everyone!

Gaël

On Mon, Aug 02, 2021 at 11:15:48AM +0200, Roman Yurchak wrote:
> I also don't have a strong opinion on this, and generally I'm just happy
> that black migration happened.

> Still with a slight preference for 88 characters as the default.

> On 28/07/2021 18:34, Olivier Grisel wrote:
> > Many very active core devs not represented in the TC voted for 88 and
> > my previous vote for 79 was not that strong. So I feel that I should
> > now vote for 88:

> > Keep current 88 characters:

> > Olivier

> > Revert to 79 characters:


-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] [ANNOUNCEMENT] scikit-learn 1.0 release

2021-09-27 Thread Gael Varoquaux
Thank you to Adrin for stewarding this release, and congratulation to all
the team for merging all the improvements.

Scikit-learn is a foundation of machine learning in Python. Solid
releases and stability over time is a service to the community.

Gaël

On Fri, Sep 24, 2021 at 06:38:40PM +0200, Adrin wrote:
> Hi everyone,

> We're happy to announce the 1.0 release which you can install via pip or 
> conda:

>     pip install -U scikit-learn

> or

>     conda install -c conda-forge scikit-learn

> You can read the release highlights under
> https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_0_0.html
> and the long list of the changes under
> https://scikit-learn.org/stable/whats_new/v1.0.html

> New major features include: mandatory keyword arguments in many places, Spline
> Transformers, Quantile Regressor, Feature Names Support, a more flexible
> plotting API, Online One-Class SVM, and much more!

> This version supports Python versions 3.7 to 3.9.

> A big thanks to all contributors for making this release possible.

> Regards,
> Adrin,
> On behalf of the scikit-learn maintainer team.



-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


[scikit-learn] [TC Vote] Technical Committee vote: line length

2021-07-26 Thread Gael Varoquaux
This email is meant for the scikit-learn Technical Committee, and is on
the public mailing list for transparency.

The community has not been able to reach a strong consensus on an
incredibly important decision: line length :)
https://doodle.com/poll/wpp7c8343zy46v93

So, the TCs need to make a vote. Members of the TC, please vote below by
adding your name:

Keep current 88 characters:

Revert to 79 characters:

As a reminder, the technical committee is made of: Alexandre Gramfort,
Olivier Grisel, Adrin Jalali, Andreas Müller, Joel Nothman, Hanmin Qin,
Gaël Varoquaux, and Roman Yurchak (according to
https://scikit-learn.org/stable/governance.html)

We have one week to vote, but if we do it in less time, no one will
complain.

Thanks!

Gaël

-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Is there a model for truncated regression in sklearn?

2021-06-08 Thread Gael Varoquaux
Hi,

Scikit-learn does not cover this problem.

I think that it relates to what is called survival analysis. You'll find
a survival analysis package in Python at
https://lifelines.readthedocs.io/en/latest/
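
A minimal sketch, assuming the lifelines package (its `entry` argument handles
left-truncated observations):

    from lifelines import KaplanMeierFitter

    kmf = KaplanMeierFitter()
    # durations, event indicators, and late-entry (truncation) times
    kmf.fit(durations=[5, 6, 8, 9], event_observed=[1, 0, 1, 1],
            entry=[0, 2, 3, 0])
    print(kmf.survival_function_)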

Best,

Gaël

On Tue, Jun 08, 2021 at 04:22:14PM +0900, Francois Berenger wrote:
> Hello,

> https://en.wikipedia.org/wiki/Truncated_regression_model

> Sometimes, data have missing samples when the target variable
> is above or below a threshold value.
> This is very often the case for biochemical data (e.g. target
> variable outside detection range of some lab equipment).

> I highly suspect some specific models could handle such datasets
> better than generic methods (i.e. train better models).

> Some points of entry, if that might help:

> - R has a truncreg package
>   https://cran.r-project.org/web/packages/truncreg/index.html
> - a related paper from the wikipedia page:
>   "Local likelihood estimation of truncated regression and
>   its partial derivatives: Theory and application"
> https://hal.archives-ouvertes.fr/hal-00520650/file/PEER_stage2_10.1016%252Fj.jeconom.2008.08.007.pdf

> I can provide a cleaned public regression dataset for tests, if someone is
> interested (there are many such datasets in ChEMBL and PubChem by the way,
> but you need to know how to "featurize"/encode molecules).

> Regards,
> F.

-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


[scikit-learn] Presented scikit-learn to the French President

2020-12-04 Thread Gael Varoquaux
Hi scikit-learn community,

Today, I presented some efforts in digital health to the French president
and part of the government. As these efforts were partly powered by
scikit-learn (and the whole pydata stack, to be fair), the team in charge
of the event had printed a huge scikit-learn logo behind me:
https://twitter.com/GaelVaroquaux/status/1334959438059462659 (terrible
mobile-phone picture)

I would have liked to get a picture with the president and the logo, but
it seems that they are releasing only a handful of pictures :(. Anyhow... 


Thanks to the community! This is a huge success. For health topics (we
are talking nationwide electronic health records) the ability to build on
an independent open-source stack is extremely important. We, as a wider
community, are building something priceless.

Cheers,

Gaël


Re: [scikit-learn] Changes in Travis billing

2020-11-26 Thread Gael Varoquaux
On Thu, Nov 26, 2020 at 02:45:33PM +0100, Adrin wrote:
> At this point I'm at a loss, and reading the NumFocus chat and other
> packages' experience with them on the same topic, it seems like we just
> need to move out of Travis.

Agreed. Do we still need them for something essential?

G


[scikit-learn] Changes in Travis billing

2020-11-02 Thread Gael Varoquaux
Travis is changing its billing strategy:
https://blog.travis-ci.com/2020-11-02-travis-ci-new-billing

Open repositories are getting a free initial set of credits. They invite
open-source projects to contact them to benefit from a more liberal
policy.

I suggest that we do the latter, as I fear that we might run out of
credits, and I am quite convinced that we could benefit from the liberal
policy.

Cheers,

Gaël

-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] License questions

2020-09-23 Thread Gael Varoquaux
Hi,

Yes you may, if you give credit. The license is BSD 3 clause.

Best,

Gaël

On Wed, Sep 23, 2020 at 02:57:03PM +0900, Hideshi Takami wrote:
> Nice to meet you.
> I have a question.
> I am creating teaching materials on machine learning for commercial purposes.
> Can I use screenshots from the website (https://scikit-learn.org/) for videos
> and presentation materials?  



-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


[scikit-learn] New members of the triage team: Albert, Lucy, & Reshama

2020-09-16 Thread Gael Varoquaux
We are excited to welcome new members to the triage team:

* Albert Thomas https://github.com/albertcthomas
* Lucy Liu https://github.com/lucyleeow
* Reshama Shaikh https://github.com/reshamas

Their thorough work on helping the community is much appreciated.

Thanks!!

Gaël

-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Fwd: [Numpy-discussion] start of an array (tensor) and dataframe API standardization initiative

2020-08-18 Thread Gael Varoquaux
Yes, I think that I kickstarted this a few months ago:
https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267

I really hope that this will help us serve the community better in
scikit-learn!

G

On Tue, Aug 18, 2020 at 05:11:46PM +0200, Adrin wrote:
> FYI: Related to the data-frame like discussions we've been having.

> -- Forwarded message -
> From: Ralf Gommers 
> Date: Mon., Aug. 17, 2020, 22:35
> Subject: [Numpy-discussion] start of an array (tensor) and dataframe API
> standardization initiative
> To: Discussion of Numerical Python 


> Hi all,

> I'd like to share this announcement blog post about the creation of a
> consortium for array and dataframe API standardization here:
> https://data-apis.org/blog/announcing_the_consortium/. It's still in the beginning
> stages, but starting to take shape. We have participation from one or more
> maintainers of most array and tensor libraries - NumPy, TensorFlow, PyTorch,
> MXNet, Dask, JAX, Xarray. Stephan Hoyer, Travis Oliphant and myself have been
> providing input from a NumPy perspective.

> The effort is very much related to some of the interoperability work we've
> been doing in NumPy (e.g. it could provide an answer to what's described in
> https://numpy.org/neps/nep-0037-array-module.html#requesting-restricted-subsets-of-numpy-s-api).

> At this point we're looking for feedback from maintainers at a high level (see
> the blog post for details).

> Also important: the python-record-api tooling and data in its repo has very
> granular API usage data, of the kind we could really use when making decisions
> that impact backwards compatibility.

> Cheers,
> Ralf




-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] climate friendly software licence

2020-06-29 Thread Gael Varoquaux
Hi Sole,

I personally believe that global warming is the most important threat to
our well-being, together with the rise of fascism.

However, legal matters are seldom easy (IANAL). It is unclear whether those
licenses are enforceable. See for instance the discussion from Bruce
Perens, who has a huge amount of experience in open-source licensing:
https://perens.com/2019/10/12/invasion-of-the-ethical-licenses/
(credit to Andy Mueller for digging up this reference).

The more common a software license is, the more likely a team is to hold
up in court, and the less likely a team is to have legal fees to cover
(which would kill us, as a project).

Best,

Gaël

On Mon, Jun 29, 2020 at 08:06:58AM +, Sole Galli via scikit-learn wrote:
> Hello Scikit-learn team,

> I've come across this:
> https://twitter.com/tristanharris/status/1277136696568508418?s=12



> Basically, it is an initiative to include in a software license a prohibition
> of use by fossil fuel extractivist companies.

> I would like to know your views on this? Is this something that you would pick
> up from Scikit-learn?

> Are there some legal concerns to be aware of? or anything else that should be
> considered?

> Because it sounds quite powerful and straightforward to me.

> I would be really keen to hear from you.

> Thanks a lot

> Sole



-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?

2020-04-30 Thread Gael Varoquaux
On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
> I've used R and Stata software, none needs such transformation. They have a
> data type called "factors", which is different from "numeric".

> My problem with OHE:
> One-hot-encoding results in a large number of features. This really blows up
> quickly. And I have to fight the curse of dimensionality with PCA reduction.
> That's not cool!

Most statistical models still do one-hot encoding under the hood. So R
and Stata do it too.

Typically, tree-based models can be adapted to work directly on
categorical data. Ours don't. It's work in progress.
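
For illustration: pandas does have an equivalent of R's "factor" (the
categorical dtype), but scikit-learn estimators still need an explicit numeric
encoding; a minimal sketch:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red"])})
    # Returns a sparse matrix, which limits the memory blow-up.
    X = OneHotEncoder().fit_transform(df)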

G


Re: [scikit-learn] MLPClassifier/Regressor and Kernel Processes when Multiprocessing

2020-04-28 Thread Gael Varoquaux
Hi,

I cannot look too much in details. However, I would advice you to try
using loky or joblib instead of multiprocessing, as a lot of work has
been put in them to protect against problems that can arise in
multi-process parallel computing (for instance the underlying numerical
libraries may not be fork safe, or they may have parallel computing
abilities themselves).
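
A minimal sketch of the joblib route (the default loky backend gives robust,
fork-safe worker processes); the candidate grid below is made up for
illustration:

    from joblib import Parallel, delayed
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    def score(params):
        # one randomized-search candidate, scored with cross-validation
        return cross_val_score(MLPClassifier(max_iter=300, **params),
                               X, y, cv=3).mean()

    candidates = [{"hidden_layer_sizes": (h,)} for h in (10, 25, 50)]
    scores = Parallel(n_jobs=3)(delayed(score)(p) for p in candidates)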

Hope this helps,

Gaël

On Tue, Apr 28, 2020 at 02:06:00PM -0500, Taylor J Keding wrote:
> Hi SciKit-Learn folks,

> I am building a stacked generalization classifier using the multilayer
> perceptron classifier as one of its submodels. All data have been
> preprocessed
> appropriately and I am tuning each submodel's hyperparameters with a 
> customized
> randomized search protocol (very similar to sklearn's RandomizedSearchCV).
> Importantly, I am using Python's Multiprocessing.Pool() to parallelize this
> search.

> When I start the hyperparameter search, jobs/threads do indeed spawn
> appropriately. Tuning other submodels (RandomForestClassifier, SVC,
> GradientBoostingClassifier, SGDClassifier) works perfectly, with each job
> (model with particular randomized parameters) being scored with
> cross_val_score and returning when the Pool of workers is complete. All is
> well until I reach
> the MLPClassifier model. Jobs spawn as with the other models, however, System
> CPU (Linux Kernel) processes surge and overwhelm my server. Approximately 20%
> of the CPUs are running User processes, while the other 80% of CPUS are 
> running
> System/Kernel processes, causing immense slow-down. Again, this only happens
> with the MLPClassifier - all other models run appropriately with ~98% User
> processes and ~2% System/Kernel processes.

> Is there something unique in the MLPClassifier/Regressor models that causes
> increased System/Kernel processes compared to other models? In an attempt to
> troubleshoot, I used sklearn's RandomizedSearchCV instead of my custom
> implementation and the same problems happen (with n_jobs specified in the same
> way).

> Any help with why the MLP models are behaving this way during multiprocessing
> is much appreciated.
> Best,
> Taylor Keding



-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee

2020-04-27 Thread Gael Varoquaux
+1

And thank you very much Adrin!

On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote:
> Hi All.

> Given all his recent contributions, I want to nominate Adrin Jalali to the
> Technical Committee:
> https://scikit-learn.org/stable/governance.html#technical-committee

> According to the governance document, this will require a discussion and
> vote.
> I think we can move to the vote immediately unless someone objects.

> Thanks for all your work Adrin!

> Cheers,
> Andy

-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] April 27th scikit-learn monthly meeting

2020-04-27 Thread Gael Varoquaux
I seem to be failing to get this to work. Am I the only one?

If not, we'll need a fallback. Any suggestions? We can use
http://meet.jit.si/ or https://whereby.com/ but I don't know if they will
handle the load.

G


On Mon, Apr 27, 2020 at 01:22:32PM +0200, Chiara Marmo wrote:
> Dear all,

> the zoom link used for the core-dev meeting had to be updated.
> The new link follows.

> Join the core-dev Zoom Meeting at
> https://us02web.zoom.us/j/2752786717

> Meeting ID: 275 278 6717

> See you there!

> Best,
> Chiara


> On Fri, Apr 24, 2020 at 12:29 PM Chiara Marmo  wrote:


> Hi all,

> The next scikit-learn monthly meeting will take place on Monday April 27th
> at the usual time:
> https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020=4=27=12=0=0=240=33=37=179=195

> While these meetings are mainly for core-devs to discuss the current
> topics, we're also happy to welcome non-core devs and other projects'
> maintainers! Feel free to join, using the following link:

> 
> https://anaconda.zoom.us/j/94399382811?pwd=cXBtQ2lTVEtVbFpVTkE3TVFxdEhqZz09

> Meeting ID: 943 9938 2811
> Password: 68473658


> If you plan to attend and you would like to discuss something specific
> about your contribution, please add your name (or GitHub handle) in the
> "Issue and comments from contributors" section of the public pad:

> https://hackmd.io/5c6LxpnWSzeaBwJfuX5gPA


> @core devs, please make sure to update your notes on Friday.


> Best,

> Chiara




-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team

2020-03-28 Thread Gael Varoquaux
Thanks for the link Andy. This is indeed very interesting!

On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
> > Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
> > MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
> > order).

> Maybe LinearRegression docstring should more strongly suggest to use Ridge
> with small regularization in practice.

Yes! I actually wonder if we should not remove LinearRegression. It
frightens me a bit that so many people use it. The only time that I've
seen it used in a scientific paper, it was a mistake and it shouldn't
have been used.

I seldom advocate for deprecating :).
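
As a one-line sketch of Roman's suggestion, assuming nothing beyond the
existing Ridge estimator:

    from sklearn.linear_model import Ridge

    # A small but non-zero alpha stabilizes the solution when the design
    # matrix is ill-conditioned, while staying close to plain least squares.
    model = Ridge(alpha=1e-3)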

G


Re: [scikit-learn] Using a new random number generator in libsvm and liblinear

2020-01-04 Thread Gael Varoquaux
Me neither.

The only drawback that I see is that we have a codebase that is drifting
more and more from upstream. But I think that that ship has sailed.

G

On Sat, Jan 04, 2020 at 01:49:50PM +0100, Alexandre Gramfort wrote:
> I don't foresee any issue with that.

> Alex

-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] Vote on SLEP010: n_features_in_ attribute

2019-12-03 Thread Gael Varoquaux
+1.

Great job!

Gaël
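
For context, a minimal illustration of the attribute the SLEP proposes,
assuming a scikit-learn version that implements it:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    est = LogisticRegression(max_iter=1000).fit(X, y)
    print(est.n_features_in_)  # 4: the number of features seen during fit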

On Tue, Dec 03, 2019 at 05:58:44PM -0500, Nicolas Hug wrote:
> +1

> On 12/3/19 5:40 PM, Adrin wrote:

> +1

> On Tue., Dec. 3, 2019, 23:28 Andreas Mueller,  wrote:

> +1

> On 12/3/19 5:09 PM, Nicolas Hug wrote:


> As per our Governance document, changes to API principles are to be
> established through an Enhancement Proposal (SLEP) from which any
> core developer can call for a vote on its acceptance.


> SLEP010: n_features_in_ attribute is up for a vote. Please see
> https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html

> This SLEP proposes the introduction of a public n_features_in_
> attribute for most estimators

> Core developers are invited to vote on this change until 4 January
> 2020 by replying to this email thread.

> All members of the community are welcome to comment on the proposal
> on this mailing list, or to propose minor changes through Issues
> and Pull Requests at
> https://github.com/scikit-learn/enhancement_proposals/.




-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] scikit-learn twitter account

2019-11-30 Thread Gael Varoquaux
Sounds good!

As a side note, I hope that the scikit-learn twitter account can be
something where we "ask for forgiveness rather than permission": the
consequences of getting something wrong are lighter than when
incorporating code in the library. Hopefully, this should enable us to
keep the twitter account active while minimizing the amount of time spent
on it.

My 2 cents,

Gaël

On Sat, Nov 30, 2019 at 05:33:18PM -0500, Nicolas Hug wrote:
> Adrin also proposed

> Hi there. We've repurposed this account and it will be used for
> scikit-learn related announcements. To follow day to day progress on the
> repo, please follow @sklearn_commits.

> Both are fine with me.


> For maximum reach, maybe we could:


> 1. tweet the release announcement from @scikit-learn
> 2. directly answer with the tweet indicating that we are re-purposing the
> account
> 3. have everyone retweet the first tweet

> Nicolas


> On 11/25/19 12:23 PM, Olivier Grisel wrote:

> I have created the https://twitter.com/sklearn_commits twitter account.

> I have applied to make this account a "Twitter Developer" account to
> be able to use https://github.com/filearts/tweethook to register it as
> a webhook for the main scikit-learn github repo.

> Once ready, I will remove the old webhook currently registered on
> @scikit_learn account and would like to tweet about the transfer as
> drafted here:

> https://hackmd.io/@4rHCRgfySZSdd5eMtfUJiA/H1CSpuF2S/edit

> Please feel free to let me know if you have any comment / suggestion
> about this plan.





-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] scikit-learn twitter account

2019-11-22 Thread Gael Varoquaux
> I would like to create @sklearn_commits instead of
> @scikit_learn_commits that is too long to my taste. Any opinion?

Some people do not make the link between "sklearn" and "scikit-learn" :)

We can address that in the name / bio, though.

Gaël


Re: [scikit-learn] scikit-learn twitter account

2019-11-05 Thread Gael Varoquaux
On Mon, Nov 04, 2019 at 10:14:26PM -0700, Andreas Mueller wrote:
> Should we re-purpose the existing twitter account or make a new one?
> https://twitter.com/scikit_learn

I think that we should repurpose it:

- Make a "scikit-learn-commits" twitter account that does what the
  current one does
- Use the current one.

G

> We do have 6k followers already!

> On 11/4/19 3:08 PM, Nelle Varoquaux wrote:

> I think that's a good idea as well!

> On Mon, 4 Nov 2019 at 15:06, Chiara Marmo  wrote:

> Be reassured Gael... no support via twitter... :)
> Just a way to centralize messages and reach people who ping us, to show
> that scikit-learn cares.

> On Mon, Nov 4, 2019 at 2:04 PM Gael Varoquaux <
> gael.varoqu...@normalesup.org> wrote:

> On Mon, Nov 04, 2019 at 05:41:31PM +0530, Siddharth Gupta wrote:
> > Would be good for the users to have a social media account to
> reach out to.

> I do not think that the point is to do support, but outreach.

> Gaël





-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] scikit-learn twitter account

2019-11-04 Thread Gael Varoquaux
On Mon, Nov 04, 2019 at 05:41:31PM +0530, Siddharth Gupta wrote:
> Would be good for the users to have a social media account to reach out to.

I do not think that the point is to do support, but outreach.

Gaël

> On Mon, 4 Nov 2019, 17:38 Nicolas Hug,  wrote:


> I like the idea as well

> On 11/4/19 5:58 AM, Adrin wrote:

> sounds pretty good to me :)

> On Mon, Nov 4, 2019 at 10:51 AM Chiara Marmo 
> wrote:

> Hello everybody,

> I've taken a look at the last meeting minutes: talking about
> releases and sprint announcements, it seems that the need for a
> centralized communication channel is rising, both from the user and
> dev sides.
> What about starting to use the scikit-learn twitter account for
> that?
> This will also help to animate the community; scikit-learn receives
> a lot of mentions which are never answered.
> I can help with managing the account if needed.

> WDYT?

> Chiara


-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux


Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 25

2019-10-16 Thread Gael Varoquaux
On Sun, Oct 13, 2019 at 07:40:11PM +0900, Brown J.B. via scikit-learn wrote:
> Please, respect and refinement when addressing the contributors and users of
> scikit-learn.

I believe that Mike simply misread. It's something that happens (it
happens a lot to me).

No harm on my side, and thanks for clarifying my overly short reply.

G

> Gael's statement is perfect -- complexity does not imply better prediction.
> The choice of estimator (and algorithm) depends on the structure of the model
> desired for the data presented.
> Estimator superiority cannot be proven in a context- and/or data-agnostic
> fashion.

> J.B.


> On Sun, Oct 13, 2019 at 6:13, Mike Smith wrote:

> "Second complexity does not
> > imply better prediction. " 

> Complexity doesn't imply prediction? Perhaps you're having a translation
> error.

> On Sat, Oct 12, 2019 at 2:04 PM  wrote:



> Today's Topics:

>    1. Re: scikit-learn Digest, Vol 43, Issue 24 (Mike Smith)


> --

> Message: 1
> Date: Sat, 12 Oct 2019 14:04:12 -0700
> From: Mike Smith 
> To: scikit-learn@python.org
> Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 24

> "...  > If I should expect good results on a pc, scikit says that
> needing
> gpu power is
> > obsolete, since certain scikit models perform better (than ml
> designed
> for gpu)
> > that are not designed for gpu, for that reason. Is this true?"

> Where do you see this written? I think that you are looking for overly
> simple stories that you are not true."

> Gael, see the below from the scikit-learn FAQ. You can also find this
> yourself at the main FAQ:

> [image: screenshot of the scikit-learn 0.21.3 FAQ page]


> On Sat, Oct 12, 2019 at 9:03 AM 
> wrote:



> > Today's Topics:

> >    1. Re: Is scikit-learn implying neural nets are the best
> >       regressor? (Gael Varoquaux)



> --

> > Message: 1
> > Date: Fri, 11 Oct 2019 13:34:33 -0400
> > From: Gael Varoquaux 
> > To: Scikit-learn mailing list 
> > Subject: Re: [scikit-learn] Is scikit-learn implying neural nets are
> >         the best regressor?

> > On Fri, Oct 11, 2019 at 10:10:32AM -0700, Mike Smith wrote:
> > > In other words, according to that arrangement, is scikit-learn implying
> > > that section 1.17 is the best regressor out of the listed, 1.1 to 1.17?

> > No.

> > First, they are not ordered in order of complexity (Naive Bayes is
> > arguably simpler than Gaussian Processes). Second, complexity does not
> > imply better prediction.
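
A minimal illustrative sketch (added here, not part of the thread) of that
point: under cross-validation, a plain linear model can out-score a more
complex MLP on small, noisy, essentially linear data:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Small, noisy, essentially linear data: the "simpler" model tends to win.
X, y = make_regression(n_samples=100, n_features=20, noise=10.0,
                       random_state=0)
for model in (Ridge(), MLPRegressor(max_iter=2000, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))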

> > > If I should expect good results on a pc, scikit says that needing
>  

Re: [scikit-learn] Is scikit-learn implying neural nets are the best regressor?

2019-10-11 Thread Gael Varoquaux
sing the snippet).
> >>         So apart from this second split, the other differences seems
> >>         to be numerical instability.

> >>         Where I have some concern is regarding the convergence rate
> >>         of SAGA but I have no
> >>         intuition to know if this is normal or not.

> >>         On Wed, 9 Oct 2019 at 23:22, Roman Yurchak
> >>         mailto:rth.yurc...@gmail.com>> wrote:

> >>             Ben,

> >>             I can confirm your results with penalty='none' and C=1e9.
> >>             In both cases,
> >>             you are running a mostly unpenalized logisitic
> >>             regression. Usually
> >>             that's less numerically stable than with a small
> >>             regularization,
> >>             depending on the data collinearity.

> >>             Running that same code with
> >>              - larger penalty (smaller C values)
> >>              - or larger number of samples
> >>             yields for me the same coefficients (up to some tolerance).

> >>             You can also see that SAGA convergence is not good by the
> >>             fact that it
> >>             needs 196000 epochs/iterations to converge.

> >>             Actually, I have often seen convergence issues with SAG
> >>             on small
> >>             datasets (in unit tests), not fully sure why.

> >>             --
> >>             Roman
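
A small illustrative sketch (made-up data, added here) of the behaviour
described above: with essentially no regularization (huge C), lbfgs and saga
can disagree, while a modest penalty brings their coefficients together:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for C in (1e9, 1.0):
    coef = {}
    for solver in ("lbfgs", "saga"):
        clf = LogisticRegression(C=C, solver=solver, max_iter=10000)
        clf.fit(X, y)  # saga may hit max_iter at C=1e9; that is the point
        coef[solver] = clf.coef_.ravel()
    gap = np.max(np.abs(coef["lbfgs"] - coef["saga"]))
    print("C=%g, max coefficient gap: %.2e" % (C, gap))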

> >>             On 09/10/2019 22:10, serafim loukas wrote:
> >>             > The predictions across solver are exactly the same when
> >>             I run the code.
> >>             > I am using 0.21.3 version. What is yours?
> >>             >
> >>             >
> >>             > In [13]: import sklearn
> >>             >
> >>             > In [14]: sklearn.__version__
> >>             > Out[14]: '0.21.3'
> >>             >
> >>             >
> >>             > Serafeim
> >>             >
> >>             >
> >>             >
> >>             >> On 9 Oct 2019, at 21:44, Benoît Presles
> >>             >> <benoit.pres...@u-bourgogne.fr> wrote:
> >>             >>
> >>             >> (y_pred_lbfgs==y_pred_saga).all() == False
> >>             >
> >>             >
> >>             > ___
> >>             > scikit-learn mailing list
> >>             > scikit-learn@python.org <mailto:scikit-learn@python.org>
> >>             > https://mail.python.org/mailman/listinfo/scikit-learn
> >>             >

> >>             ___
> >>             scikit-learn mailing list
> >>             scikit-learn@python.org <mailto:scikit-learn@python.org>
> >>             https://mail.python.org/mailman/listinfo/scikit-learn



> >>         --
> >>         Guillaume Lemaitre
> >>         Scikit-learn @ Inria Foundation
> >>         https://glemaitre.github.io/



> >>     --
> >>     Guillaume Lemaitre
> >>     Scikit-learn @ Inria Foundation
> >>     https://glemaitre.github.io/



> >> --
> >> Guillaume Lemaitre
> >> Scikit-learn @ Inria Foundation
> >> https://glemaitre.github.io/

> >> ___
> >> scikit-learn mailing list
> >> scikit-learn@python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn

> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn




> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Research Director, INRIA  Visiting professor, McGill 
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Vote on SLEP009: keyword only arguments

2019-09-16 Thread Gael Varoquaux
On Mon, Sep 16, 2019 at 11:28:57PM +1000, Joel Nothman wrote:
> That is, we could consider this resolved after 14 votes in favour.

> So far, if I've interpreted correctly:

> +1 (adrin, nicolas, hanmin, joel, guillaume, jeremie, thomas, vlad, roman) = 
> 9.

> I've not understood a clear position from Alex. I'm assuming Andreas is in
> favour given his comments elsewhere, but we've not seen comment here.

I was planning to vote -0 mostly to avoid the vote to seem like bandwagon
(and because I am not fully sold on the idea), but I actually want this
to move forward, and it seems that my vote is needed.

Hence, I vote +1.

Hopefully Andreas and Alex make their position clear and we can adopt the
SLEP.

Thank you to you all.

Gaël

> On Mon, 16 Sep 2019 at 20:06, Roman Yurchak  wrote:

> +1 assuming we are careful about continuing to allow some frequently
> used positional arguments, even in __init__.

> For instance,

> n_components = 10
> pca = PCA(n_components)

> is still more readable, I think, than,

> pca = PCA(n_components=n_components)
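
For readers unfamiliar with the mechanism under vote: keyword-only arguments
are plain Python 3 syntax, sketched below on a hypothetical estimator (not
actual scikit-learn code):

class MyEstimator:
    # Everything after the bare * must be passed by keyword.
    def __init__(self, n_components=2, *, tol=1e-4, max_iter=100):
        self.n_components = n_components
        self.tol = tol
        self.max_iter = max_iter

MyEstimator(10, tol=1e-3)   # fine: the frequent first argument stays positional
# MyEstimator(10, 1e-3)     # TypeError: too many positional arguments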
-- 
Gael Varoquaux
Research Director, INRIA 
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] scikit-learn website and documentation

2019-08-22 Thread Gael Varoquaux
Hi everyone,

One thing to keep in mind with regards to technical solutions is that it is much
easier if they play well with Sphinx. In other words, fully fledged frameworks
tend to be harder to slot in.

One tool that I really like is Pure CSS (https://purecss.io/), because it is
very lightweight (it's only CSS, as the name suggests).

Gaël

Sent from my phone. Please forgive typos and briefness.

On Aug 22, 2019, at 16:21, Nicolas Hug wrote:
>Hi Chiara,
>
>Thanks for giving it a shot! I think we can end-up with a nice result
>with this theme too.
>
>Is this something you'd like to work on more seriously in the future,
>or
>just something to get you started on scikit-learn in general?
>(Basically, should Andy still be looking for a web-designer?)
>
>
>Nicolas
>
>On 8/18/19 9:43 AM, Chiara Marmo wrote:
>> Dear list, dear devs,
>>
>> I have started to look at the scikit-learn documentation: talking
>with
>> developers in July here in Paris, it seemed that you all are
>concerned
>> by a reorganization / relooking of the doc and nothing is better than
>
>> a naïve beginner to check the effectiveness of a doc ... right? ;)
>> ... and diving into the docs allows me to better familiarize with the
>
>> project... :)
>>
>> As CSS and HTML5 are a bit more fun than reStructuredText I've
>started
>> to play with styling together with the sphinx contents ... contents
>> ask for more focusing ... I will be more serious on that starting
>from
>> September ...
>>
>> From the styling point of view, I am a big fan of this one [1] that
>> you probably already know and I tried to apply those amazing styles
>to
>> sphinx documentation.
>> //
>> Then I saw the Andreas tweet [2] ... and decided to stop by to sum
>up.
>>
>> I've prepared a mock-up for the webpage available here [3].
>> It's a standard build of the doc with the editorial [4] styling
>(basic
>> customization needs improvements). The code is available here [5].
>> I've tried not to move too far from the original visual ... because
>> history is important, especially when you have one! :)
>>
>> I've focused on the homepage so don't expect big modifications in the
>
>> doc itself. This is just a proof of concept. If you think I'm on the
>> right track let me know... I will be happy to be useful on that.
>> If not, or you already have someone taking care of that, please let
>me
>> know too and I will find something else to do.
>>
>> Thanks for reading me.
>>
>> Best,
>>
>> Chiara
>>
>> [1] https://html5up.net/
>> [2] https://twitter.com/amuellerml/status/1161298913841885184
>> [3] https://cmarmo.github.io/mockup-skl/
>> [4] https://html5up.net/editorial
>> [5] https://github.com/cmarmo/scikit-learn
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
>
>___
>scikit-learn mailing list
>scikit-learn@python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] cross-validated MANOVA

2019-08-01 Thread Gael Varoquaux
You should ask the authors of the paper.

Best,

Gaël

On Wed, Jul 31, 2019 at 09:22:10AM +0800, charujing123 wrote:
> Dear experts and users,
> Does anyone know how to perform cross-validated multivariate analysis of
> variance? This is the paper mentioned this method "Searchlight-based
> multi-voxel pattern analysis of fMRI by cross-validated MANOVA".
> Thanks.
> Rujing

> 2019-07-31
> ━━━
> charujing123

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Research Director, INRIA 
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Semi-supervised learning contributions

2019-06-18 Thread Gael Varoquaux
Hi Jonathan,

This is very interesting. However, the bar in terms of quality and scope
of scikit-learn is very high. The best way to move forward is to build a
package outside of scikit-learn, possibly in scikit-learn-contrib, and
maybe in the longer run, consider contributing some methods to
scikit-learn.

In terms of what algorithms and method can go in scikit-learn, the
criteria are written here:
https://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
Keep in mind that new methods cannot go in: scikit-learn is only for very
established work.

Cheers,

Gaël

On Tue, Jun 18, 2019 at 02:43:00PM +0200, Jonatan Gøttcke wrote:
> Hi Scikit-Learn developers.

> I’m a masters student in Data Mining and Machine Learning from Denmark. I’m
> finishing my masters in two months time.

> As a part of my thesis I’m comparing a theoretically sound and interesting
> semi-supervised algorithm I came up with, to a bunch of other algorithms. The
> problem was, that all of these basic graph based algorithms didn’t exist in
> scikit-learn, so I’ve implemented them following a scikit-learn like
> ”methodology”, but it’s not 100% compatible. Compatibility will require a bit
> more work, but depending on how much it is (I don’t know because I’ve never
> contributed to scikit-learn before), I might have time to put that in before
> the deadline (or I could do it as a part of my PhD, and get that as an easy
> first publication). I’m sure you are more interested in getting the basic
> graph algorithms in there than my own interesting (and still unpublished)
> methods.


> I would however love to hear if you guys are interested in getting this
> contribution into scikit-learn.?



> Cheers
> Jonatan M. Gøttcke

> CEO @ OpGo
> +45 23 65 01 96




> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA 
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Highly cited paper - causal random forests

2019-05-25 Thread Gael Varoquaux
Causal forest are a very nice work. However, they deal with causal
inference, rather than prediction. Hence, I am not really sure how we
could implement them in the API of scikit-learn. Do you have a
suggestion?

Cheers,

Gaël


On Fri, May 24, 2019 at 05:21:50PM -0400, Randy Ellis wrote:
> Would this be difficult for a moderate user to implement in sklearn by
> modifying the existing code base?

> Estimation and Inference of Heterogeneous Treatment Effects using Random
> Forests

> 342 citations in less than a year (Google Scholar): https://
> amstat.tandfonline.com/doi/full/10.1080/01621459.2017.1319839

> "In this article, we develop a nonparametric causal forest for estimating
> heterogeneous treatment effects that extends Breiman’s widely used random
> forest algorithm. In the potential outcomes framework with unconfoundedness, 
> we
> show that causal forests are pointwise consistent for the true treatment 
> effect
> and have an asymptotically Gaussian and centered sampling distribution. We 
> also
> discuss a practical method for constructing asymptotic confidence intervals 
> for
> the true treatment effect that are centered at the causal forest estimates. 
> Our
> theoretical results rely on a generic Gaussian theory for a large family of
> random forest algorithms. To our knowledge, this is the first set of results
> that allows any type of random forest, including classification and regression
> forests, to be used for provably valid statistical inference. In experiments,
> we find causal forests to be substantially more powerful than classical 
> methods
> based on nearest-neighbor matching, especially in the presence of irrelevant
> covariates."
-- 
Gael Varoquaux
Senior Researcher, INRIA 
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] ANN: scikit-learn 0.21 released

2019-05-16 Thread Gael Varoquaux
Indeed!

Great improvements. And it's a pleasure to see that the releases are more
frequent: a huge value to the community.

Gaël

On Thu, May 16, 2019 at 10:21:09AM +0200, bertrand.thirion wrote:
> Congratulations !
> Bertrand 



> Envoyé depuis mon smartphone Samsung Galaxy.

>  Message d'origine 
> De : Joel Nothman 
> Date : 16/05/2019 10:03 (GMT+01:00)
> À : Scikit-learn user and developer mailing list 
> Objet : [scikit-learn] ANN: scikit-learn 0.21 released

> Thanks to the work of many, many contributors, we have released Scikit-learn
> 0.21. It is available from GitHub, PyPI and Conda-forge, but is not yet
> available on the Anaconda defaults channel.

> * Documentation at https://scikit-learn.org/0.21
> * Release Notes at https://scikit-learn.org/0.21/whats_new
> * Download source or wheels at https://pypi.org/project/scikit-learn/0.21rc2/
> * Install from conda-forge with `conda install -c conda-forge scikit-learn`

> Highlights include:
> * neighbors.NeighborhoodComponentsAnalysis for supervised metric learning,
> which learns a weighted euclidean distance for k-nearest neighbors.
> https://scikit-learn.org/0.21/modules/neighbors.html#nca
> * ensemble.HistGradientBoostingClassifier
> and ensemble.HistGradientBoostingRegressor: experimental implementations of
> efficient binned gradient boosting machines.
> https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
> * impute.IterativeImputer: an experimental API for a non-trivial approach to
> missing value imputation.
> https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
> * cluster.OPTICS: a new density-based clustering algorithm.
> https://scikit-learn.org/0.21/modules/clustering.html#optics
> * better printing of estimators as strings, with an option to hide default
> parameters for compactness:
> https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
> * for estimator and library developers: a way to tag your estimator so that it
> can be treated appropriately with check_estimator.
> https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags

> There are many other enhancements and fixes listed in the release notes
> (https://scikit-learn.org/0.21/whats_new).
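
A quick, hedged taste of the first highlight above, NCA-based metric learning
feeding a k-NN classifier (an illustrative sketch against the 0.21 API, not
part of the original announcement):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import (KNeighborsClassifier,
                               NeighborhoodComponentsAnalysis)
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Learn a linear transformation, then classify with k-NN in that space.
nca_knn = make_pipeline(NeighborhoodComponentsAnalysis(random_state=0),
                        KNeighborsClassifier(n_neighbors=3))
nca_knn.fit(X_train, y_train)
print(nca_knn.score(X_test, y_test))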

> Please note that Scikit-learn has new dependencies. It requires:
> * joblib >= 0.11, which used to be vendored within Scikit-learn
> * OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 when the code is
> compiled (and cythonized)
> * Python >= 3.5. Installing Scikit-learn from Python 2 will continue to 
> provide
> version 0.20.

> Thanks again to everyone who contributed and to our sponsors, who helped us to
> develop such a great set of features and fixes since version 0.20 in under 8
> months.

> Happy Learning!

> From the Scikit-learn team.

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA 
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Release Candidate for Scikit-learn 0.21

2019-05-02 Thread Gael Varoquaux
Thank you all and congratulations indeed.

Because this release comes soon after the latest one from the 0.20
series, we might have thought that it would be a light one. But no!
Plenty of exciting features!

Gaël

On Wed, May 01, 2019 at 10:13:02PM -0400, Andreas Mueller wrote:
> Thank you for all the amazing work y'all!


> On 4/30/19 10:09 PM, Joel Nothman wrote:

> PyPI now has source and binary releases for Scikit-learn 0.21rc2.

> * Documentation at https://scikit-learn.org/0.21
> * Release Notes at https://scikit-learn.org/0.21/whats_new
> * Download source or wheels at https://pypi.org/project/scikit-learn/0.21rc2/

> Please try out the software and help us edit the release notes before a
> final release.

> Highlights include:
> * neighbors.NeighborhoodComponentsAnalysis for supervised metric learning,
> which learns a weighted euclidean distance for k-nearest neighbors.
> https://scikit-learn.org/0.21/modules/neighbors.html#nca
> * ensemble.HistGradientBoostingClassifier
> and ensemble.HistGradientBoostingRegressor: experimental implementations of
> efficient binned gradient boosting machines.
> https://scikit-learn.org/0.21/modules/ensemble.html#gradient-tree-boosting
> * impute.IterativeImputer: a non-trivial approach to missing value
> imputation.
> https://scikit-learn.org/0.21/modules/impute.html#multivariate-feature-imputation
> * cluster.OPTICS: a new density-based clustering algorithm.
> https://scikit-learn.org/0.21/modules/clustering.html#optics
> * better printing of estimators as strings, with an option to hide default
> parameters for compactness:
> https://scikit-learn.org/0.21/auto_examples/plot_changed_only_pprint_parameter.html
> * for estimator and library developers: a way to tag your estimator so that
> it can be treated appropriately with check_estimator.
> https://scikit-learn.org/0.21/developers/contributing.html#estimator-tags

> There are many other enhancements and fixes listed in the release notes
> (https://scikit-learn.org/0.21/whats_new).

> Please note that Scikit-learn has new dependencies:
> * joblib >= 0.11, which used to be vendored within Scikit-learn
> * OpenMP, unless the environment variable SKLEARN_NO_OPENMP=1 when the 
> code
> is compiled (and cythonized)

> Happy Learning!

> From the Scikit-learn core dev team.


> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn



> _______
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA 
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Any other clustering algo cluster incrementally?

2019-04-30 Thread Gael Varoquaux
On Tue, Apr 30, 2019 at 04:48:09PM +0800, lampahome wrote:
> I read this :  https://scikit-learn.org/0.15/modules/scaling_strategies.html

> There's only one clustering algo that clusters incrementally, that is
> minibatch kmeans.

The documentation that you are pointing to refers to version 0.15. If you
look at the current page on scaling, you will see that there is another
clustering algorithm that works incrementally:
https://scikit-learn.org/stable/modules/computing.html#strategies-to-scale-computationally-bigger-data

Best,

Gaël
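
For reference, a minimal out-of-core sketch (toy streaming data, added here):
MiniBatchKMeans exposes partial_fit, and the current page referred to above
lists Birch alongside it as the incremental clustering options:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
km = MiniBatchKMeans(n_clusters=3, random_state=0)
for _ in range(10):                 # pretend the data arrives in chunks
    chunk = rng.rand(100, 2)
    km.partial_fit(chunk)
print(km.cluster_centers_)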
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Gael Varoquaux
On Wed, Apr 03, 2019 at 08:54:51AM -0400, Andreas Mueller wrote:
> If the loss decomposes, the result might be different b/c of different test
> set sizes, but I'm not sure if they are "worse" in some way?

Mathematically, a cross-validation estimates a double expectation: one
expectation on the model (ie the train data), and another on the test
data (see for instance eq 3 in
https://europepmc.org/articles/pmc5441396, sorry for the self citation,
this is seldom discussed in the literature).

The correct way to compute this double expectation is by averaging first
inside the fold and second across the folds. Other ways of computing
errors estimate other quantities, that are harder to study mathematically
and not comparable to objects studied in the literature.

Another problem with cross_val_predict is that some people use metrics
like correlation (which is a terrible metric and does not decompose
across folds). It will then pick up things like correlations across
folds.

All these problems are made worse when data are not iid, and hence folds
risk not being iid.

G
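
To make the distinction concrete, a small illustrative sketch (toy data,
added here): the per-fold average computed by cross_val_score versus a single
score on pooled cross_val_predict output; with unequal folds or
non-decomposable metrics the two numbers differ:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = load_diabetes(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

per_fold = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2").mean()
pooled = r2_score(y, cross_val_predict(Ridge(), X, y, cv=cv))
print("mean of per-fold R2:", round(per_fold, 3))
print("R2 on pooled predictions:", round(pooled, 3))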
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Consultation: eligibility of inclusion of speed-up?

2019-02-26 Thread Gael Varoquaux
I need core devs opinion (please, only core devs, I am sending this on
the public ML for transparency):

The following PR adds a speed up for expansion of polynomial kernels:
https://github.com/scikit-learn/scikit-learn/pull/13003

According to the author, the speed up is significant (needs to be
verified during a code review).

The paper is a bit below citation level for inclusion of a new method,
however this can be seen as a speed up of Nystrom. Strictly speaking, it
is not just a speed-up, as it introduces a new estimator.

The discussion on the PR is short and quickly reviews the relevant
literature.


My question: should we consider this as acceptable for inclusion
(provided that it does show significant speedups with good prediction
accuracy)? I am asking to know if we start the review and inclusion
process or not.


Cheers,


Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Sprint discussion points?

2019-02-20 Thread Gael Varoquaux
On Tue, Feb 19, 2019 at 06:16:20PM -0500, Andreas Mueller wrote:
> I put a draft schedule here:
> https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events#technical-discussions-schedule

I'd like to discuss sample_props. They are important to me.

Should I add them somewhere on the schedule? Maybe in a place where
people who care about them (AFAIK Joel and Alex also do) are available?

Gaël

> it's obviously somewhat opinionated ;)
> Happy to reprioritize.
> Basically I wouldn't like to miss any of the big API discussions because of
> coming late to the party.

> The two things on (grid?) searches are somewhat related: one is about
> specifying search-spaces, the other about executing a given search space
> efficiently. They probably warrant separate discussions.

> I haven't added plotting or sample props on it, which are maybe two other
> discussion points.
> I tried to cover most controversial things from the roadmap.

> Not sure if discussing the schedule via the mailing list is the best way? 
> Don't
> want to create even more traffic  than I already am ;)

> On 2/19/19 5:48 PM, Guillaume Lemaître wrote:

> > Not sure if Guillaume had ideas about the schedule, given that he seems
> to be running the show?

> Mostly running behind the show ...

> For the moment, we only have a 30 minutes presentation of introduction
> planned on Monday.
> For the rest of the week, this is pretty much opened and I think that we
> can propose a schedule such that we can be efficient.
> IMO, two meetings of an hour per day look good to me.

> Shall we prioritize the list of the issues? Maybe that some issues could 
> be
> packed together.
> I would not be against having a rough schedule on the wiki directly and I
> think that having it before Monday could be better.

> Let me know how I can help.

> On Tue, 19 Feb 2019 at 22:23, Andreas Mueller  wrote:

> Yeah, sounds good.
> I didn't want to unilaterally post a schedule, but doing some google
> form or similar seems a bit heavy-handed?
> Not sure if Guillaume had ideas about the schedule, given that he 
> seems
> to be running the show?

> On 2/19/19 4:17 PM, Joel Nothman wrote:

> I don't think optics requires a large meeting, just a few people.

> I'm happy with your proposal generally, Andy. Do we schedule
> specific topics at this point?


> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Fw: Inclusion of an LSTM Classifier

2019-02-17 Thread Gael Varoquaux
Hi,

Thank you for the suggestion.

Such an approach is a deep-learning approach, and is out-of-scope for
scikit-learn:
https://scikit-learn.org/stable/faq.html#why-is-there-no-support-for-deep-or-reinforcement-learning-will-there-be-support-for-deep-or-reinforcement-learning-in-scikit-learn

Best,

Gaël

On Sun, Feb 17, 2019 at 09:18:04AM +, Yash Raj Rai wrote:


> ━━━
> From: Yash Raj Rai
> Sent: Sunday, February 17, 2019 2:34 PM
> To: scikit-learn@python.org
> Subject: Inclusion of an LSTM Classifier


> Hello

> I wanted to know if there are any on-going projects on LSTM Classifier model 
> in
> sklearn. If no, is there any possibility of its inclusion in the library?

> Is there anything beyond the contributor's guidelines that I need to know for
> the introduction of a new model?

> Thank you.



> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] VOTE: scikit-learn governance document

2019-02-11 Thread Gael Varoquaux


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Non-core developers at the sprint

2019-01-16 Thread Gael Varoquaux
Dear users and developers,

We have a sprint coming up in Paris Feb 25th to March 1st:
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events

Looking at the list of people who are coming, I am noticing that we have
mostly core developers. While the priority of the sprint is to work on
the big picture rather than onboarding, I am worried that there might be
some self-selection happening. I am sure that some excellent people, who
are contributors yet not core contributors could come.

I would like to invite people who already have contributed and want to
get more involved in the project to contact us to join the sprint.
Specifically, we are willing to fund accommodation and travel for one or
two participants.

Please send a short message to Guillaume Lemaître
 and myself presenting what you have
contributed and what you would like to contribute, as well as your
funding needs. We will curate this list and core contributors will settle
on who we can accommodate.

Cheers,

Gaël

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Next Sprint

2019-01-10 Thread Gael Varoquaux
On Thu, Jan 10, 2019 at 12:54:09PM -0500, Andreas Mueller wrote:
> Do you or anyone in your team has cycles to do that?

I asked Guillaume Lemaître to do it. He has started.

Gaël

> I certainly don't, but I could try to delegate (to the single person I
> delegate everything to ;)

> On 1/10/19 12:36 PM, Gael Varoquaux wrote:
> > On Thu, Jan 10, 2019 at 12:32:17PM -0500, Andreas Mueller wrote:
> > > Any sprint specific funding you're thinking of? Google gave in the past, 
> > > right?
> > I was thinking of PSF.

> > Gaël
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Next Sprint

2019-01-10 Thread Gael Varoquaux
On Thu, Jan 10, 2019 at 12:32:17PM -0500, Andreas Mueller wrote:
> Any sprint specific funding you're thinking of? Google gave in the past, 
> right?

I was thinking of PSF.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Next Sprint

2019-01-10 Thread Gael Varoquaux
On Wed, Jan 09, 2019 at 02:09:58PM -0500, Andreas Mueller wrote:
> Gaël, does the foundation have funds and do you want to use them?
> And/or do you/INRIA have funds you want to use?

Neither I nor Inria has funds to use outside the foundation. The
foundation can commit money if needed. We tend to prefer spending it on
paying senior people to work on the project, as it is the bottleneck (we
are still recruiting, by the way), but such a sprint is important.

We will also apply for sprint-specific funding sources. If we can
lighten-up your budget, so that you can pay awesome people to work on the
project, it is a good thing.

Gaël



> On 1/7/19 4:38 PM, Gael Varoquaux wrote:
> > Hi everybody and happy new year,

> > We let this thread about the sprint die. I hope that this didn't change
> > people's plans.

> > So, it seems that the week of Feb 25th is a good week. I'll assume that
> > it's good for most and start planning from there (if it's not the case,
> > let me know).

> > I've started our classic sprint-planing wiki page:
> > https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events
> > It's not rocket science, but it's better than an email thread to keep
> > information together.

> > It would be great if people could add their name, and if they need
> > funding. We need to evaluate if we need to find funding.

> > Also, it's quite soon, so maybe it would be good to start planning
> > accommodation and travel :$.

> > Cheers,

> > Gaël

> > On Sat, Dec 22, 2018 at 05:27:39PM +0100, Guillaume Lemaître wrote:
> > > Works for me as well.
> > > Sent from my phone - sorry to be brief and potential misspell.

> > >    Original Message
> > > From: scikit-learn@python.org
> > > Sent: 22 December 2018 17:17
> > > To: scikit-learn@python.org
> > > Reply to: rth.yurc...@pm.me; scikit-learn@python.org
> > > Cc: rth.yurc...@pm.me
> > > Subject: Re: [scikit-learn] Next Sprint
> > > That works for me as well.
> > > On 21/12/2018 16:00, Olivier Grisel wrote:
> > > > Ok for me. The last 3 weeks of February are fine for me.
> > > > Le jeu. 20 déc. 2018 à 21:21, Alexandre Gramfort
> > > > mailto:alexandre.gramf...@inria.fr>> a 
> > > > écrit :
> > > >   ok for me
> > > >   Alex
> > > >   On Thu, Dec 20, 2018 at 8:35 PM Adrin  > > >   <mailto:adrin.jal...@gmail.com>> wrote:
> > > >    >
> > > >    > It'll be the least favourable week of February for me, but I 
> > > > can
> > > >   make do.
> > > >    >
> > > >    > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller  > > >   <mailto:t3k...@gmail.com>> wrote:
> > > >    >>
> > > >    >> Works for me!
> > > >    >>
> > > >    >> On 12/19/18 5:33 PM, Gael Varoquaux wrote:
> > > >        >> > I would propose  the week of Feb 25th, as I heard people say
> > > >   that they
> > > >    >> > might be available at this time. It is good for many people,
> > > >   or should we
> > > >    >> > organize a doodle?
> > > >    >> >
> > > >    >> > G
> > > >    >> >
> > > >    >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller 
> > > > wrote:
> > > >    >> >> Can we please nail down dates for a sprint?
> > > >    >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote:
> > > >    >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel 
> > > > wrote:
> > > >    >> >>>> We can also do Paris in April / May or June if that's ok
> > > >   with Joel and better
> > > >    >> >>>> for Andreas.
> > > >    >> >>> Absolutely.
> > > >    >> >>> My thoughts here are that I want to minimize 
> > > > transportation,
> > > >   partly
> > > >    >> >>> because flying has a large carbon footprint. Also, for
> > > >   personal reasons,
> > > >    >> >>> I am not sure that I will be able to make it to Austin in
> > > >   July, but I
> > > >    >> >>> realize that this is a pretty bad argument.
> 

Re: [scikit-learn] Next Sprint

2019-01-07 Thread Gael Varoquaux
Hi everybody and happy new year,

We let this thread about the sprint die. I hope that this didn't change
people's plans.

So, it seems that the week of Feb 25th is a good week. I'll assume that
it's good for most and start planning from there (if it's not the case,
let me know).

I've started our classic sprint-planing wiki page:
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events 
It's not rocket science, but it's better than an email thread to keep
information together.

It would be great if people could add their name, and if they need
funding. We need to evaluate if we need to find funding.

Also, it's quite soon, so maybe it would be good to start planning
accommodation and travel :$.

Cheers,

Gaël

On Sat, Dec 22, 2018 at 05:27:39PM +0100, Guillaume Lemaître wrote:
> Works for me as well. 

> Sent from my phone - sorry to be brief and potential misspell.


>   Original Message  
> From: scikit-learn@python.org
> Sent: 22 December 2018 17:17
> To: scikit-learn@python.org
> Reply to: rth.yurc...@pm.me; scikit-learn@python.org
> Cc: rth.yurc...@pm.me
> Subject: Re: [scikit-learn] Next Sprint

> That works for me as well.

> On 21/12/2018 16:00, Olivier Grisel wrote:
> > Ok for me. The last 3 weeks of February are fine for me.

> > Le jeu. 20 déc. 2018 à 21:21, Alexandre Gramfort 
> > mailto:alexandre.gramf...@inria.fr>> a écrit :

> > ok for me

> > Alex

> > On Thu, Dec 20, 2018 at 8:35 PM Adrin  > <mailto:adrin.jal...@gmail.com>> wrote:
> >  >
> >  > It'll be the least favourable week of February for me, but I can
> > make do.
> >  >
> >  > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller  > <mailto:t3k...@gmail.com>> wrote:
> >  >>
> >  >> Works for me!
> >  >>
> >  >> On 12/19/18 5:33 PM, Gael Varoquaux wrote:
> >  >> > I would propose  the week of Feb 25th, as I heard people say
> > that they
> >  >> > might be available at this time. It is good for many people,
> > or should we
> >  >> > organize a doodle?
> >  >> >
> >  >> > G
> >  >> >
> >  >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote:
> >  >> >> Can we please nail down dates for a sprint?
> >  >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote:
> >  >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:
> >  >> >>>> We can also do Paris in April / May or June if that's ok
> > with Joel and better
> >  >> >>>> for Andreas.
> >  >> >>> Absolutely.
> >  >> >>> My thoughts here are that I want to minimize transportation,
> > partly
> >  >> >>> because flying has a large carbon footprint. Also, for
> > personal reasons,
> >  >> >>> I am not sure that I will be able to make it to Austin in
> > July, but I
> >  >> >>> realize that this is a pretty bad argument.
> >  >> >>> We're happy to try to host in Paris whenever it's most
> > convenient and to
> >  >> >>> try to help with travel for those not in Paris.
> >  >> >>> Gaël
> >  >> >>> ___
> >  >> >>> scikit-learn mailing list
> >  >> >>> scikit-learn@python.org <mailto:scikit-learn@python.org>
> >  >> >>> https://mail.python.org/mailman/listinfo/scikit-learn
> >  >> >> ___
> >  >> >> scikit-learn mailing list
> >  >> >> scikit-learn@python.org <mailto:scikit-learn@python.org>
> >  >> >> https://mail.python.org/mailman/listinfo/scikit-learn
> >  >>
> >  >> ___
> >  >> scikit-learn mailing list
> >  >> scikit-learn@python.org <mailto:scikit-learn@python.org>
> >  >> https://mail.python.org/mailman/listinfo/scikit-learn
> >  >
> >  > ___
> >  > scikit-learn mailing list
> >  > scikit-learn@python.org <mailto:scikit-learn@python.org>
> >  > https://mail.python.org/mailman/listinfo/scikit-learn
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org <mailto:scikit-learn@python.org>
> > https://mail.python.org/mailman/listinfo/scikit-learn


> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS?

2018-12-15 Thread Gael Varoquaux
I suspect that it is probably due to the linear-algebra libraries: your
scientific Python install on macOS is probably using optimized
linear-algebra (ie optimized numpy and scipy), but not your install on
Windows.

I would recommend you to look at how you installed you Python
distribution on macOS and on Windows, as you likely have installed an
optimized one on one of the platforms and not on the other.
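
A quick way to compare the two installs (standard NumPy introspection,
nothing platform-specific):

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against; a plain
# reference BLAS on one machine and MKL/OpenBLAS/Accelerate on the other
# would explain a 4x gap in MLP training time.
np.show_config()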

Cheers,

Gaël

On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote:
> Hi everyone,

> I am writing a scikit-learn program to use MLPClassifier to learn
> Fashion-MNIST.
> The following is the program. It's very simple.
> When I ran it on Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) note book, it took
> about 4 minutes.
> However, when I ran it on MacBook(macOS), it took about 1 minutes.
> Does anyone help me to understand the reason why Windows 10 is so slow?
> Am I missing something?

> Thanks,  

> import os
> import gzip
> import numpy as np
>
> # from https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
> def load_mnist(path, kind='train'):
>     labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
>     images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
>     with gzip.open(labels_path, 'rb') as lbpath:
>         labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
>     with gzip.open(images_path, 'rb') as imgpath:
>         images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16)
>     images = images.reshape(len(labels), 784)
>     return images, labels
>
> x_train, y_train = load_mnist('data', kind='train')
> x_test, y_test = load_mnist('data', kind='t10k')
>
> from sklearn.neural_network import MLPClassifier
> import time, datetime
>
> print(datetime.datetime.today())
> start = time.time()
> mlp = MLPClassifier()
> mlp.fit(x_train, y_train)
> print((time.time() - start) / 60)


> ---
> MATSUDA, Kouichi, Ph.D.

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Agglomerative clustering

2018-12-09 Thread Gael Varoquaux
> I want to impose an additional constraint. When 2 clusters are combined and 
> the
> cost of combination is equal for multiple cluster pairs, I want to choose the
> pair for which the combined cluster has the least size.

> What is the cleanest and easiest way of achieving this?

I don't think that the public API enables you to do that. So I think that
you are going to have to modify the code, and modify the cost heapq to
make it a tuple of "(distance, size)".

Unfortunately, when doing this, you'll be on your own, as we cannot
provide support for modified code.
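
The tie-breaking idea in miniature (a pure-Python illustration, not
scikit-learn's actual merge code): heapq orders tuples lexicographically, so
a (distance, size) key prefers the smaller combined cluster when distances
tie:

import heapq

heap = []
# (merge cost, size of the merged cluster, cluster pair)
heapq.heappush(heap, (0.5, 7, ("A", "B")))
heapq.heappush(heap, (0.5, 3, ("C", "D")))   # same cost, smaller result
heapq.heappush(heap, (0.2, 9, ("E", "F")))

while heap:
    print(heapq.heappop(heap))
# pops (0.2, 9, ...) first, then (0.5, 3, ...) before (0.5, 7, ...)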

Cheers,

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] New core dev: Adrin Jalali

2018-12-08 Thread Gael Varoquaux
Indeed, welcome Adrin, and thanks a lot for your investment on the
package!

Gaël

On Sat, Dec 08, 2018 at 09:26:15AM -0500, Andreas Mueller wrote:
> Congratulations and welcome Adrin!

> On 12/5/18 5:32 PM, Joel Nothman wrote:

> The Scikit-learn core development team has welcomed a new member, Adrin
> Jalali, who has been doing some really amazing work in contributing code
> and reviews since July (aside from occasional contributions since 2014).
> Congratulations and welcome, Adrin!


> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn



> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories

2018-11-23 Thread Gael Varoquaux
On Wed, Nov 21, 2018 at 11:35:11AM -0500, Andreas Mueller wrote:
> The question for this particular issue for me is also "what are good
> benchmark datasets".
> In dirty cat you used dirty categories, which is a subset of all
> high-cardinality categorical
> variables.
> Whether "clean" high cardinality variables like zip-codes or dirty ones are
> the better
> benchmark is a bit unclear to me, and I'm not aware of a wealth of datasets
> for either :-/

Fair point. We'll have a look to see what we can find. We're open to
suggestions, from you or from anyone else.

G
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-21 Thread Gael Varoquaux
On Wed, Nov 21, 2018 at 09:47:13AM -0500, Andreas Mueller wrote:
> The PR is over a year old already, and you hadn't voiced any opposition
> there.

My bad, sorry. Given the name, I had not guessed the link between the PR
and encoding of categorical features. I find myself very much in
agreement with the original issue and its discussion:
https://github.com/scikit-learn/scikit-learn/issues/5853, which raises
concerns about the name and the importance of at least considering prior
smoothing. I do not see these reflected in the PR.


In general, the fact that there is not much literature on this implies
that we should be benchmarking our choices. The more I understand kaggle,
the less I think that we can fully use it as an inclusion argument:
people do transforms that end up to be very specific to one challenge. On
the specific problem of categorical encoding, we've tried to do
systematic analysis of some of these, and were not very successful
empirically (eg hashing encoding). This is not at all a vote against
target encoding, which our benchmarks showed was very useful, but just a
push for benchmarking PRs, in particular when they do not correspond to
well cited work (which is our standard inclusion criterion).

Joris has just accepted to help with benchmarking. We can have
preliminary results earlier. The question really is: out of the different
variants that exist, which one should we choose. I think that it is a
legitimate question that arises on many of our PRs.

But in general, I don't think that we should rush things because of
deadlines. Consequences of a rush are that we need to change things after
merge, which is more work. I know that it is slow, but we are quite a
central package.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote:

> On 11/20/18 4:43 PM, Gael Varoquaux wrote:
> > We are planning to do heavy benchmarking of those strategies, to figure
> > out the tradeoffs. But we won't get to it before February, I am afraid.
> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder

I'd rather not. Or rather, I'd rather have some benchmarks on it (it
doesn't have to be us that does it).

> I would really like to add it before February

A few months to get it right is not that bad, is it?

> and it's pretty established.

Are there good references studying it? If there is a clear track of study,
it falls under the usual rules, and should go in.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote:
> I would love to see the TargetEncoder ported to scikit-learn.
> The CountFeaturizer is pretty stalled:
> https://github.com/scikit-learn/scikit-learn/pull/9614

So would I. But there are several ways of doing it:

- the naive way is not the right one: just computing the average of y
  for each category leads to overfitting quite fast

- it can be done cross-validated, splitting the train data, in a
  "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)

- it can be done using empirical-Bayes shrinkage, which is what we
  currently do in dirty_cat.

We are planning to do heavy benchmarking of those strategies, to figure
out the tradeoffs. But we won't get to it before February, I am afraid.
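
To make the first and third options concrete, a toy sketch (hypothetical
smoothing weight m; dirty_cat's actual implementation differs):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"cat": rng.choice(list("abcde"), size=50),
                   "y": rng.rand(50)})

# 1. naive per-category mean of y -- overfits rare categories
naive = df.groupby("cat")["y"].mean()

# 3. empirical-Bayes-style shrinkage toward the global mean
m, global_mean = 10.0, df["y"].mean()
counts = df.groupby("cat")["y"].size()
shrunk = (counts * naive + m * global_mean) / (counts + m)
print(pd.DataFrame({"naive": naive, "shrunk": shrunk}))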

> Have you benchmarked the other encoders in the category_encoding lib?
> I would be really curious to know when/how they help.

We did (part of the results are in the publication), and we didn't
have great success.

Gaël

> On 11/20/18 3:58 PM, Gael Varoquaux wrote:
> > Hi scikit-learn friends,

> > As you might have seen on twitter, my lab -with a few friends- has
> > embarked on research to ease machine learning on "dirty data". We are
> > experimenting on new encoding methods for non-curated string categories.
> > For this, we are developing a small software project called "dirty_cat":
> > https://dirty-cat.github.io/stable/

> > dirty_cat is a test bed for new ideas of "dirty categories". It is a
> > research project, though we still try to do decent software engineering
> > :). Rather than contributing to existing codebases (as the great
> > categorical-encoding project in scikit-learn-contrib), we spun it out
> > in a separate software project to have the freedom to try out ideas that
> > we might give up after gaining insight.

> > We hope that it is a useful tool: if you have non-curated string
> > categories, please give it a try. Understanding what works and what does
> > not is important to know what to consolidate. Hopefully one day we can
> > develop a tool that is of wide-enough interest that it can go in
> > scikit-learn-contrib, or maybe even scikit-learn.

> > Also, if you have suggestions of publicly available databases that we try
> > it upon, we would love to hear from you.

> > Cheers,

> > Gaël

> > PS: if you want to work on dirty-data problems in Paris as a post-doc or
> > an engineer, send me a line
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
Hi scikit-learn friends,

As you might have seen on twitter, my lab -with a few friends- has
embarked on research to ease machine learning on "dirty data". We are
experimenting on new encoding methods for non-curated string categories.
For this, we are developing a small software project called "dirty_cat":
https://dirty-cat.github.io/stable/

dirty_cat is a test bed for new ideas of "dirty categories". It is a
research project, though we still try to do decent software engineering
:). Rather than contributing to existing codebases (as the great
categorical-encoding project in scikit-learn-contrib), we spun it out
in a separate software project to have the freedom to try out ideas that
we might give up after gaining insight.

We hope that it is a useful tool: if you have non-curated string
categories, please give it a try. Understanding what works and what does
not is important to know what to consolidate. Hopefully one day we can
develop a tool that is of wide-enough interest that it can go in
scikit-learn-contrib, or maybe even scikit-learn.

Also, if you have suggestions of publicly available databases that we try
it upon, we would love to hear from you.

Cheers,

Gaël

PS: if you want to work on dirty-data problems in Paris as a post-doc or
an engineer, send me a line
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Next Sprint

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:
> We can also do Paris in April / May or June if that's ok with Joel and better
> for Andreas.

Absolutely.

My thoughts here are that I want to minimize transportation, partly
because flying has a large carbon footprint. Also, for personal reasons,
I am not sure that I will be able to make it to Austin in July, but I
realize that this is a pretty bad argument.

We're happy to try to host in Paris whenever it's most convenient and to
try to help with travel for those not in Paris.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Next Sprint

2018-11-15 Thread Gael Varoquaux
On Thu, Nov 15, 2018 at 09:14:30AM -0500, Andreas Mueller wrote:
> Are there any plans for a next sprint (in Paris/Europe?)?

We're happy to host one in Paris, ideally in second half of February.

Gaël

> OpenML would like to join us next time and I think that would be cool.
> But they (and I ;) need some advance planning.

> I also have some funding that I could use for this as well.

> Cheers,

> Andy

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] make all new parameters keyword-only?

2018-11-15 Thread Gael Varoquaux
On Thu, Nov 15, 2018 at 08:59:08AM -0500, Andreas Mueller wrote:
> I could try to see if people use positional arguments and where. No promise on
> timeline though.

If someone, you or someone else, does that, it would be very useful.

> I think there is little harm in doing it for new parameters while we figure
> this out, though?

Totally!

Gaël


> On Thu, 15 Nov 2018 at 20:34, Gael Varoquaux 
>  > wrote:

> I am really in favor of the general idea: it is much better to use
> named
> arguments for everybody (for readability, and to be less dependent on
> parameter ordering).

> However, I would maintain that we need to move slowly with backward
> compatibility: changing in a backward-incompatible way a library 
> brings
> much more loss than benefit to our users.

> So +1 for enforcing the change on all new arguments, but -1 for
> changing
> orders in the existing arguments any time soon.

> I agree that it would be good to push this change in existing models.
> We
> should probably announce it strongly well in advance, make sure that
> all
> our examples are changed (people copy-paste), wait a lot, and find a
> moment to squeeze this in.

> Gaël

> On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote:
> > We could just announce that we will be making this a syntactic
> constraint from
> > version X and make the change wholesale then. It would be less 
> formal
> backwards
> > compatibility than we usually hold by, but we already are loose with
> parameter
> > ordering when adding new ones.

> > It would be great if after this change we could then reorder
> parameters to make
> > some sense!

> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] make all new parameters keyword-only?

2018-11-15 Thread Gael Varoquaux
I am really in favor of the general idea: it is much better to use named
arguments for everybody (for readability, and to be less dependent on
parameter ordering).

However, I would maintain that we need to move slowly with backward
compatibility: changing in a backward-incompatible way a library brings
much more loss than benefit to our users.

So +1 for enforcing the change on all new arguments, but -1 for changing
orders in the existing arguments any time soon.

I agree that it would be good to push this change in existing models. We
should probably announce it strongly well in advance, make sure that all
our examples are changed (people copy-paste), wait a lot, and find a
moment to squeeze this in.
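
One possible shape for such a transition (a hypothetical helper sketched
here, not an existing scikit-learn utility): warn, rather than fail, when
callers pass soon-to-be-keyword-only arguments positionally:

import functools
import warnings

def warn_extra_positional(n_allowed):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if len(args) > n_allowed:
                warnings.warn("Pass arguments beyond the first %d as "
                              "keywords; positional use will be removed."
                              % n_allowed, FutureWarning)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@warn_extra_positional(1)
def fit_model(X, copy=True, tol=1e-4):
    return X

fit_model([[0.0]], True)   # emits a FutureWarning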

Gaël

On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote:
> We could just announce that we will be making this a syntactic constraint from
> version X and make the change wholesale then. It would be less formal 
> backwards
> compatibility than we usually hold by, but we already are loose with parameter
> ordering when adding new ones.

> It would be great if after this change we could then reorder parameters to 
> make
> some sense!

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Error with Kfold cross vailidation

2018-10-24 Thread Gael Varoquaux
>   kf = KFold(data.shape[0], n_splits=5)
> TypeError: __init__() got multiple values for argument 'n_splits'

Don't specify data.shape[0], this is no longer necessary in the recent
versions of scikit-learn.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Using Scikit-learn graphics for AI workshop, NOKIOS conference

2018-10-19 Thread Gael Varoquaux
On Fri, Oct 19, 2018 at 09:52:18AM +0200, fabian dietrichson wrote:
> However, I noticed that your images are protected with copy rights, and I’m
> asking if I’m allowed to use your illustration for this purpose?

Which images specifically do you have in mind?

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-10-02 Thread Gael Varoquaux
On Tue, Oct 02, 2018 at 12:20:40PM -0400, Andreas Mueller wrote:
> I think having solution is to have MS, FB, Amazon, IBM, Nvidia, intel,...
> maintain our generic persistent code is a decent deal for us if it works out 
> ;)

> https://onnx.ai/

I'll take that deal! :)

+1 for onnx, absolutely!

G
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-10-02 Thread Gael Varoquaux
On Fri, Sep 28, 2018 at 09:45:16PM +0100, Javier López wrote:
> This is not the whole truth. Yes, you store the sklearn version on the pickle
> and raise a warning; I am mostly ok with that, but the pickles are brittle and
> oftentimes they stop loading when other versions of other stuff change. I am
> not talking about "Warning: wrong version", but rather "Unpickling error:
> expected bytes, found tuple" that prevent the file from loading entirely.
> [...]
> 1. Things in the current state break when something else changes, not only
> sklearn.
> 2. Sharing pickles is a bad practice due to a number of reasons.

The reason that pickles are brittle and that sharing pickles is a bad
practice is that pickle use an implicitly defined data model, which is
defined via the internals of objects.

The "right" solution is to use an explicit data model. This is for
instance what is done with an object database. However, this comes at the
cost of making it very hard to change objects. First, all objects must be
stored with a schema (or language) that is rich enough to represent it,
and yet defined somewhat explicitly (to avoid running into the problems
of pickle). Second, if the internal representation of the object change,
there needs to be explicit conversion code to go from one version to the
next. Typically, upgrade of websites that use object database need
maintainers to write this conversion code.


So, the problems of pickle are not specific to pickle, but rather
intrinsic to any generic persistence code [*]. Writing persistence code that
does not fall in these problems is very costly in terms of developer time
and makes it harder to add new methods or improve existing one. I am not
excited about it.

Rather, the good practice is that if you want to deploy model you deploy
on the exact same environment that you have trained them. The web world
is very used to doing that (because they keep falling in these problems),
and has developed technology to do this, such as docker containers. I
know that it is clunky technology. I don't like it myself, but I don't
see a way out of it with our resources.

Gaël

[*] Back in the days, when I was working on Mayavi, we developed our
persistence code, because we were not happy with pickle. It was not
pleasant to maintain, and had the same "smell" as pickle. I don't think
that it was a great use of our time.

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Full time jobs to work on scikit-learn in Paris

2018-09-28 Thread Gael Varoquaux
Dear list,

I am very happy to announce that the Inria foundation is looking to hire
two persons to work with the scikit-learn in France:

* One Community and Operation Officer:
  https://scikit-learn.fondation-inria.fr/job_coo/
  We need a good mix of communication, organizational, and technical skills 
  to help the team and the community work best together

* One Performance and Quality Engineer:
  https://scikit-learn.fondation-inria.fr/en/job_performance/
  We need someone who care about tests, continuous integration and
  performance, to help making scikit-learn faster will guaranteeing that
  it stays as solid as it is.

Please forward this announcement to anyone who might be interested.

Best,

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-26 Thread Gael Varoquaux
Hurray, thanks to everybody; in particular for those who did the hard
work of ironing out the last issues and releasing.

Gaël

On Wed, Sep 26, 2018 at 02:55:57PM -0400, Andreas Mueller wrote:
> Hey everbody!
> I'm happy to (finally) announce scikit-learn 0.20.0.
> This release is dedicated to the memory of Raghav Rajagopalan.

> You can upgrade now with pip or conda!

> There is many important additions and updates, and you can find the full
> release notes here:
> http://scikit-learn.org/stable/whats_new.html#version-0-20

> My personal highlights are the ColumnTransformer and the changes to
> OneHotEncoder,
> but there's so much more!

> An important note is that this is the last version to support Python2.7, and
> the
> next release will require Python 3.5.

> A big thank you to everybody who contributed and special thanks to Joel!

> All the best,
> Andy
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Announcing scikit-learn at Inria foundation

2018-09-17 Thread Gael Varoquaux
Hi scikit-learn community,

I am very happy to announce a foundation to support scikit-learn at
Inria:
https://scikit-learn.fondation-inria.fr

In practice, this gives us a legal vessel to receive money from private
entities, and not only research money. This money will provide a stable
job to some people working on scikit-learn here, at Inria, and allow us
to grow the team and target more ambitious features as well as better
quality and hopefully more frequent releases.

I have written a blog post about the motivations and the vision behind
this new development:
http://gael-varoquaux.info/programming/a-foundation-for-scikit-learn-at-inria.html

Thank you all for being part of the scikit-learn adventure. I am very
excited about the new prospects that this is bringing us,

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Optimization algorithms in scikit-learn

2018-09-04 Thread Gael Varoquaux
This is out of the scope of scikit-learn, which is a toolkit meant to be
used for easier machine learning. Optimization is a component of machine
learning, but not one that is readily-useable by itself.

Gaël

On Tue, Sep 04, 2018 at 12:45:09PM -0600, Touqir Sajed wrote:
> Hi Andreas,

> Is there a particular reason why there is no general purpose optimization
> module? Most of the optimizers (atleast the first order methods) are general
> purpose since you just need to feed the gradient. In some special cases, you
> probably need problem specific formulation for better performance. The
> advantage of SVRG is that you don't need to store the gradients which costs a
> storage of order number_of_weights*number_of_samples which is the main problem
> with SAG and SAGA. Thus, for most neural network models (and even non-NN
> models) using SAG and SAGA is infeasible on personal computers. 

> SVRG is not popular in deep learning community but it should be noted that 
> SVRG
> is different from Adam since it does not tune the step size. Just to clarify,
> SVRG can be faster than Adam since it decreases the variance to achieve a
> similar convergence rate as full batch methods while being computationally
> cheap like SGD/Adam. However, one can combine both methods to obtain an even
> faster algorithm.

> Cheers,
> Touqir
> *

> On Tue, Sep 4, 2018 at 11:46 AM Andreas Mueller  wrote:

> Hi Touqir.
> We don't usually implement general purpose optimizers in
> scikit-learn, in particular because usually different optimizers
> apply to different kinds of problems.
> For linear models we have SAG and SAGA, for neural nets we have adam.
> I don't think the authors claim to be faster than SAG, so I'm not sure 
> what
> the
> motivation would be for using their method.

> Best,
> Andy


> On 09/04/2018 12:55 PM, Touqir Sajed wrote:

> Hi,

> I have been looking for stochastic optimization algorithms in
> scikit-learn that are faster than SGD and so far I have come across
> Adam and momentum. Are there other methods implemented in 
> scikit-learn?
> Particularly, the variance reduction methods such as SVRG (https://
> papers.nips.cc/paper/
> 
> 4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf
> ) ? These variance reduction methods are the current state of the art
> in terms of convergence speed while maintaining runtime complexity of
> order n -- number of features. If they are not implemented yet, I 
> think
> it would be really great to implement(I am happy to do so) them since
> nowadays working on large datasets(where LBGFS may not be practical) 
> is
> the norm where the improvements are definitely worth it.

> Cheers,
> Touqir
-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] ANN Scikit-learn 0.20rc1 release candidate available

2018-09-01 Thread Gael Varoquaux
Thanks to everybody involved! This is big!

Gaël

On Fri, Aug 31, 2018 at 09:26:39PM -0400, Andreas Mueller wrote:
> Hey Folks!

> I'm happy to announce that the scikit-learn 0.20 release candidate 1 is now
> available via conda-forge and pip.
> Please help us by testing this release candidate so we can make sure the
> final release will go seamlessly!

> You can install the release candidate from conda-forge using

> conda install scikit-learn=0.20rc1 -c conda-forge/label/rc -c conda-forge

> (please take into account that if you're using the default conda channel
> otherwise, this will pull in some other
> dependencies from conda-forge).

> You can install the release candidate via pip using

> pip install --pre scikit-learn

> The documentation for 0.20 is available at

> http://scikit-learn.org/0.20/

> and will move to http://scikit-learn.org/ upon final release.

> You can find the release note with all new features and changes here:

> http://scikit-learn.org/0.20/whats_new.html#version-0-20

> Thank you for your help in testing the RC and thank you to everybody that
> made the release possible!

> All the best,

> Andy

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Small suggestion for documentation

2018-08-07 Thread Gael Varoquaux
I think that the vocabulary mismatch comes from the fact that you are looking 
at these terms thinking about in sample statistics, while they are used here in 
the context of prediction. I think that in the context of prediction, these are 
the right terms.

Cheers,

Gaël

⁣Sent from my phone. Please forgive typos and briefness.​

On Aug 7, 2018, 10:40, at 10:40, Fellype via scikit-learn 
 wrote:
>Dear maintainers,I've just known scikit-learn and found it very useful.
>Congratulations for this library.
>I found some confuse terms to describe r2_score parameters in
>documentation [1]. For me, the meanings of y_true and y_pred are not
>clear. From [1]:- y_true: ... Ground truth (correct) target values-
>y_pred: ... Estimated target values
>Since the R^2 value is usually used to compare the behavior of
>experimental data (observed) with a theoretical model or standard data
>(expected), I guess that it would be better to change the description
>of y_true and y_pred to something like:- y_true: ... Observed (or
>measured) target values- y_pred: ... Expected (or theoretical) target
>values
>I also think that the same should be done in documentation of other
>scikit-learn functions that use the y_true and y_pred terms with the
>same meaning.
>
>Thanks for your attention and best wishes.
>Fellype
>
>[1]
>http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
>
>
>
>___
>scikit-learn mailing list
>scikit-learn@python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Query about an algorithm

2018-07-31 Thread Gael Varoquaux
You'll find generic optimization algorithms in scipy.optimize, and not in
scikit-learn.

Best,

Gaël

On Tue, Jul 31, 2018 at 11:49:15PM +, Shantanu Bhattacharya via 
scikit-learn wrote:
> Hello,

> I am new to this mailing list. I would like to understand the algorithms
> provided.

> Is second order gradient descent with hessian error matrix supported by this
> library?

> I went through the documentation, but did not find it. Are you able to confirm
> or direct me to some place that might have it?

> Look forward to your thoughts

> Kind regards
> Shantanu

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Suggestion to update the code for Segmenting the picture of Lena in regions

2018-07-28 Thread Gael Varoquaux
You are looking at an old version of the documentation. In the up to date 
documentation, the picture has been replaced:
http://scikit-learn.org/stable/auto_examples/cluster/plot_face_segmentation.html


⁣Sent from my phone. Please forgive typos and briefness.​

On Jul 29, 2018, 05:51, at 05:51, Rajkiran Veldur  
wrote:
> Hello Team,
>
>I have been following scikit-learn closely these days as I have been
>working on different machine learning algorithms. Thank you for making
>everything so simple. Your documents could be followed even by novice.
>
>Now, when I was working with spectral clustering, I found your example
>of  *Segmenting
>the picture of Lena in regions *intuitive and wanted to try it.
>
>However, scipy has removed the scipy.misc.lena() module from their
>library,
>due to licensing issues.
>
>So, I request you to please update the code with any other image.
>
>Regards,
>Rajkiran Veldur
>
>
>
>
>___
>scikit-learn mailing list
>scikit-learn@python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] What is the FeatureAgglomeration algorithm?

2018-07-25 Thread Gael Varoquaux
No.

⁣Sent from my phone. Please forgive typos and briefness.​

On Jul 26, 2018, 07:28, at 07:28, Raphael C  wrote:
>Is it expected that all three linkages options should give the same
>result
>in my toy example?
>
>Raphael
>
>On Thu, 26 Jul 2018 at 06:20 Gael Varoquaux
>
>wrote:
>
>> FeatureAgglomeration uses the Ward, complete linkage, or average
>linkage,
>> algorithms, depending on the choice of "linkage". These are well
>> documented in the literature, or on wikipedia.
>>
>> Gaël
>>
>> On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote:
>> > Hi,
>>
>> > I am trying to work out what, in precise mathematical terms,
>> > [FeatureAgglomeration][1] does and would love some help. Here is
>some
>> example
>> > code:
>>
>>
>> > import numpy as np
>> > from sklearn.cluster import FeatureAgglomeration
>> > for S in ['ward', 'average', 'complete']:
>> > FA = FeatureAgglomeration(linkage=S)
>> > print(FA.fit_transform(np.array([[-50,6,6,7,],
>[0,1,2,3]])))
>>
>> > This outputs:
>>
>> >
>>
>> > [[  6. -50.]
>> >  [  2.   0.]]
>> > [[  6. -50.]
>> >  [  2.   0.]]
>> > [[  6. -50.]
>> >  [  2.   0.]]
>>
>> > Is it possible to say mathematically how these values have been
>computed?
>>
>> > Also, what exactly does linkage do and why doesn't it seem to make
>any
>> > difference which option you choose?
>>
>> > Raphael
>>
>>
>> >   [1]: http://scikit-learn.org/stable/modules/generated/
>> > sklearn.cluster.FeatureAgglomeration.html
>>
>> > PS I also asked at
>> > https://stackoverflow.com/questions/51526616/
>> >
>>
>what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make
>>
>>
>> > ___
>> > scikit-learn mailing list
>> > scikit-learn@python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>> --
>> Gael Varoquaux
>> Senior Researcher, INRIA Parietal
>> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>> Phone:  ++ 33-1-69-08-79-68
>> http://gael-varoquaux.info
>http://twitter.com/GaelVaroquaux
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
>
>
>___
>scikit-learn mailing list
>scikit-learn@python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] What is the FeatureAgglomeration algorithm?

2018-07-25 Thread Gael Varoquaux
FeatureAgglomeration uses the Ward, complete linkage, or average linkage,
algorithms, depending on the choice of "linkage". These are well
documented in the literature, or on wikipedia.

Gaël

On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote:
> Hi,

> I am trying to work out what, in precise mathematical terms,
> [FeatureAgglomeration][1] does and would love some help. Here is some example
> code:


>     import numpy as np
>     from sklearn.cluster import FeatureAgglomeration
>     for S in ['ward', 'average', 'complete']:
>         FA = FeatureAgglomeration(linkage=S)
>         print(FA.fit_transform(np.array([[-50,6,6,7,], [0,1,2,3]])))

> This outputs:

>    

>     [[  6. -50.        ]
>      [  2.           0.        ]]
>     [[  6. -50.        ]
>      [  2.           0.        ]]
>     [[  6. -50.        ]
>      [  2.           0.        ]]

> Is it possible to say mathematically how these values have been computed?

> Also, what exactly does linkage do and why doesn't it seem to make any
> difference which option you choose?

> Raphael


>   [1]: http://scikit-learn.org/stable/modules/generated/
> sklearn.cluster.FeatureAgglomeration.html

> PS I also asked at 
> https://stackoverflow.com/questions/51526616/
> what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make


> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] RFE with logistic regression

2018-07-25 Thread Gael Varoquaux
On Wed, Jul 25, 2018 at 12:36:55PM +0200, Benoît Presles wrote:
> Do you think the problems I have can come from correlated features? Indeed,
> in my dataset I have some highly correlated features.

Yes, in general selecting features conditionally on others is very hard
when features are highly correlated.

> Do you think this could explain why I don't get reproducible and consistent
> results?

Yes.

> Thanks for your help,
> Ben


> Le 24/07/2018 à 23:44, bthirion a écrit :
> > Univariate screening is somewhat hackish too, but much more stable --
> > and cheap.
> > Best,

> > Bertrand

> > On 24/07/2018 23:33, Benoît Presles wrote:
> > > So you think that I cannot get reproducible and consistent results
> > > with this method ?
> > > If you would avoid RFE, which method do you suggest to find the best
> > > features ?

> > > Ben


> > > Le 24/07/2018 à 21:34, Gael Varoquaux a écrit :
> > > > On Tue, Jul 24, 2018 at 08:43:27PM +0200, Benoît Presles wrote:
> > > > > 3. With C=1, it seems that I have the same results at each run for all
> > > > > solvers (liblinear, sag and saga), however the ranking is not the same
> > > > > between the solvers.
> > > > Your problem is probably ill-conditioned, hence the specific weights on
> > > > the features are not stable. There isn't a good answer to ordering
> > > > features, they are degenerate.

> > > > In general, I would avoid RFE, it is a hack, and can easily lead
> > > > to these
> > > > problems.

> > > > Gaël

> > > > > Thanks for your help,
> > > > > Ben

> > > > > PS1: I checked and n_iter_ seems to be always lower than max_iter.
> > > > > PS2: my data is scaled, I am using "StandardScaler".


> > > > > Le 24/07/2018 à 20:33, Andreas Mueller a écrit :

> > > > > > On 07/24/2018 02:07 PM, Benoît Presles wrote:
> > > > > > > I did the same tests as before adding fit_intercept=False and:
> > > > > > > 1. I have got the same problem as before, i.e. when I execute the
> > > > > > > RFE multiple times I don't get the same ranking each time.
> > > > > > > 2. When I change the solver to 'sag'
> > > > > > > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, 
> > > > > > > max_iter=1,
> > > > > > > fit_intercept=False, solver='sag')), it seems that I get the same
> > > > > > > ranking at each run. This is not the case with the 'saga' solver.
> > > > > > > The ranking is not the same between the solvers.
> > > > > > > 3. With C=1, it seems that I have the same results at each run for
> > > > > > > all solvers (liblinear, sag and saga), however the ranking is not
> > > > > > > the same between the solvers.

> > > > > > > How can I get reproducible and consistent results?
> > > > > > Did you scale your data? If not, saga and sag will basically fail.
> > > > > > ___
> > > > > > scikit-learn mailing list
> > > > > > scikit-learn@python.org
> > > > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > > > ___
> > > > > scikit-learn mailing list
> > > > > scikit-learn@python.org
> > > > > https://mail.python.org/mailman/listinfo/scikit-learn

> > > ___
> > > scikit-learn mailing list
> > > scikit-learn@python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn


> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] RFE with logistic regression

2018-07-24 Thread Gael Varoquaux
On Tue, Jul 24, 2018 at 08:43:27PM +0200, Benoît Presles wrote:
> 3. With C=1, it seems that I have the same results at each run for all
> solvers (liblinear, sag and saga), however the ranking is not the same
> between the solvers.

Your problem is probably ill-conditioned, hence the specific weights on
the features are not stable. There isn't a good answer to ordering
features, they are degenerate.

In general, I would avoid RFE, it is a hack, and can easily lead to these
problems.

Gaël

> Thanks for your help,
> Ben


> PS1: I checked and n_iter_ seems to be always lower than max_iter.
> PS2: my data is scaled, I am using "StandardScaler".



> Le 24/07/2018 à 20:33, Andreas Mueller a écrit :


> > On 07/24/2018 02:07 PM, Benoît Presles wrote:
> > > I did the same tests as before adding fit_intercept=False and:

> > > 1. I have got the same problem as before, i.e. when I execute the
> > > RFE multiple times I don't get the same ranking each time.

> > > 2. When I change the solver to 'sag'
> > > (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=1,
> > > fit_intercept=False, solver='sag')), it seems that I get the same
> > > ranking at each run. This is not the case with the 'saga' solver.
> > > The ranking is not the same between the solvers.

> > > 3. With C=1, it seems that I have the same results at each run for
> > > all solvers (liblinear, sag and saga), however the ranking is not
> > > the same between the solvers.


> > > How can I get reproducible and consistent results?

> > Did you scale your data? If not, saga and sag will basically fail.
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] NEP: Random Number Generator Policy

2018-06-19 Thread Gael Varoquaux
On Mon, Jun 18, 2018 at 11:34:38PM -0700, Robert Kern wrote:
> However, I'd really appreciate it if I could get some
> kind of feedback from a scikit-learn dev,

I didn't read the NEP, only your summary. That said, it seems quite
reasonably aligned with our practice, and hence shouldn't pose a problem.
Ideally, I believe that in the long run it should enable us to have
cleaner / more robust code, but I suspect that it will take a while
before we get there.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] (no subject)

2018-05-24 Thread Gael Varoquaux
On Thu, May 24, 2018 at 09:35:00PM +0530, aijaz qazi wrote:
> scikit- multi learn is misleading.

Yes, but I am not sure what scikit-learn should do about this.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] DBScan freezes my computer !!!

2018-05-16 Thread Gael Varoquaux
On Wed, May 16, 2018 at 01:44:17PM -0400, Andreas Mueller wrote:
> Should we have "low memory"/batched version of k_neighbors_graph and
> epsilon_neighbors_graph functions? I assume
> those instantiate the dense matrix right now.

+1!

It shouldn't be too hard to do.

G
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Announcing IMPAC: an IMaging-PsychiAtry Challenge, using data-science to predict autism from brain imaging

2018-05-05 Thread Gael Varoquaux

Dear colleagues,

It is my pleasure to announce IMPAC: an IMaging-PsychiAtry Challenge,
using data-science to predict autism from brain imaging.

https://paris-saclay-cds.github.io/autism_challenge/

This is a machine-learning challenge on brain-imaging data to achieve the
best prediction of autism spectrum disorder diagnostic status. We are
providing the largest cohort so far to learn such predictive biomarkers,
with more than 2000 individuals.

There is a total of 9000 euros of prices to win for the best prediction.
The prediction quality will be measured on a large hidden test set, to
ensure fairness.

We provide a simple starting kit to serve as a proof of feasibility. We
are excited to see what the community will come up with in terms of
predictive models and of score.

Best,

Gaël

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] NearestNeighbors without replacement

2018-04-03 Thread Gael Varoquaux
Matching to minimize a cost is known as the linear assignment problem,
can be solved in n^3 cost, and is implemented in scikit-learn in
sklearn.utils.linear_assignment_.linear_assignment or in recent versions
of scipy as scipy.optimize.linear_sum_assignment

Of course, this problem will require much more coding (you need to build
your pairwise cost matrix) and much more computing cost (n^3 instead of
n^2) than a standard nearest-neighbor.

Gaël

On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
> Hi Jake,

> Thanks for the reply. Yes, trying this out resulted from looking for ways in
> python to implement propensity score matching. I found a package, pscore_match
> (http://www.kellieottoboni.com/pscore_match/), but the matching was really
> terrible. Specifically, I'm matching based on age, race, gender, HIV status,
> hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for
> matching performed WAY better, I was so surprised at how well every factor was
> matched for. The only issue is that it uses replacement. 

> Here's what I'm currently testing. I need each case to match to 20 controls, 
> so
> since NearestNeighbors uses replacement, I'm matching each case to many
> controls (15000), taking all of the distances for all of the pairs, and
> retaining only the smallest distances for each control. Since many controls 
> are
> re-used (since the algorithm uses replacement), the hope is that enough
> controls are matched to many different cases so that each case ends up being
> matched to 20 unique controls. Does this method make sense??

> Best,

> Randy  

> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <jake...@cs.washington.edu>
> wrote:

> On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljel...@gmail.com>
> wrote:

> Hello to the Scikit-learn community!

> I am doing case-control matching for an electronic health records
> study. My question is, is it possible to run Sklearn's 
> NearestNeighbors
> function without replacement? As in, match the treated group to the
> untreated group without re-using any of the untreated group data
> points? If so, how? By default, it uses replacement. I know this
> because I tested it on some data of mine.

> The code I used is in the confirmed answer here: https://
> stats.stackexchange.com/questions/206832/matched-pai
> rs-in-python-propensity-score-matching

> Thanks so much in advance,


> No, pairwise matching without replacement is not implemented within
> scikit-learn's nearest neighbors routines.

> It seems like an algorithm you would have to think carefully about because
> the number of potential pairs grows exponentially with the number of
> points, and I don't think it's true that choosing the nearest available
> neighbor of points in sequence will guarantee you to find the optimal
> configuration. You'd also have to carefully define what you mean by
> "optimal"... are you seeking to minimize the sum of all distances? The sum
> of squared distances? The maximum distance? The results would change
> depending on the metric you define. And you'd probably have to figure out
> some way to reduce the exponential search space in order to calculate the
> result in a reasonable amount of time for your data.

> You might look into the literature on propensity score matching; I think
> that's one area where this kind of neighbors-without-replacement algorithm
> is often used.

> Best,
>    Jake
>  
-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Handle uncertainties in NMF

2018-02-13 Thread Gael Varoquaux
Hi Samuël,

On Tue, Feb 13, 2018 at 05:42:54PM +0100, Samuël Weber wrote:
> I was wondering if handling uncertainties in NMF would be possible.
> Indeed, in NMF we minimize a Frobenius norm ||X - WH||², so we may
> quite easily minimize ||(X - WH) / U||², with U the matrix of
> uncertainty.

You can divide your data X by U, run the standard matrix factorization
solver, and multiply the resulting matrix H by U and you'll get the
result that you want.

Best,

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Gael Varoquaux
On Sun, Jan 28, 2018 at 08:29:58PM +1100, Joel Nothman wrote:
> I can't say it's especially obvious that these features available, and
> improvements to the documentation are welcome, but CountVectorizer is
> complicated enough and we would rather avoid more parameters if we can.

Same feeling here. I am afraid of the crowing effect that makes it harder
and harder to find things as we add them.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] MLPClassifier as a feature selector

2017-12-29 Thread Gael Varoquaux
I think that a transform method would be good. We would have to add a parameter 
to the constructor to specify which layer is used for the transform. It should 
default to "-1", in my opinion.

Cheers,

Gaël

⁣Sent from my phone. Please forgive typos and briefness.​

On Dec 29, 2017, 17:48, at 17:48, "Javier López"  wrote:
>Hi Thomas,
>
>it is possible to obtain the activation values of any hidden layer, but
>the
>procedure is not completely straight forward. If you look at the code
>of
>the `_predict` method of MLPs you can see the following:
>
>```python
>def _predict(self, X):
>"""Predict using the trained model
>
>Parameters
>--
>X : {array-like, sparse matrix}, shape (n_samples, n_features)
>The input data.
>
>Returns
>---
>  y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
>The decision function of the samples for each class in the
>model.
>"""
>X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>
># Make sure self.hidden_layer_sizes is a list
>hidden_layer_sizes = self.hidden_layer_sizes
>if not hasattr(hidden_layer_sizes, "__iter__"):
>hidden_layer_sizes = [hidden_layer_sizes]
>hidden_layer_sizes = list(hidden_layer_sizes)
>
>layer_units = [X.shape[1]] + hidden_layer_sizes + \
>[self.n_outputs_]
>
># Initialize layers
>activations = [X]
>
>for i in range(self.n_layers_ - 1):
>activations.append(np.empty((X.shape[0],
> layer_units[i + 1])))
># forward propagate
>self._forward_pass(activations)
>y_pred = activations[-1]
>
>return y_pred
>```
>
>the line `y_pred = activations[-1]` is responsible for extracting the
>values for the last layer,
>but the `activations` variable contains the values for all the neurons.
>
>You can make this function into your own external method (changing the
>`self` attribute by
>a proper parameter) and add an extra argument which specifies the
>layer(s)
>that you want.
>I have done this myself in order to make an AutoEncoderNetwork out of
>the
>MLP
>implementation.
>
>This makes me wonder, would it be worth adding this to sklearn?
>A very simple way would be to refactor the `_predict` method, with the
>additional layer
>argument, to a new method `_predict_layer`, then we can have the
>`_predict`
>method
>simply call `_predict_layer(..., layer=-1)` and add a new method
>(perhaps a
>`transform`?)
>that allows to get (raveled) values for an arbitrary subset of the
>layers.
>
>I'd be happy to submit a PR if you guys think it would be interesting
>for
>the project.
>
>Javier
>
>
>
>On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis 
>wrote:
>
>> Greetings,
>>
>> I want to train a MLPClassifier with one hidden layer and use it as a
>> feature selector for an MLPRegressor.
>> Is it possible to get the values of the neurons from the last hidden
>layer
>> of the MLPClassifier to pass them as input to the MLPRegressor?
>>
>> If it is not possible with scikit-learn, is anyone aware of any
>> scikit-compatible NN library that offers this functionality? For
>example
>> this one:
>>
>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>
>> I wouldn't like to do this in Tensorflow because the MLP there is
>much
>> slower than scikit-learn's implementation.
>>
>
>
>
>
>___
>scikit-learn mailing list
>scikit-learn@python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Gael Varoquaux
With as few data points, there is a huge uncertainty in the estimation of
the prediction accuracy with cross-validation. This isn't a problem of
the method, is it a basic limitation of the small amount of data. I've
written a paper on this problem is the specific context of neuroimaging:
https://www.sciencedirect.com/science/article/pii/S1053811917305311
(preprint: https://hal.inria.fr/hal-01545002/).

I except that what you are seing in sampling noise: the result has
confidence intervals in large than 10%.

Gaël


On Tue, Dec 19, 2017 at 04:27:53PM -0500, Taylor, Johnmark wrote:
> Hello,

> I am a researcher in fMRI and am using SVMs to analyze brain data. I am doing
> decoding between two classes, each of which has 24 exemplars per class. I am
> comparing two different methods of cross-validation for my data: in one, I am
> training on 23 exemplars from each class, and testing on the remaining example
> from each class, and in the other, I am training on 22 exemplars from each
> class, and testing on the remaining two from each class (in case it matters,
> the data is structured into different neuroimaging "runs", with each "run"
> containing several "blocks"; the first cross-validation method is leaving out
> one block at a time, the second is leaving out one run at a time). 

> Now, I would've thought that these two CV methods would be very similar, since
> the vast majority of the training data is the same; the only difference is in
> adding two additional points. However, they are yielding very different
> results: training on 23 per class is yielding 60% decoding accuracy (averaged
> across several subjects, and statistically significantly greater than chance),
> training on 22 per class is yielding chance (50%) decoding. Leaving aside the
> particulars of fMRI in this case: is it unusual for single points (amounting 
> to
> less than 5% of the data) to have such a big influence on SVM decoding? I am
> using a cost parameter of C=1. I must say it is counterintuitive to me that
> just a couple points out of two dozen could make such a big difference.

> Thank you very much, and cheers,

> JohnMark

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Using scikit-learn HTML theme for Sphinx docs in another project

2017-12-04 Thread Gael Varoquaux
You're not infringing copyright (this is BSD-licensed). The only thing is
that we would like you to indicate clearly that the project is not
scikit-learn, so that we don't recieve support calls. For this, in
addition to text pointing it out, you should use a different logo and a
different icon the browser's tab.

Cheers,

Gaël

On Mon, Dec 04, 2017 at 02:09:09PM +0100, Iacopo Poli wrote:
> Hello everyone,

> I'm working on a project that is implemented following quite strictly the
> scikit-learn API and I would like to use the scikit-learn Sphinx theme for the
> docs.

> I would do that only if I don't infringe any copyright and whatsoever. What's
> your policy in this regard?

> Cheers,
> Iacopo

> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Rapid Outlier Detection via Sampling

2017-11-25 Thread Gael Varoquaux
Dear Orges,

I can see only 33 citations on Google scholar for this paper.

As detailed in the inclusion criteria of scikit-learn:
http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
I am afraid that we need many more citations to include this algorithm.

However, you could submit it for inclusion to scikit-learn-contrib:
http://contrib.scikit-learn.org/

Best,

Gaël

On Sat, Nov 25, 2017 at 07:34:42PM +0100, Orges Leka wrote:
> Dear scikit-learn Developers,

> My Name is Orges Leka and I would like to implement 
> "Rapid Outlier Detection via Sampling" [1] in scikit-learn.
> In R this method is already available [2] by the authors of the method.

> In Python I have not seen any implementation yet. The method is very simple 
> yet
> effective as the authors show. First one selects say 20 points. Then computes
> the shortest distance of all other points to these 20 points. This is the
> outlier-score for one specific point. 

> It would be nice to implement this with different metrics / distances (euclid,
> manhattan or other metrics) .

> How would I start the implementation? I have already git-cloned scikit-learn 
> on
> my pc. Do I need to write object oriented or are functions also ok?

> If this succeeds, I would also like to extend the "example-outliers" doc with
> the above method.

> Kind regards
> Dipl. Math. Orges Leka

> [1] https://papers.nips.cc/paper/
> 5127-rapid-distance-based-outlier-detection-via-sampling.pdf
> [2] https://github.com/mahito-sugiyama/sampling-outlier-detection


> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] New core devs: Hanmin Qin, Guillaume Lemaître, and Roman Yurchak

2017-11-09 Thread Gael Varoquaux
Hi scikit-learn community,

A week ago, we added 3 core developers, but I think that we forgot to
announce it. So let me please welcome on board Hanmin Qin, Guillaume
Lemaître, and Roman Yurchak. They have been very active in the
development of the project, and very helpful in the review process. It's
a pleasure to see the team growing.

Gaël

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] scikit-learn 0.19.1 is out!

2017-10-23 Thread Gael Varoquaux
Hurray! Great job; thanks to all involved!

Gaël

On Mon, Oct 23, 2017 at 12:23:11PM -0400, Andreas Mueller wrote:
> Hey everybody.

> We just released 0.19.1, fixing some issues and bugs in the last release.
> It's highly recommended you upgrade your installation. The new release is
> available via pip, conda (main) and conda-forge.

> A big thank you to everybody who contributed, in particular Joel
> (@jnothman)!

> The release includes several improvements and fixes to the model_selection
> and
> pipeline modules, and t-SNE.

> You can find the full changelog here:
> http://scikit-learn.org/stable/whats_new.html#version-0-19-1

> Happy learning!

> Andy
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] anti-correlated predictions by SVR

2017-09-26 Thread Gael Varoquaux
I took my example in classification for didactic purposes. My hypothesis still 
holds that the splitting of the data creates anti correlations between train 
and test (a depletion effect).

Basically , you shouldn't work with datasets that small.

Gaël

⁣Sent from my phone, please excuse typos and briefness​

On Sep 26, 2017, 18:51, at 18:51, Thomas Evangelidis <teva...@gmail.com> wrote:
>I have very small training sets (10-50 observations). Currently, I am
>working with 16 observations for training and 25 for validation
>(external
>test set). And I am doing Regression, not Classification (hence the SVR
>instead of SVC).
>
>
>On 26 September 2017 at 18:21, Gael Varoquaux
><gael.varoqu...@normalesup.org
>> wrote:
>
>> Hypothesis: you have a very small dataset and when you leave out
>data,
>> you create a distribution shift between the train and the test. A
>> simplified example: 20 samples, 10 class a, 10 class b. A
>leave-one-out
>> cross-validation will create a training set of 10 samples of one
>class, 9
>> samples of the other, and the test set is composed of the class that
>is
>> minority on the train set.
>>
>> G
>>
>> On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote:
>> > Greetings,
>>
>> > I don't know if anyone encountered this before, but sometimes I get
>> > anti-correlated predictions by the SVR I that am training. Namely,
>the
>> > Pearson's R and Kendall's tau are negative when I compare the
>> predictions on
>> > the external test set with the true values. However, the SVR
>predictions
>> on the
>> > training set have positive correlations with the experimental
>values and
>> hence
>> > I can't think of a way to know in advance if the trained SVR will
>produce
>> > anti-correlated predictions in order to change their sign and avoid
>the
>> > disaster. Here is an example of what I mean:
>>
>> > Training set predictions: R=0.452422, tau=0.33
>> > External test set predictions: R=-0.537420, tau-0.30
>>
>> > Obviously, in a real case scenario where I wouldn't have the
>external
>> test set
>> > I would have used the worst observation instead of the best ones.
>Has
>> anybody
>> > any idea about how I could prevent this?
>>
>> > thanks in advance
>> > Thomas
>> --
>> Gael Varoquaux
>> Researcher, INRIA Parietal
>> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>> Phone:  ++ 33-1-69-08-79-68
>> http://gael-varoquaux.info
>http://twitter.com/GaelVaroquaux
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
>
>--
>
>==
>
>Dr Thomas Evangelidis
>
>Post-doctoral Researcher
>CEITEC - Central European Institute of Technology
>Masaryk University
>Kamenice 5/A35/2S049,
>62500 Brno, Czech Republic
>
>email: tev...@pharm.uoa.gr
>
>  teva...@gmail.com
>
>
>website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
>
>
>___
>scikit-learn mailing list
>scikit-learn@python.org
>https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] anti-correlated predictions by SVR

2017-09-26 Thread Gael Varoquaux
Hypothesis: you have a very small dataset and when you leave out data,
you create a distribution shift between the train and the test. A
simplified example: 20 samples, 10 class a, 10 class b. A leave-one-out
cross-validation will create a training set of 10 samples of one class, 9
samples of the other, and the test set is composed of the class that is
minority on the train set.

G

On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote:
> Greetings,

> I don't know if anyone encountered this before, but sometimes I get
> anti-correlated predictions by the SVR I that am training. Namely, the
> Pearson's R and Kendall's tau are negative when I compare the predictions on
> the external test set with the true values. However, the SVR predictions on 
> the
> training set have positive correlations with the experimental values and hence
> I can't think of a way to know in advance if the trained SVR will produce
> anti-correlated predictions in order to change their sign and avoid the
> disaster. Here is an example of what I mean:

> Training set predictions: R=0.452422, tau=0.33
> External test set predictions: R=-0.537420, tau-0.30

> Obviously, in a real case scenario where I wouldn't have the external test set
> I would have used the worst observation instead of the best ones. Has anybody
> any idea about how I could prevent this?

> thanks in advance
> Thomas
-- 
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Terminating a Pipeline with a NearestNeighbors search

2017-09-13 Thread Gael Varoquaux
On Wed, Sep 13, 2017 at 02:45:41PM -0400, Andreas Mueller wrote:
> We could add a way to call non-standard methods, but I'm not sure that is the
> right way to go.
> (like pipeline.custom_method(X, method="kneighbors")). But that assumes that
> the method signature is X or (X, y).
> So I'm not sure if this is generally useful.

I don't see either why it's useful. We shouldn't add a method for
everything that can be easily coded with a few lines of Python. The nice
thing of Python is that it is such an expressive language.

Gaël
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


  1   2   >