Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote:

> On 11/20/18 4:43 PM, Gael Varoquaux wrote:
> > We are planning to do heavy benchmarking of those strategies, to figure
> > out the tradeoffs. But we won't get to it before February, I am afraid.
> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder

I'd rather not. Or rather, I'd rather have some benchmarks on it (it
doesn't have to be us that does it).

> I would really like to add it before February

A few months to get it right is not that bad, is it?

> and it's pretty established.

Are there good references studying it? If there is a clear track record of
study, it falls under the usual rules and should go in.

Gaël


Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller



On 11/20/18 4:43 PM, Gael Varoquaux wrote:

> We are planning to do heavy benchmarking of those strategies, to figure
> out the tradeoffs. But we won't get to it before February, I am afraid.

Does that mean you'd be opposed to adding the leave-one-out TargetEncoder
before you do this? I would really like to add it before February, and
it's pretty established.





Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller




On 11/20/18 4:16 PM, Gael Varoquaux wrote:

> - the naive way is not the right one: just computing the average of y
>   for each category leads to overfitting quite fast

> - it can be done cross-validated, splitting the train data, in a
>   "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)

This is called leave-one-out in the category_encoding library, I think,
and that's what my first implementation would be.
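
Roughly, the idea (a sketch only, not the category_encoding
implementation): each sample is encoded with the mean of y over the
*other* training samples of its category, so its own target never leaks
into its encoding:

import numpy as np

def loo_target_encode(categories, y):
    """Leave-one-out target encoding (sketch)."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    encoded = np.empty_like(y)
    for cat in np.unique(categories):
        mask = categories == cat
        n = mask.sum()
        if n == 1:
            # A lone sample has no "others": fall back to the global mean.
            encoded[mask] = y.mean()
        else:
            # Mean over the other samples: (category sum - own value) / (n - 1).
            encoded[mask] = (y[mask].sum() - y[mask]) / (n - 1)
    return encoded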


> - it can be done using empirical-Bayes shrinkage, which is what we
>   currently do in dirty_cat.

Reference / explanation?


> We are planning to do heavy benchmarking of those strategies, to figure
> out the tradeoffs. But we won't get to it before February, I am afraid.

aww ;)


Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote:
> I would love to see the TargetEncoder ported to scikit-learn.
> The CountFeaturizer is pretty stalled:
> https://github.com/scikit-learn/scikit-learn/pull/9614

So would I. But there are several ways of doing it:

- the naive way is not the right one: just computing the average of y
  for each category leads to overfitting quite fast

- it can be done cross-validated, splitting the train data, in a
  "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)

- it can be done using empirical-Bayes shrinkage, which is what we
  currently do in dirty_cat.
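
To make the last point concrete, here is a minimal sketch of the
shrinkage idea (not our actual dirty_cat code; m is a hypothetical
smoothing strength): each category's mean of y is pulled toward the
global mean, and rare categories are pulled hardest:

import numpy as np

def shrunk_target_encode(categories, y, m=10.0):
    """Empirical-Bayes-style shrinkage of per-category target means (sketch)."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    prior = y.mean()  # the global mean acts as the prior
    encoding = {}
    for cat in np.unique(categories):
        mask = categories == cat
        n = mask.sum()
        # weight -> 1 for frequent categories (trust their own mean),
        # weight -> 0 for rare ones; m = 0 recovers the naive
        # per-category mean that overfits.
        weight = n / (n + m)
        encoding[cat] = weight * y[mask].mean() + (1 - weight) * prior
    return np.array([encoding[c] for c in categories])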

We are planning to do heavy benchmarking of those strategies, to figure
out the tradeoffs. But we won't get to it before February, I am afraid.

> Have you benchmarked the other encoders in the category_encoding lib?
> I would be really curious to know when/how they help.

We did (part of the results are in the publication), and we didn't
have great success.

Gaël

> On 11/20/18 3:58 PM, Gael Varoquaux wrote:
> > Hi scikit-learn friends,

> > As you might have seen on twitter, my lab -with a few friends- has
> > embarked on research to ease machine learning on "dirty data". We are
> > experimenting with new encoding methods for non-curated string categories.
> > For this, we are developing a small software project called "dirty_cat":
> > https://dirty-cat.github.io/stable/

> > dirty_cat is a test bed for new ideas on "dirty categories". It is a
> > research project, though we still try to do decent software engineering
> > :). Rather than contributing to existing codebases (such as the great
> > categorical-encoding project in scikit-learn-contrib), we spun it out
> > in a separate software project to have the freedom to try out ideas that
> > we might give up after gaining insight.

> > We hope that it is a useful tool: if you have non-curated string
> > categories, please give it a try. Understanding what works and what does
> > not is important for knowing what to consolidate. Hopefully one day we can
> > develop a tool that is of wide-enough interest that it can go in
> > scikit-learn-contrib, or maybe even scikit-learn.

> > Also, if you have suggestions for publicly available databases that we
> > could try it on, we would love to hear from you.

> > Cheers,

> > Gaël

> > PS: if you want to work on dirty-data problems in Paris as a post-doc or
> > an engineer, drop me a line

-- 
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux


Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller

I would love to see the TargetEncoder ported to scikit-learn.
The CountFeaturizer is pretty stalled:
https://github.com/scikit-learn/scikit-learn/pull/9614

:-/

Have you benchmarked the other encoders in the category_encoding lib?
I would be really curious to know when/how they help.


On 11/20/18 3:58 PM, Gael Varoquaux wrote:

> Hi scikit-learn friends,

> As you might have seen on twitter, my lab -with a few friends- has
> embarked on research to ease machine learning on "dirty data". We are
> experimenting with new encoding methods for non-curated string categories.
> For this, we are developing a small software project called "dirty_cat":
> https://dirty-cat.github.io/stable/

> dirty_cat is a test bed for new ideas on "dirty categories". It is a
> research project, though we still try to do decent software engineering
> :). Rather than contributing to existing codebases (such as the great
> categorical-encoding project in scikit-learn-contrib), we spun it out
> in a separate software project to have the freedom to try out ideas that
> we might give up after gaining insight.

> We hope that it is a useful tool: if you have non-curated string
> categories, please give it a try. Understanding what works and what does
> not is important for knowing what to consolidate. Hopefully one day we can
> develop a tool that is of wide-enough interest that it can go in
> scikit-learn-contrib, or maybe even scikit-learn.

> Also, if you have suggestions for publicly available databases that we
> could try it on, we would love to hear from you.

> Cheers,

> Gaël

> PS: if you want to work on dirty-data problems in Paris as a post-doc or
> an engineer, drop me a line


[scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
Hi scikit-learn friends,

As you might have seen on twitter, my lab -with a few friends- has
embarked on research to ease machine learning on "dirty data". We are
experimenting with new encoding methods for non-curated string categories.
For this, we are developing a small software project called "dirty_cat":
https://dirty-cat.github.io/stable/

dirty_cat is a test bed for new ideas on "dirty categories". It is a
research project, though we still try to do decent software engineering
:). Rather than contributing to existing codebases (such as the great
categorical-encoding project in scikit-learn-contrib), we spun it out
in a separate software project to have the freedom to try out ideas that
we might give up after gaining insight.

We hope that it is a useful tool: if you have non-curated string
categories, please give it a try. Understanding what works and what does
not is important for knowing what to consolidate. Hopefully one day we can
develop a tool that is of wide-enough interest that it can go in
scikit-learn-contrib, or maybe even scikit-learn.

Also, if you have suggestions for publicly available databases that we
could try it on, we would love to hear from you.

Cheers,

Gaël

PS: if you want to work on dirty-data problems in Paris as a post-doc or
an engineer, drop me a line


Re: [scikit-learn] make all new parameters keyword-only?

2018-11-20 Thread Joris Van den Bossche
On Sun, 18 Nov 2018 at 11:14, Joel Nothman wrote:

> I think we're all agreed that this change would be a good thing.
>
> What we're not agreed on is how much risk we take by breaking legacy code
> that relied on argument order.
>

I think that, in principle, it could be possible to do this with a
deprecation warning. If we used a signature like the following:

class Model(BaseEstimator):
    def __init__(self, *args, param1=1, param2=2):
        ...

then we could in principle catch all positional args, raise a warning if
there are any, and, by inspecting the signature (as we now also do in
_get_param_names), set the appropriate parameters on self.
I think the main problem is that this would temporarily "allow" people to
pass a keyword argument that conflicts with a positional argument without
raising an error (as Python normally would for you), though they would
still get the warning.
And of course, it would violate the clean __init__ functions in
scikit-learn that do no validation.
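
Concretely, something like the following sketch (using the toy Model
above; the warning class and message are placeholders):

import inspect
import warnings

from sklearn.base import BaseEstimator

class Model(BaseEstimator):
    def __init__(self, *args, param1=1, param2=2):
        self.param1 = param1
        self.param2 = param2
        if args:
            warnings.warn(
                "Passing parameters positionally is deprecated; "
                "use keyword arguments instead.",
                FutureWarning,
            )
            # Recover the parameter names from the signature, in
            # declaration order, and bind the positional values to them.
            params = inspect.signature(type(self).__init__).parameters
            names = [name for name, p in params.items()
                     if p.kind is inspect.Parameter.KEYWORD_ONLY]
            for name, value in zip(names, args):
                # This silently overrides a conflicting keyword argument
                # instead of raising a TypeError as Python normally would.
                setattr(self, name, value)

Model(3) would then warn and set param1=3, while Model(3, param1=5)
would warn and end up with param1=3 -- exactly the conflict case
mentioned above.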

I personally don't know how big the impact of simply doing it as a
breaking change would be, but if we think it could be quite big, the
above might be worth considering (otherwise I wouldn't go through the
hassle).

Joris


>
> I'd argue that we've often already broken such code, and that at least now
> it will break with a TypeError rather than silent misbehaviour.
>
> And yet Sebastian's comment implies that there may be a whole raft of
> former MATLAB users writing code without kwargs. Is that a problem if now
> they get a TypeError?
>
> On Fri, 16 Nov 2018 at 16:23, Sebastian Raschka 
> wrote:
>
>> Also want to say that I really welcome this decision/change. Personally,
>> as far as I am aware, I've been trying to use keyword arguments consistently
>> for years, except for cases where it is really obvious, like .fit(X_train,
>> y_train), and I believe that it has really helped me write less
>> error-prone code/analyses.
>>
>> Thinking back to the times when I was using MATLAB, it was really clunky
>> and error-prone to import functions and keep track of the argument
>> order.
>>
>> Besides, keyword arguments definitely make code and documentation much
>> more readable (within and esp. across different package versions), despite
>> (or maybe because of) being more verbose.
>>
>> Best,
>> Sebastian
>>
>>
>>
>> > On Nov 15, 2018, at 10:18 PM, Brown J.B. via scikit-learn <
>> scikit-learn@python.org> wrote:
>> >
>> > As an end-user, I would strongly support the idea of future enforcement
>> > of keyword arguments for new parameters.
>> > In my group, we hold a standard that we develop APIs where _all_
>> > arguments must be given by keyword (a slightly pedantic style, but one
>> > that has shown its benefits).
>> > Initialization/call-time state checks are done by a class' internal
>> > methods.
>> >
>> > As Andy said, one could consider leaving prototypical X,y as
>> > positional, but one benefit my group has seen with full keyword
>> > parameterization is the ability to write code for small investigations
>> > where we are more concerned with effects from parameters than with the
>> > data (e.g., a fixed problem to model, where one wants to first see on the
>> > code line what the estimators and their parameterizations were).
>> > If one could shift the sklearn X,y to the back of a function call, it
>> > would enable all participants in a face-to-face code review session to
>> > quickly see the emphasis/context of the discussion and move to the
>> > conclusion faster.
>> >
>> > To satisfy keyword X,y as well, I would presume that the BaseEstimator
>> > would need a sanity check for error-raising default X,y values --
>> > though does it not have many checks on X,y already?
>> >
>> > Not sure if everyone else agrees about keyword X and y, but just a
>> > thought for consideration.
>> >
>> > Kind regards,
>> > J.B.
>> >
>> > On Thu, 15 Nov 2018 at 18:34, Gael Varoquaux wrote:
>> > I am really in favor of the general idea: it is much better to use named
>> > arguments for everybody (for readability, and to depend less on
>> > parameter ordering).
>> >
>> > However, I would maintain that we need to move slowly with backward
>> > compatibility: changing a library in a backward-incompatible way brings
>> > much more loss than benefit to our users.
>> >
>> > So +1 for enforcing the change on all new arguments, but -1 for changing
>> > orders in the existing arguments any time soon.
>> >
>> > I agree that it would be good to push this change to existing models. We
>> > should probably announce it loudly, well in advance, make sure that all
>> > our examples are changed (people copy-paste), wait a lot, and find a
>> > moment to squeeze this in.
>> >
>> > Gaël
>> >
>> > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote:
>> > > We could just announce that we will be making this a syntactic
>> constraint from
>> > > version X and make the change wholesale then. It would be less formal
>> backwards
>> > > compatibility than we usually 

Re: [scikit-learn] Next Sprint

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:
> We can also do Paris in April / May or June if that's ok with Joel and better
> for Andreas.

Absolutely.

My thoughts here are that I want to minimize transportation, partly
because flying has a large carbon footprint. Also, for personal reasons,
I am not sure that I will be able to make it to Austin in July, but I
realize that this is a pretty bad argument.

We're happy to try to host in Paris whenever it's most convenient and to
try to help with travel for those not in Paris.

Gaël


Re: [scikit-learn] Next Sprint

2018-11-20 Thread Olivier Grisel
We can also do Paris in April / May or June if that's ok with Joel and
better for Andreas.

I am teaching on Fridays from the end of January to March, but I can miss
half a day of the sprint to teach my class.

-- 
Olivier