[scikit-learn] time complexity of tree-based model?

2018-12-19 Thread lampahome
I am running some benchmarks in my experiments, and I almost always use
ensemble-based regressors.

What is the time complexity if I use a random forest regressor? Assume I only
set *n_estimators=100* and leave every other parameter at its default.
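
For reference, the scikit-learn tree documentation puts the cost of building
one balanced tree at roughly O(n_features * n_samples * log(n_samples)), so a
forest multiplies that by n_estimators (with max_features in place of
n_features when feature subsampling is used). A minimal timing sketch, not
part of the original question, to check that scaling empirically:

    # Rough empirical check of RandomForestRegressor fit-time scaling.
    import time
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    n_features = 20

    for n_samples in (1000, 10000, 100000):
        X = rng.rand(n_samples, n_features)
        y = rng.rand(n_samples)
        model = RandomForestRegressor(n_estimators=100, n_jobs=-1,
                                      random_state=0)
        t0 = time.time()
        model.fit(X, y)
        print("n_samples=%d: fit in %.1fs" % (n_samples, time.time() - t0))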

thx
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories

2018-12-19 Thread Andreas Mueller



On 12/15/18 7:35 AM, Joris Van den Bossche wrote:
On Fri, 14 Dec 2018 at 16:46, Andreas Mueller wrote:



As far as I understand, the open PR is not a leave-one-out
TargetEncoder?

I would want it to be :-/

I also did not yet add the CountFeaturizer from that scikit-learn
PR, because it is actually quite different (e.g. it doesn't work
for regression tasks, as it counts conditional on y). But for
classification it could easily be added to the benchmarks.

I'm confused now. That's what TargetEncoder and leave-one-out
TargetEncoder do as well, right?


As far as I understand, that is not exactly what those do. The
TargetEncoder (as implemented in dirty_cat, category_encoders and
hccEncoders) will, for each category, calculate the expected value of
the target conditional on that category. For binary classification this
indeed comes down to counting the 0's and 1's, and there the information
contained in the result might be similar to the sklearn PR, but the
format is different: those packages calculate the probability (a value
between 0 and 1, the number of 1's divided by the number of samples in
that category) and return it as a single column, instead of returning
two columns with the counts of the 0's and 1's.
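
A small sketch of that format difference on a toy binary target (pandas,
hypothetical data; the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"city": ["a", "a", "a", "b", "b"],
                       "y":    [1, 0, 1, 0, 0]})

    # TargetEncoder-style: one column, the mean of y (= P(y=1)) per category.
    df["city_target_enc"] = df["city"].map(df.groupby("city")["y"].mean())

    # CountFeaturizer-style: one column per class with the class counts.
    counts = pd.crosstab(df["city"], df["y"])
    df = df.join(counts.rename(columns=lambda c: "count_y%d" % c), on="city")
    print(df)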
This is a standard case of the "binary special case", right? For 
multi-class you need multiple columns, right?

Doing a single column for binary makes sense, I think.

And for regression this is not related to counting anymore, but is just
the average of the target per category. (In practice the TargetEncoder
computes the same thing for regression and binary classification: the
average of the target per category. But for regression the
CountFeaturizer doesn't work, since there are no discrete values in the
target to count.)
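
That symmetry is easy to see in code: the same groupby-mean from the sketch
above works unchanged for a continuous target (again hypothetical data),
while per-class counting has no analogue:

    import pandas as pd

    df = pd.DataFrame({"city": ["a", "a", "b", "b", "b"],
                       "y":    [3.1, 2.9, 7.5, 8.0, 7.8]})  # continuous target

    # Mean of the target per category -- identical code to the binary case.
    df["city_target_enc"] = df["city"].map(df.groupby("city")["y"].mean())
    print(df)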

I guess CountFeaturizer was not implemented with regression in mind.
Actually, being able to do regression and classification in the same
estimator shows that "CountFeaturizer" is probably the wrong name.



Furthermore, all of the implementations in the three packages mentioned
have some kind of regularization (empirical Bayes shrinkage, or KFold or
leave-one-out cross-validation), while this is not present in the
CountFeaturizer PR (though this aspect is of course something we want
to test in the benchmarks).
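
For concreteness, a sketch of the leave-one-out variant of that
regularization (a hypothetical pandas implementation, not the code from any
of those packages): each row is encoded with the target mean of its category
computed without the row itself, which limits target leakage.

    import pandas as pd

    def leave_one_out_encode(cat, y):
        # Per-row category mean excluding the row itself; singleton
        # categories come out as NaN, which a real implementation would
        # handle by smoothing toward the global mean.
        g = y.groupby(cat)
        return (g.transform("sum") - y) / (g.transform("count") - 1)

    df = pd.DataFrame({"city": ["a", "a", "a", "b", "b"],
                       "y":    [1, 0, 1, 0, 0]})
    df["city_loo"] = leave_one_out_encode(df["city"], df["y"])
    print(df)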


Another thing I noticed in the CountFeaturizer implementation is that
its behaviour differs depending on whether y is passed. First, I find
this a bit strange, because the two behaviours are quite different:
counting the categories (to encode the categorical variable with a
notion of its frequency in the training set) versus counting the target
conditional on the category. But also, when using a transformer in a
Pipeline, you don't control the passing of y, I think? So inside a
pipeline you always get the behaviour of counting the target.
I would find it more logical to have those two things in two separate
transformers (if we think the "frequency encoder" is useful enough).
(I need to give this feedback on the PR, but that will be for after
the holidays.)
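
The Pipeline point is easy to demonstrate: Pipeline.fit forwards y to every
step's fit, so a transformer with an optional y always receives it and would
always run in target-counting mode (sketch with a made-up spy transformer):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    class YSpy(BaseEstimator, TransformerMixin):
        # Toy transformer that records whether fit received y.
        def fit(self, X, y=None):
            print("fit got y:", y is not None)
            return self

        def transform(self, X):
            return X

    X = np.random.rand(10, 2)
    y = np.array([0, 1] * 5)
    Pipeline([("spy", YSpy()), ("clf", LogisticRegression())]).fit(X, y)
    # prints "fit got y: True" -- the transformer cannot opt out of y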


I'm pretty sure I mentioned that before; I think an optional y is bad. I
just thought it was weird, but the pipeline argument is a good one.


Re: [scikit-learn] Next Sprint

2018-12-19 Thread Andreas Mueller

Can we please nail down dates for a sprint?

On 11/20/18 2:25 PM, Gael Varoquaux wrote:

On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:

We can also do Paris in April / May or June if that's ok with Joel and better
for Andreas.

Absolutely.

My thoughts here are that I want to minimize transportation, partly
because flying has a large carbon footprint. Also, for personal reasons,
I am not sure that I will be able to make it to Austin in July, but I
realize that this is a pretty bad argument.

We're happy to try to host in Paris whenever it's most convenient and to
try to help with travel for those not in Paris.

Gaël

