[scikit-learn] Fairness Metrics

2018-10-28 Thread Feldman, Joshua
Hi,

I was wondering if there's any interest in adding fairness metrics to
sklearn. Specifically, I was thinking of implementing the metrics described
here:

https://dsapp.uchicago.edu/projects/aequitas/

I recognize that these metrics are extremely simple to calculate, but given
that sklearn is the standard machine learning package in Python, I think it
would be very powerful to explicitly include algorithmic fairness - it
would make these methods more accessible and, as a matter of principle,
demonstrate that ethics is part of ML and not an afterthought. I would love
to hear the group's thoughts and whether there's interest in such a feature.
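
For concreteness, here is a rough sketch of one such metric (false positive
rate disparity across groups) built on existing sklearn primitives; the
helper name and signature are placeholders of mine, not Aequitas's API:

import numpy as np
from sklearn.metrics import confusion_matrix

def fpr_disparity(y_true, y_pred, group):
    # ratio of the largest to the smallest per-group false positive rate;
    # 1.0 means parity across groups (hypothetical helper, for illustration)
    rates = []
    for g in np.unique(group):
        mask = group == g
        tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask]).ravel()
        rates.append(fp / (fp + tn))
    return max(rates) / min(rates)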

Thanks!

Josh


Re: [scikit-learn] Pipegraph example: KMeans + LDA

2018-10-28 Thread Andreas Mueller


On 10/24/18 4:11 AM, Manuel Castejón Limas wrote:

Dear all,
as a way of improving the documentation of PipeGraph we intend to 
provide more examples of its usage. There was popular demand for 
application cases motivating its usage, so here is a very simple 
case with two steps: a KMeans followed by an LDA.


https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py

This short example points out the following challenges:
- KMeans is not a transformer but an estimator


KMeans is a transformer in sklearn: 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform


(you can't get the labels to be the output, which is what you're doing 
here, but it is a transformer)
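
To illustrate with a minimal sketch: transform outputs distances to the 
cluster centers, while the labels come from predict:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, random_state=0).fit(X)
print(km.transform(X).shape)  # (150, 3): distances to each cluster center
print(km.predict(X).shape)    # (150,): the cluster labels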


- The LDA score function requires the y parameter, but its input does 
not come from a known set of labels; it comes from the previous KMeans

- Moreover, the GridSearchCV.fit call would also require a 'y' parameter


Not true if you provide a scoring that doesn't require y or if you don't 
specify scoring and the scoring method of the estimator doesn't require y.


GridSearchCV.fit doesn't require y.
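
For instance (a minimal sketch, with an estimator whose own score method 
doesn't need y):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, _ = load_iris(return_X_y=True)
# KMeans.score(X) needs no y, so neither does the grid search
search = GridSearchCV(KMeans(random_state=0), {"n_clusters": [2, 3, 4]}, cv=3)
search.fit(X)  # no y passed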

- It would be nice to have access to the output of the KMeans step as 
well.


PipeGraph is capable of addressing these challenges.

The rationale for this example lies in the 
identification-reconstruction realm. In a scenario where the class 
labels are unknown, we might want to associate the quality of the 
clustering structure with the capability of a later model to 
reconstruct this structure. So the basic idea here is that if LDA is 
capable of getting good results, it is because the information from the 
KMeans was good enough for that purpose, hinting at the discovery of a 
good structure.


Can you provide a citation for that? That seems to depend heavily on the 
clustering algorithm and the classifier.

To me, stability scoring seems more natural: https://arxiv.org/abs/1007.1075

This does seem interesting as well, though; I haven't thought about it.

It's cool that this is possible, but I feel this is still not really a 
"killer application" in that this is not a very common pattern.


Also, you could replicate something similar in sklearn with:

import numpy as np
from sklearn.model_selection import cross_val_score

def estimator_scorer(testing_estimator):
    def my_scorer(estimator, X, y=None):
        # use the fitted estimator's predictions as pseudo-labels
        y = estimator.predict(X)
        return np.mean(cross_val_score(testing_estimator, X, y))
    return my_scorer

Though using that we'd be doing nested cross-validation on the test set...
That's a bit of an issue in the current GridSearchCV implementation :-/ 
There's an issue by Joel somewhere
to implement something that allows training without splitting, which is 
what you'd want here.
You could run the outer grid-search with a custom cross-validation 
iterator that returns all indices as training and test set and only does 
a single split, though...


from sklearn.utils.validation import _num_samples

class NoSplitCV(object):

    def split(self, X, y=None, groups=None):
        # a single "split" that uses all samples as both train and test
        indices = np.arange(_num_samples(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

Though I acknowledge that your code only takes 4 lines, while mine takes 
8 (though if we'd add NoSplitCV to sklearn mine would also only take 4 
lines :P)
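
Wiring the two sketches above together, a hypothetical end-to-end version 
of your example (the estimator and grid are just placeholders of mine):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

X, _ = load_iris(return_X_y=True)
search = GridSearchCV(KMeans(random_state=0),
                      {"n_clusters": [2, 3, 4]},
                      scoring=estimator_scorer(LinearDiscriminantAnalysis()),
                      cv=NoSplitCV())
search.fit(X)  # scores each KMeans by how well an LDA re-predicts its labels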


I think pipegraph is cool, not meaning to give you a hard time ;)



Re: [scikit-learn] Strange code but that works

2018-10-28 Thread Joel Nothman
Be careful: that @property is very significant here. It means that this is
a description of how to *get* the method, not how to *run* the method. You
will notice, for instance, that it says `def transform(self)`, not `def
transform(self, X)`
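
A toy sketch of the get-vs-run distinction (not the actual Pipeline code):

class Example(object):
    @property
    def transform(self):
        # runs on attribute *access*: `e.transform` returns the bound
        # method below rather than doing any work on X
        return self._transform

    def _transform(self, X):
        return X

e = Example()
func = e.transform    # the property getter fires here, no X involved
print(func([1, 2]))   # only the returned method takes X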


Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Sebastian Raschka
That's nice to know, thanks a lot for the reference!

Best,
Sebastian

> On Oct 28, 2018, at 3:34 AM, Guillaume Lemaître  
> wrote:
> 
> FYI: https://github.com/scikit-learn/scikit-learn/pull/12364
> 
> On Sun, 28 Oct 2018 at 09:32, Guillaume Lemaître  
> wrote:
> There is always a shuffling when iterating over the features (even when going 
> over all of them).
> So in the case of a tie, the split will be done on the first feature encountered, 
> which will differ due to the shuffling.
> 
> There is a PR intending to make the algorithm deterministic so that it 
> always selects the same feature in the case of a tie.
> 
> On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann 
>  wrote:
> The random_state is used in the splitters:
> 
> SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS
> 
> splitter = self.splitter
> if not isinstance(self.splitter, Splitter):
>     splitter = SPLITTERS[self.splitter](criterion,
>                                         self.max_features_,
>                                         min_samples_leaf,
>                                         min_weight_leaf,
>                                         random_state,
>                                         self.presort)
> 
> Which is defined as:
> 
> DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
>                    "random": _splitter.RandomSplitter}
> 
> SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
>                     "random": _splitter.RandomSparseSplitter}
> 
> Both 'best' and 'random' use random states. The DecisionTreeClassifier uses 
> 'best' as the default `splitter` parameter. I am not sure how this 'best' 
> strategy was defined. The docs only say "Supported strategies are “best”...". 
> 
> 
> 
> 
> On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański  wrote:
> Just a small side note on something I've come across with Random Forests, which in 
> the end form an ensemble of Decision Trees. I ran a thousand iterations of RFs on 
> multi-label data and, purely as a random effect, got a 4-10 percentage point 
> difference in subset accuracy depending on the data set, while 
> I've seen papers report differences of just a couple pp as statistically 
> significant after a non-parametric rank test. 
> 
> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka  
> wrote:
> Good suggestion. The trees look different. I.e., there seems to be a tie at 
> some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65
> 
> So, I suspect that the features are shuffled, let's call it X_shuffled. Then 
> at some point the max_features are selected, which is by default 
> X_shuffled[:, :n_features]. Based on that, if there's a tie between 
> impurities for the different features, it's probably selecting the first 
> feature in the array among these ties.
> 
> If this is true (have to look into the code more deeply then) I wonder if it 
> would be worthwhile to change the implementation such that the shuffling only 
> occurs if max_features < n_features, because this way we could have 
> deterministic behavior for the trees by default, which I'd find more 
> intuitive for plain decision trees tbh.
> 
> Let me know what you all think.
> 
> Best,
> Sebastian
> 
> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente 
> >  wrote:
> > 
> > Hmmm that’s weird...
> > 
> > Have you tried to plot the trees (the decision rules) for the tree with 
> > different seeds, and see if the gain for the first split is the same even 
> > if the split itself is different?
> > 
> > I’d at least try that before diving into the source code...
> > 
> > Cheers,
> > 
> > --
> > Julio
> > 
> >> On 28 Oct 2018, at 2:24, Sebastian Raschka  
> >> wrote:
> >> 
> >> Thanks, Javier,
> >> 
> >> however, the max_features is n_features by default. But if you execute sth 
> >> like
> >> 
> >> import numpy as np
> >> from sklearn.datasets import load_iris
> >> from sklearn.model_selection import train_test_split
> >> from sklearn.tree import DecisionTreeClassifier
> >> 
> >> iris = load_iris()
> >> X, y = iris.data, iris.target
> >> X_train, X_test, y_train, y_test = train_test_split(X, y,
> >>   test_size=0.3,
> >>   random_state=123,
> >>   shuffle=True,
> >>   stratify=y)
> >> 
> >> for i in range(20):
> >>   tree = DecisionTreeClassifier()
> >>   tree.fit(X_train, y_train)
> >>   print(tree.score(X_test, y_test))
> >> 
> >> 
> >> 
> >> You will find that the tree will produce different results if you don't 
> >> fix the random seed. I suspect, related to what you said about the random 
> >> feature selection if max_features is not n_features, that there is 
> >> generally some sorting of the features going on, and the different trees 
> >> are then due to tie-breaking if two 

Re: [scikit-learn] Strange code but that works

2018-10-28 Thread Guillaume Lemaître
On Sun, 28 Oct 2018 at 07:42, Louis Abraham via scikit-learn <
scikit-learn@python.org> wrote:

> Hi,
>
> This is code from sklearn.pipeline.Pipeline:
> @property
> def transform(self):
>     """Apply transforms, and transform with the final estimator
>
>     This also works where final estimator is ``None``: all prior
>     transformations are applied.
>
>     Parameters
>     ----------
>     X : iterable
>         Data to transform. Must fulfill input requirements of first step
>         of the pipeline.
>
>     Returns
>     -------
>     Xt : array-like, shape = [n_samples, n_transformed_features]
>     """
>     # _final_estimator is None or has transform, otherwise attribute error
>     # XXX: Handling the None case means we can't use if_delegate_has_method
>     if self._final_estimator is not None:
>         self._final_estimator.transform
>     return self._transform
>
> I don't understand why `self._final_estimator.transform` can be returned,
> ignoring all the previous transformers.
>

It is not returned; it is accessed, so that if the final estimator does not
implement a transform method an AttributeError is raised.
Otherwise, _transform is returned, and that method actually performs the
transforms of all the transformers (except the ones that are set to None).
This is what the comment above is referring to (_final_estimator
is None or has transform, otherwise attribute error).
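
The check boils down to this idiom (a toy sketch):

class Final(object):
    pass  # no transform method

final = Final()
try:
    final.transform  # plain attribute access, nothing is called
except AttributeError:
    print("final estimator cannot transform")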



> However, when testing it works:
>
> ```
> >>> p = make_pipeline(FunctionTransformer(lambda x: 2*x),
> ...                   FunctionTransformer(lambda x: x-1))
> >>> p.transform(np.array([[1,2]]))
> array([[1, 3]])
> ```
>
> Could somebody explain that to me?
>
> Best,
> Louis Abraham
>


-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/


Re: [scikit-learn] Question about get_params / set_params

2018-10-28 Thread Guillaume Lemaître
On Sun, 28 Oct 2018 at 09:31, Louis Abraham via scikit-learn <
scikit-learn@python.org> wrote:

> Hi,
>
> According to
> http://scikit-learn.org/0.16/developers/index.html#get-params-and-set-params
> ,
> get_params and set_params are used to clone estimators.
>

sklearn.base.clone is the function used for cloning. get_params and set_params
are accessors to the attributes of an estimator and are defined by
BaseEstimator.
For Pipeline and FeatureUnion, those accessors rely on _BaseComposition,
which manages access to the attributes of the sub-estimators.
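
Roughly, clone rebuilds a fresh, unfitted estimator from whatever
get_params reports (a minimal sketch):

from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)
new_scaler = clone(scaler)  # rebuilt from scaler.get_params(), unfitted
print(new_scaler.get_params()["with_mean"])  # False: params survive the clone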


> However, I don't understand how it is used in FeatureUnion:
> `return self._get_params('transformer_list', deep=deep)`
>

transformer_list contains all the estimators used in the FeatureUnion, and
_BaseComposition allows you to access the parameters of each transformer.


>
> Why doesn't it contain other arguments like n_jobs and transformer_weights?
>

The first line of _get_params in _BaseComposition will list the attributes
of FeatureUnion:
https://github.com/scikit-learn/scikit-learn/blob/06ac22d06f54353ea5d5bba244371474c7baf938/sklearn/utils/metaestimators.py#L26

For instance:

In [5]: trans = FeatureUnion([('trans1', StandardScaler()), ('trans2',
MinMaxScaler())])


In [6]:
trans.get_params()

Out[6]:
{'n_jobs': None,
 'transformer_list': [('trans1',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('trans2', MinMaxScaler(copy=True, feature_range=(0, 1)))],
 'transformer_weights': None,
 'trans1': StandardScaler(copy=True, with_mean=True, with_std=True),
 'trans2': MinMaxScaler(copy=True, feature_range=(0, 1)),
 'trans1__copy': True,
 'trans1__with_mean': True,
 'trans1__with_std': True,
 'trans2__copy': True,
 'trans2__feature_range': (0, 1)}

As you can see, n_jobs and transformer_weights are accessible as well.
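
And the same double-underscore convention works in the other direction
through set_params, e.g. (a sketch; it returns the updated FeatureUnion):

In [7]: trans.set_params(trans1__with_mean=False, n_jobs=2)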


>
> Best
> Louis
>


-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/


Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Guillaume Lemaître
FYI: https://github.com/scikit-learn/scikit-learn/pull/12364


Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Guillaume Lemaître
There is always a shuffling when iterating over the features (even when
going over all of them).
So in the case of a tie, the split will be done on the first feature
encountered, which will differ due to the shuffling.

There is a PR intending to make the algorithm deterministic so that it
always selects the same feature in the case of a tie.
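
A quick way to see the tie-breaking effect disappear once the shuffling
is seeded (an illustrative sketch):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# same seed -> same shuffling -> ties break identically, trees match
a = DecisionTreeClassifier(random_state=0).fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X, y)
print((a.tree_.feature == b.tree_.feature).all())  # True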


[scikit-learn] Question about get_params / set_params

2018-10-28 Thread Louis Abraham via scikit-learn
Hi,

According to 
http://scikit-learn.org/0.16/developers/index.html#get-params-and-set-params 
,
get_params and set_params are used to clone estimators.
However, I don't understand how it is used in FeatureUnion:
`return self._get_params('transformer_list', deep=deep)`

Why doesn't it contain other arguments like n_jobs and transformer_weights?

Best
Louis



Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Piotr Szymański
Just a small side note on something I've come across with Random Forests,
which in the end form an ensemble of Decision Trees. I ran a thousand
iterations of RFs on multi-label data and, purely as a random effect, got a
4-10 percentage point difference in subset accuracy depending on the data
set, while I've seen papers report differences of just a couple pp as
statistically significant after a non-parametric rank test.
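
For anyone who wants to quantify that seed-only spread on their own data,
a minimal sketch (plain accuracy on iris here, just for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=123, stratify=y)
scores = [RandomForestClassifier(n_estimators=100, random_state=seed)
          .fit(X_tr, y_tr).score(X_te, y_te)
          for seed in range(30)]
print(min(scores), max(scores))  # spread attributable to the seed alone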


Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Sebastian Raschka
Good suggestion. The trees look different. I.e., there seems to be a tie at 
some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65

So, I suspect that the features are shuffled, let's call it X_shuffled. Then at 
some point the max_features are selected, which is by default X_shuffled[:, 
:n_features]. Based on that, if there's a tie between impurities for the 
different features, it's probably selecting the first feature in the array 
among these ties.

If this is true (have to look into the code more deeply then) I wonder if it 
would be worthwhile to change the implementation such that the shuffling only 
occurs if max_features < n_features, because this way we could have 
deterministic behavior for the trees by default, which I'd find more intuitive 
for plain decision trees tbh.

Let me know what you all think.

Best,
Sebastian

> On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente  
> wrote:
> 
> Hmmm that’s weird...
> 
> Have you tried to plot the trees (the decision rules) for the tree with 
> different seeds, and see if the gain for the first split is the same even if 
> the split itself is different?
> 
> I’d at least try that before diving into the source code...
> 
> Cheers,
> 
> --
> Julio
> 
>> On 28 Oct 2018, at 2:24, Sebastian Raschka  
>> wrote:
>> 
>> Thanks, Javier,
>> 
>> however, the max_features is n_features by default. But if you execute sth 
>> like
>> 
>> import numpy as np
>> from sklearn.datasets import load_iris
>> from sklearn.model_selection import train_test_split
>> from sklearn.tree import DecisionTreeClassifier
>> 
>> iris = load_iris()
>> X, y = iris.data, iris.target
>> X_train, X_test, y_train, y_test = train_test_split(X, y,
>>   test_size=0.3,
>>   random_state=123,
>>   shuffle=True,
>>   stratify=y)
>> 
>> for i in range(20):
>>   tree = DecisionTreeClassifier()
>>   tree.fit(X_train, y_train)
>>   print(tree.score(X_test, y_test))
>> 
>> 
>> 
>> You will find that the tree will produce different results if you don't fix 
>> the random seed. I suspect, related to what you said about the random 
>> feature selection if max_features is not n_features, that there is generally 
>> some sorting of the features going on, and the different trees are then due 
>> to tie-breaking if two features have the same information gain?
>> 
>> Best,
>> Sebastian
>> 
>> 
>> 
>>> On Oct 27, 2018, at 6:16 PM, Javier López  wrote:
>>> 
>>> Hi Sebastian,
>>> 
>>> I think the random state is used to select the features that go into each 
>>> split (look at the `max_features` parameter)
>>> 
>>> Cheers,
>>> Javier
>>> 
>>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka 
>>>  wrote:
>>> Hi all,
>>> 
>>> when I was implementing a bagging classifier based on scikit-learn's 
>>> DecisionTreeClassifier, I noticed that the results were not deterministic 
>>> and found that this was due to the random_state in the 
>>> DecisionTreeClassifier (which is set to None by default).
>>> 
>>> I am wondering what exactly this random state is used for? I can imagine it 
>>> being used for resolving ties if the information gain for multiple features 
>>> is the same, or it could be that the feature splits of continuous features 
>>> are different? (I thought the heuristic is to sort the features and to 
>>> consider those feature values next to each other associated with examples that 
>>> have different class labels -- but is there maybe some random subselection 
>>> involved?)
>>> 
>>> If someone knows more about this, where the random_state is used, I'd be 
>>> happy to hear it :)
>>> 
>>> Also, we could then maybe add the info to the DecisionTreeClassifier's 
>>> docstring, which is currently a bit too generic to be useful, I think:
>>> 
>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>>> 
>>> 
>>>   random_state : int, RandomState instance or None, optional (default=None)
>>>   If int, random_state is the seed used by the random number generator;
>>>   If RandomState instance, random_state is the random number generator;
>>>   If None, the random number generator is the RandomState instance used
>>>   by `np.random`.
>>> 
>>> 
>>> Best,
>>> Sebastian

[scikit-learn] Strange code but that works

2018-10-28 Thread Louis Abraham via scikit-learn
Hi,

This is code from sklearn.pipeline.Pipeline:

@property
def transform(self):
    """Apply transforms, and transform with the final estimator

    This also works where final estimator is ``None``: all prior
    transformations are applied.

    Parameters
    ----------
    X : iterable
        Data to transform. Must fulfill input requirements of first step
        of the pipeline.

    Returns
    -------
    Xt : array-like, shape = [n_samples, n_transformed_features]
    """
    # _final_estimator is None or has transform, otherwise attribute error
    # XXX: Handling the None case means we can't use if_delegate_has_method
    if self._final_estimator is not None:
        self._final_estimator.transform
    return self._transform

I don't understand why `self._final_estimator.transform` can be returned, 
ignoring all the previous transformers.
However, when testing it works:

```
>>> p = make_pipeline(FunctionTransformer(lambda x: 2*x),
...                   FunctionTransformer(lambda x: x-1))
>>> p.transform(np.array([[1,2]]))
array([[1, 3]])
```

Could somebody explain that to me?

Best,
Louis Abraham
