Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Raghav R V
Hi Stuart Reynolds,

Like Jacob said, we have an active PR at
https://github.com/scikit-learn/scikit-learn/pull/5974

You could do

git fetch https://github.com/raghavrv/scikit-learn.git missing_values_rf:missing_values_rf
git checkout missing_values_rf
python setup.py install

And try it out. I warn you though, there are some memory leaks I'm still
debugging. But for the most part it works well and outperforms basic
imputation techniques.

Please let us know if it breaks or doesn't solve your use case. Your input as
a user of that feature would be invaluable!

> I ran into this several times as well with the scikit-learn implementation of
GBM. Look at xgboost if you have not already (is there someone out there
that hasn't? :) - it deals with missing values in the predictor space in a
very elegant manner.
http://xgboost.readthedocs.io/en/latest/python/python_intro.html


The PR takes a conceptually similar approach. It is currently
implemented for DecisionTreeClassifier. After review and integration,
DecisionTreeRegressor will also support missing values. Once that
happens, enabling it in gradient boosting will be possible.

Thanks for the interest!!

On Thu, Oct 13, 2016 at 8:33 PM, Raphael C  wrote:

> You can simply make a new binary feature (per feature that might have a
> missing value) that is 1 if the value is missing and 0 otherwise.  The RF
> can then work out what to do with this information.
>
> I don't know how this compares in practice to more sophisticated
> approaches.
>
> Raphael
>
>
> On Thursday, October 13, 2016, Stuart Reynolds 
> wrote:
>
>> I'm looking for a decision tree and RF implementation that supports
>> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>>
>> It seems that scikit's decision tree algorithm doesn't allow this --
>> which is disappointing because it's one of the few methods that should be
>> able to sensibly handle problems with high amounts of missingness.
>>
>> Are there plans to allow missing data in scikit's decision trees?
>>
>> Also, is there any particular reason why missing values weren't supported
>> originally (e.g., does it integrate poorly with other features)?
>>
>> Regards
>> - Stuart
>>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Dale T Smith
Please define “sensibly”. I would be strongly opposed to modifying any models 
to incorporate “missingness”. No model handles missing data for you. That is 
for you to decide based on your individual problem domain.

Take a look at a talk from last winter on missing data by Nina Zumel. Nina 
defines “sensibly” in several ways.

https://www.r-bloggers.com/prepping-data-for-analysis-using-r/



__
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science
770-658-5176 | 5985 State Bridge Road, Johns Creek, GA 30097 | 
dale.t.sm...@macys.com

From: scikit-learn 
[mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] On Behalf Of 
Stuart Reynolds
Sent: Thursday, October 13, 2016 2:14 PM
To: scikit-learn@python.org
Subject: [scikit-learn] Missing data and decision trees

I'm looking for a decision tree and RF implementation that supports missing 
data (without imputation) -- ideally in Python, Java/Scala or C++.

It seems that scikit's decision tree algorithm doesn't allow this -- which is 
disappointing because it's one of the few methods that should be able to 
sensibly handle problems with high amounts of missingness.

Are there plans to allow missing data in scikit's decision trees?

Also, is there any particular reason why missing values weren't supported 
originally (e.g., does it integrate poorly with other features)?

Regards
- Stuart


Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Raphael C
You can simply make a new binary feature (one per feature that might have a
missing value) that is 1 if the value is missing and 0 otherwise. The RF
can then work out what to do with this information.

I don't know how this compares in practice to more sophisticated approaches.
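A minimal sketch of the indicator-feature idea described above, in plain NumPy. The helper name and the choice of 0.0 as the fill value are my own; any constant fill works since the indicator column carries the missingness information:

```python
import numpy as np

def add_missing_indicators(X, fill_value=0.0):
    """Append one 0/1 indicator column per feature and replace NaNs.

    Returns an array with twice as many columns: the original features
    (NaNs replaced by fill_value) followed by the missingness indicators.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_filled = np.where(missing, fill_value, X)
    return np.hstack([X_filled, missing.astype(float)])

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [4.0, 5.0]])
X_aug = add_missing_indicators(X)  # shape (3, 4): 2 features + 2 indicators
```

The augmented matrix can then be passed to RandomForestClassifier/Regressor as usual; the trees are free to split on the indicator columns if missingness happens to be predictive.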

Raphael

On Thursday, October 13, 2016, Stuart Reynolds 
wrote:

> I'm looking for a decision tree and RF implementation that supports
> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>
> It seems that scikit's decision tree algorithm doesn't allow this --
> which is disappointing because it's one of the few methods that should be
> able to sensibly handle problems with high amounts of missingness.
>
> Are there plans to allow missing data in scikit's decision trees?
>
> Also, is there any particular reason why missing values weren't supported
> originally (e.g., does it integrate poorly with other features)?
>
> Regards
> - Stuart
>


Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Jason Rudy
It's not a decision tree, but py-earth may also do what you need.  It
handles missingness as described in section 3.4 here:
http://media.salford-systems.com/library/MARS_V2_JHF_LCS-108.pdf.
Basically, missingness is considered potentially predictive.

On Thu, Oct 13, 2016 at 11:20 AM, Jeff  wrote:

> I ran into this several times as well with the scikit-learn implementation of
> GBM. Look at xgboost if you have not already (is there someone out there
> that hasn't? :) - it deals with missing values in the predictor space in a
> very elegant manner.
>
> http://xgboost.readthedocs.io/en/latest/python/python_intro.html
>
> https://arxiv.org/abs/1603.02754
>
>
> Jeff
>
>
>
> On 10/13/2016 2:14 PM, Stuart Reynolds wrote:
>
> I'm looking for a decision tree and RF implementation that supports
> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>
> It seems that scikit's decision tree algorithm doesn't allow this --
> which is disappointing because it's one of the few methods that should be
> able to sensibly handle problems with high amounts of missingness.
>
> Are there plans to allow missing data in scikit's decision trees?
>
> Also, is there any particular reason why missing values weren't supported
> originally (e.g., does it integrate poorly with other features)?
>
> Regards
> - Stuart
>
>


Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Jeff
I ran into this several times as well with the scikit-learn implementation
of GBM. Look at xgboost if you have not already (is there someone out
there that hasn't? :) - it deals with missing values in the predictor
space in a very elegant manner.


http://xgboost.readthedocs.io/en/latest/python/python_intro.html

https://arxiv.org/abs/1603.02754
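The core trick from the xgboost paper (its "sparsity-aware split finding") can be sketched in miniature: at each split, rows with a missing value are tried in both directions, and the split stores whichever "default direction" scores better. A toy illustration, using my own simplified sum-of-squared-errors scoring rather than xgboost's actual gain formula:

```python
def sse(ys):
    """Sum of squared errors of label values around their mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def split_with_default(xs, ys, threshold):
    """Split on x <= threshold; rows with missing x (None) are tried
    on both sides, and the better side becomes the default direction."""
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    miss = [y for x, y in zip(xs, ys) if x is None]
    score_left = sse(left + miss) + sse(right)   # send missing left
    score_right = sse(left) + sse(right + miss)  # send missing right
    if score_left <= score_right:
        return "left", score_left
    return "right", score_right

xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0.0, 0.0, 0.1, 8.0, 9.0, 0.2][:4] + [1.0, 0.2]  # see below
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0.0, 0.0, 0.1, 1.0, 1.0, 0.2]
direction, score = split_with_default(xs, ys, threshold=5.0)
# -> ("left", 0.0275): the missing rows' labels resemble the left group
```

At prediction time the stored direction is reused, so missing values cost nothing extra and are routed wherever they reduced the training loss.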


Jeff



On 10/13/2016 2:14 PM, Stuart Reynolds wrote:
I'm looking for a decision tree and RF implementation that supports 
missing data (without imputation) -- ideally in Python, Java/Scala or 
C++.


It seems that scikit's decision tree algorithm doesn't allow this -- 
which is disappointing because it's one of the few methods that should 
be able to sensibly handle problems with high amounts of missingness.


Are there plans to allow missing data in scikit's decision trees?

Also, is there any particular reason why missing values weren't 
supported originally (e.g., does it integrate poorly with other features)?


Regards
- Stuart




Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Jacob Schreiber
I think Raghav is working on it in this PR:
https://github.com/scikit-learn/scikit-learn/pull/5974

The reason they weren't initially supported is likely that handling
missing values appropriately involves a lot of work and design choices,
and the discussion of the best approach was postponed until there was a
working estimator that could serve most people's needs.

On Thu, Oct 13, 2016 at 11:14 AM, Stuart Reynolds  wrote:

> I'm looking for a decision tree and RF implementation that supports
> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>
> It seems that scikit's decision tree algorithm doesn't allow this --
> which is disappointing because it's one of the few methods that should be
> able to sensibly handle problems with high amounts of missingness.
>
> Are there plans to allow missing data in scikit's decision trees?
>
> Also, is there any particular reason why missing values weren't supported
> originally (e.g., does it integrate poorly with other features)?
>
> Regards
> - Stuart
>