Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Raghav R V
Hi Stuart Reynolds,

Like Jacob said, we have an active PR at https://github.com/scikit-learn/scikit-learn/pull/5974

You could do

    git fetch https://github.com/raghavrv/scikit-learn.git missing_values_rf:missing_values_rf
    git checkout missing_values_rf
    python setup.py install

and try it out. I

Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Dale T Smith
, Johns Creek, GA 30097 | dale.t.sm...@macys.com

From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] On Behalf Of Stuart Reynolds
Sent: Thursday, October 13, 2016 2:14 PM
To: scikit-learn@python.org
Subject: [scikit-learn] Missing data and decision trees

Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Raphael C
You can simply make a new binary feature (per feature that might have a missing value) that is 1 if the value is missing and 0 otherwise. The RF can then work out what to do with this information. I don't know how this compares in practice to more sophisticated approaches. Raphael On Thursday,
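The indicator-feature idea above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `add_missing_indicators` and the `fill_value` placeholder are assumptions, and the message itself does not say what value the original missing entries should be replaced with.

```python
import numpy as np


def add_missing_indicators(X, fill_value=0.0):
    """Append one 0/1 indicator column per feature that contains NaN.

    NaNs in the original columns are replaced by `fill_value`. That
    placeholder is an assumption of this sketch; the original message
    does not specify how to fill the missing entries.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                        # boolean mask, same shape as X
    cols = np.flatnonzero(missing.any(axis=0))   # features with at least one NaN
    filled = np.where(missing, fill_value, X)    # NaN-free copy of the data
    indicators = missing[:, cols].astype(float)  # 1.0 where the value was missing
    return np.hstack([filled, indicators]), cols


X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])
X_aug, indicator_cols = add_missing_indicators(X)
```

The augmented matrix `X_aug` contains no NaN values, so it could then be passed to a tree-based estimator such as a random forest, which is free to split on the indicator columns.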

Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Jason Rudy
It's not a decision tree, but py-earth may also do what you need. It handles missingness as described in section 3.4 here: http://media.salford-systems.com/library/MARS_V2_JHF_LCS-108.pdf. Basically, missingness is considered potentially predictive. On Thu, Oct 13, 2016 at 11:20 AM, Jeff

Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Jeff
I ran into this several times as well with the scikit-learn implementation of GBM. Look at xgboost if you have not already; is there someone out there who hasn't? :) It deals with missing values in the predictor space in a very elegant manner.

Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Jacob Schreiber
I think Raghav is working on it in this PR: https://github.com/scikit-learn/scikit-learn/pull/5974 The reason they weren't initially supported is likely that it involves a lot of work and design choices to handle missing values appropriately, and the discussion on the best way to handle it was