Re: [Scikit-learn-general] Weighted and Balanced Random Forests
I have a follow-up question regarding the use of sample_weight when fitting a RandomForestClassifier: does the predict_proba method take the sample weights (used during fitting) into account as well? I spent some time trying to understand the _tree.pyx and tree.py files in the codebase, but I am still a little fuzzy about how the predict_proba code works when sample weights are present. I have an unbalanced data set (1:12 ratio), and I find that the probabilities are highly skewed towards the majority class even after using sample weights. I am planning to use isotonic regression to calibrate my predictions, but it would be nice to have a less skewed input to the calibration algorithm.

On Thu, Feb 7, 2013 at 11:33 PM, Gilles Louppe <g.lou...@gmail.com> wrote:

> Hello,
>
> You might achieve what you want by using sample weights when fitting your forest (see the 'sample_weight' parameter). There is also a 'balance_weights' function in the preprocessing module that generates sample weights for you, such that the classes become balanced:
> https://github.com/glouppe/scikit-learn/blob/master/sklearn/preprocessing.py#L1221
> (This should appear in the reference documentation; I'll fix that.)
>
> Hope this helps,
> Gilles
>
> On 8 February 2013 00:44, Manish Amde <manish...@gmail.com> wrote:
>
>> Fellow sklearners,
>>
>> I am working on a classification problem with an unbalanced data set and have had success using SVM classifiers with the class_weight option. I have also tried Random Forests and am getting decent ROC performance, but I am hoping for an improvement by using Weighted or Balanced Random Forests, as suggested in this paper:
>> http://www.stat.berkeley.edu/tech-reports/666.pdf
>>
>> I don't see any implementation of these options, but I might be mistaken, so I wanted to ask the community. I am also willing to write code and contribute it back if it would be useful to other folks.
>> I have also thought about balancing the data by up/down-sampling the minority/majority class (with or without replacement), and even SMOTE, but I couldn't find those implementations in the scikit-learn library yet. The modified Random Forests seem to outperform these methods according to the paper, hence I am interested in trying those first.
>>
>> -Manish

--
Free Next-Gen Firewall Hardware Offer
Buy your Sophos next-gen firewall before the end of March 2013 and get the hardware for free! Learn more.
http://p.sf.net/sfu/sophos-d2d-feb
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
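For what it's worth: in the scikit-learn trees, each leaf stores the (weighted) class counts seen during fitting, and predict_proba normalizes those counts, so weights passed to fit should indeed flow through to the predicted probabilities. A minimal sketch of the workflow discussed in this thread, on synthetic 1:12 data; note that compute_sample_weight('balanced', ...) from sklearn.utils.class_weight is the modern equivalent of the balance_weights helper mentioned below (assuming a reasonably recent scikit-learn):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)

# Synthetic 1:12 imbalanced two-class problem
n_min, n_maj = 50, 600
X = np.vstack([rng.normal(1.0, 1.0, size=(n_min, 2)),
               rng.normal(-1.0, 1.0, size=(n_maj, 2))])
y = np.array([1] * n_min + [0] * n_maj)

# Per-sample weights such that each class carries equal total weight
w = compute_sample_weight("balanced", y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=w)

# Probabilities for the minority class; with balanced weights these
# should be noticeably less skewed towards the majority class
proba = clf.predict_proba(X)[:, 1]
```

If the probabilities still look miscalibrated after this, isotonic regression on a held-out set (as suggested above) remains a sensible second stage.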
Re: [Scikit-learn-general] Weighted and Balanced Random Forests
I've been wrestling with this same issue in the regression case. I realize it's not as straightforward to balance a continuous target as it is discrete output classes, but I wonder if this list has any thoughts about how it might be approached. The data I'm predicting is normally distributed, and particularly when sample sizes are small, the tails tend to be neglected and poorly predicted. Thoughts?

On Fri, Feb 8, 2013 at 2:44 AM, Manish Amde <manish...@gmail.com> wrote:

> Thanks Gilles. This definitely helps. I am glad I asked. :-)
>
> -Manish
>
> On Feb 7, 2013, at 11:33 PM, Gilles Louppe <g.lou...@gmail.com> wrote:
>
>> [...]
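One hedged idea for the regression case (not something scikit-learn provides out of the box, as far as I know): bin the continuous target and weight each sample inversely to its bin's frequency, so the tails of a normal-looking target carry as much total weight as the dense middle. A rough sketch with synthetic data; the bin count and weighting scheme here are my own choices, not an established recipe:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + 0.1 * rng.normal(size=500)  # roughly normal target

# Bin the target; weight each sample inversely to its bin count so that
# rare (tail) targets contribute as much total weight as the bulk.
counts, edges = np.histogram(y, bins=10)
counts = np.maximum(counts, 1)             # guard against empty bins
bin_idx = np.clip(np.digitize(y, edges[1:-1]), 0, len(counts) - 1)
w = 1.0 / counts[bin_idx]
w *= len(y) / w.sum()                      # rescale to mean weight 1

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y, sample_weight=w)
```

Whether this actually improves tail predictions would need checking against a tail-sensitive metric (e.g. error conditioned on extreme y) rather than plain MSE.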
Re: [Scikit-learn-general] Weighted and Balanced Random Forests
Hello,

You might achieve what you want by using sample weights when fitting your forest (see the 'sample_weight' parameter). There is also a 'balance_weights' function in the preprocessing module that generates sample weights for you, such that the classes become balanced:
https://github.com/glouppe/scikit-learn/blob/master/sklearn/preprocessing.py#L1221
(This should appear in the reference documentation; I'll fix that.)

Hope this helps,
Gilles

On 8 February 2013 00:44, Manish Amde <manish...@gmail.com> wrote:

> [...]
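In case the linked balance_weights helper moves around between versions, the computation itself is tiny and easy to inline. A sketch of what it does, with a function name of my own choosing; each class ends up carrying the same total weight:

```python
import numpy as np

def balance_weights(y):
    """Per-sample weights making every class carry equal total weight.

    The weight for class c is n_samples / (n_classes * n_c), so each
    class's weights sum to n_samples / n_classes.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    per_class = len(y) / (len(classes) * counts.astype(float))
    return per_class[np.searchsorted(classes, y)]

# 12:1 imbalance, as in the use case discussed in this thread
y = np.array([0] * 12 + [1])
w = balance_weights(y)
```

Passing the resulting array as fit(..., sample_weight=w) reproduces the balancing effect described above.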
Re: [Scikit-learn-general] Weighted and Balanced Random Forests
Thanks Gilles. This definitely helps. I am glad I asked. :-)

-Manish

On Feb 7, 2013, at 11:33 PM, Gilles Louppe <g.lou...@gmail.com> wrote:

> Hello,
>
> You might achieve what you want by using sample weights when fitting your forest (see the 'sample_weight' parameter). There is also a 'balance_weights' function in the preprocessing module that generates sample weights for you, such that the classes become balanced:
> https://github.com/glouppe/scikit-learn/blob/master/sklearn/preprocessing.py#L1221
> (This should appear in the reference documentation; I'll fix that.)
>
> Hope this helps,
> Gilles
>
> [...]