Re: [Scikit-learn-general] Trees with unbalanced classes
Sergey Feldman sergeyfeldman@... writes:

Hi Sergey,

I am having the exact same issue with my data set, except it's 80-dimensional and I only have two classes, which are unbalanced about 5:1. I am also using random forest in sklearn. Just curious: how well did Manish's suggestion work with your classifier in sklearn?

Thank you!!
Michael

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
[Scikit-learn-general] Trees with unbalanced classes
I'm dealing with a 50-class classification problem with extremely unbalanced classes. The smallest class has about 1,000 samples and the largest has 500,000. The random forest I've trained is being heavily skewed towards the big classes.

Is there a good way to deal with this kind of problem in sklearn as of now? Or is there room to implement some kind of stratified bootstrap strategy or a weighting strategy (as in here: http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf, for example)?

What other non-linear classifiers in sklearn would be good for this kind of dataset? It's about 2 million examples in 500+ dimensions.

Thanks,
Sergey
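One possible reading of the "stratified bootstrap" idea raised in the question is to draw the same number of indices, with replacement, from every class, so that each tree would be grown on a balanced sample. The sketch below only illustrates that sampling step with the standard library. The function name `balanced_bootstrap_indices` and its `per_class` and `seed` parameters are hypothetical and are not part of scikit-learn.

```python
import random
from collections import defaultdict


def balanced_bootstrap_indices(labels, per_class, seed=None):
    """Draw `per_class` sample indices, with replacement, from each class.

    Hypothetical helper for illustration only; not a scikit-learn API.
    """
    rng = random.Random(seed)

    # Group the sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Sample the same number of indices from every class.
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=per_class))
    return sample
```

Each call yields one balanced bootstrap sample; a forest built this way would need one such sample per tree, which this sketch does not attempt.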
Re: [Scikit-learn-general] Trees with unbalanced classes
Hi Sergey,

There is a sample_weight option (not very well documented) in the random forest classifier's fit method that might help. You might want to check out the SVC example to see the sample_weight format:
http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

You can provide different weights to different classes (e.g., inversely proportional to the number of samples in each class).

-Manish

On Jul 12, 2013, at 4:40 PM, Sergey Feldman sergeyfeld...@gmail.com wrote:
[...]
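The weighting suggested above (weights inversely proportional to the number of samples in each class) could be computed roughly as follows. This is a minimal sketch, not code from the thread: the helper names `inverse_frequency_weights` and `fit_weighted_forest` are hypothetical, and the normalisation `n_samples / (n_classes * count)` is one common choice among several. The only scikit-learn call assumed is `RandomForestClassifier.fit(X, y, sample_weight=...)`.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample, inversely proportional to its class count.

    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # With this scaling every class contributes the same total weight.
    return [n_samples / (n_classes * counts[label]) for label in labels]


def fit_weighted_forest(X, y, n_estimators=100):
    """Fit a random forest with the weights above (requires scikit-learn)."""
    # Imported lazily so the weight helper is usable without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=n_estimators)
    clf.fit(X, y, sample_weight=inverse_frequency_weights(y))
    return clf
```

For a two-class problem unbalanced 5:1, this gives each minority-class sample five times the weight of a majority-class sample, so both classes carry equal total weight during training.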
Re: [Scikit-learn-general] Trees with unbalanced classes
Thanks, Manish! Exactly what I was looking for.

On Fri, Jul 12, 2013 at 4:52 PM, Manish Amde manish...@gmail.com wrote:
[...]