Re: [Scikit-learn-general] Trees with unbalanced classes

2013-11-19 Thread Michael Lovell
Sergey Feldman sergeyfeldman@... writes:



 Hi Sergey,



I am having the exact same issue with my data set, except it's 80-dimensional
and I have only two classes, unbalanced at about 5:1.  I am also using a
random forest in sklearn. Just curious: how well did Manish's suggestion
work for your classifier?



Thank you!!



Michael

 

 Thanks, Manish!  Exactly what I was looking for.

 

 

 On Fri, Jul 12, 2013 at 4:52 PM, Manish Amde manish...@gmail.com wrote:

 

 Hi Sergey,

 

 There is a sample_weight option (not very well documented) in the random
 forest classifier's fit method that might help. You might want to check out
 the SVC example to see the sample_weight format.

 http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

 You can provide different weights to different classes (e.g., inversely
 proportional to the number of samples in each class).

 

 

 -Manish

 

 

 On Jul 12, 2013, at 4:40 PM, Sergey Feldman sergeyfeld...@gmail.com wrote:

 I'm dealing with a 50-class classification problem with extremely 
unbalanced classes.  The smallest class has about 1000 samples and the 
largest has 500,000.  The random forest I've trained is being heavily 
skewed towards the big classes.  

 Is there a good way to deal with this kind of problem in sklearn as of
 now?  Or is there room to implement some kind of stratified bootstrap
 strategy or a weighting strategy (as in here
 <http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf>,
 for example)?

 What other non-linear classifiers in sklearn would be good for this kind
 of dataset?  It's about 2 million examples in 500+ dimensions.

 Thanks,
 Sergey

 

--
Shape the Mobile Experience: Free Subscription
Software experts and developers: Be at the forefront of tech innovation.
Intel(R) Software Adrenaline delivers strategic insight and game-changing 
conversations that shape the rapidly evolving mobile landscape. Sign up now. 
http://pubads.g.doubleclick.net/gampad/clk?id=63431311iu=/4140/ostg.clktrk
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


[Scikit-learn-general] Trees with unbalanced classes

2013-07-12 Thread Sergey Feldman
I'm dealing with a 50-class classification problem with extremely
unbalanced classes.  The smallest class has about 1000 samples and the
largest has 500,000.  The random forest I've trained is being heavily
skewed towards the big classes.

Is there a good way to deal with this kind of problem in sklearn as of
now?  Or is there room to implement some kind of stratified bootstrap
strategy or a weighting strategy (as in here
<http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf>,
for example)?

What other non-linear classifiers in sklearn would be good for this kind of
dataset?  It's about 2 million examples in 500+ dimensions.

Thanks,
Sergey


Re: [Scikit-learn-general] Trees with unbalanced classes

2013-07-12 Thread Manish Amde
Hi Sergey,

There is a sample_weight option (not very well documented) in the random
forest classifier's fit method that might help. You might want to check out
the SVC example to see the sample_weight format.
http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

You can provide different weights to different classes (e.g., inversely
proportional to the number of samples in each class).
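
As a rough illustration of the weighting described above, here is a minimal
Python sketch. It assumes a scikit-learn version whose
RandomForestClassifier.fit accepts a sample_weight argument; the helper names
inverse_frequency_weights and fit_weighted_forest and the toy labels are
hypothetical, introduced only for this example.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample, inversely proportional to class size.

    Weights are scaled so that every class carries the same total weight.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


def fit_weighted_forest(X, y, **forest_params):
    """Fit a random forest using the inverse-frequency sample weights."""
    # Imported here so the weighting helper above has no dependencies.
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(**forest_params)
    clf.fit(X, y, sample_weight=inverse_frequency_weights(y))
    return clf


# Toy labels with a 5:1 imbalance (illustrative only).
y_toy = [0] * 5 + [1]
weights = inverse_frequency_weights(y_toy)
```

With a 5:1 split as in the toy labels, each minority sample gets five times
the weight of a majority sample, so both classes carry the same total weight.
More recent scikit-learn releases also accept class_weight="balanced" on the
forest itself, which may be simpler where it is available.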

-Manish

On Jul 12, 2013, at 4:40 PM, Sergey Feldman sergeyfeld...@gmail.com wrote:

 I'm dealing with a 50-class classification problem with extremely unbalanced 
 classes.  The smallest class has about 1000 samples and the largest has 
 500,000.  The random forest I've trained is being heavily skewed towards the 
 big classes.  
 
 Is there a good way to deal with this kind of problem in sklearn as of now?
 Or is there room to implement some kind of stratified bootstrap strategy or a
 weighting strategy (as in here
 <http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf>,
 for example)?
 
 What other non-linear classifiers in sklearn would be good for this kind of 
 dataset?  It's about 2 million examples in 500+ dimensions.
 
 Thanks,
 Sergey


Re: [Scikit-learn-general] Trees with unbalanced classes

2013-07-12 Thread Sergey Feldman
Thanks, Manish!  Exactly what I was looking for.


On Fri, Jul 12, 2013 at 4:52 PM, Manish Amde manish...@gmail.com wrote:

 Hi Sergey,

 There is a sample_weight option (not very well documented) in the random
 forest classifier's fit method that might help. You might want to check out
 the SVC example to see the sample_weight format.
 http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

 You can provide different weights to different classes (e.g.,
 inversely proportional to the number of samples in each class).

 -Manish

 On Jul 12, 2013, at 4:40 PM, Sergey Feldman sergeyfeld...@gmail.com
 wrote:

 I'm dealing with a 50-class classification problem with extremely
 unbalanced classes.  The smallest class has about 1000 samples and the
 largest has 500,000.  The random forest I've trained is being heavily
 skewed towards the big classes.

 Is there a good way to deal with this kind of problem in sklearn as of
 now?  Or is there room to implement some kind of stratified bootstrap
 strategy or a weighting strategy (as in here
 <http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf>,
 for example)?

 What other non-linear classifiers in sklearn would be good for this kind
 of dataset?  It's about 2 million examples in 500+ dimensions.

 Thanks,
 Sergey
