Re: [Scikit-learn-general] Weighted and Balanced Random Forests

2013-03-20 Thread Manish Amde
I have a follow-up question regarding the use of sample_weight when
fitting the RandomForestClassifier. Does the predict_proba method take the
sample weights (used during fitting) into account as well? I spent some
time trying to understand the _tree.pyx and tree.py files in the codebase,
but I am still a little fuzzy about how the predict_proba code works when
sample weights are present.

I have an unbalanced data set (1:12 ratio) and I find that the
probabilities are highly skewed towards the majority class even after using
sample weights.
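
A minimal sketch for checking this empirically, assuming a recent scikit-learn release. The toy data, the 12:1 weights, and the helper name minority_proba are illustrative assumptions, not from the thread. Because the only feature is constant, no tree can split, so predict_proba returns the root node's class fractions. If the fitting weights are taken into account, the weighted fit should report a larger minority-class probability than the unweighted one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data with one constant (uninformative) feature: every tree is a lone
# root node, so predict_proba exposes that node's class fractions directly.
X = np.zeros((13, 1))
y = np.array([1] + [0] * 12)  # 1:12 imbalance, as in the message above

# Assumed weighting: both classes carry the same total weight.
w = np.where(y == 1, 12.0, 1.0)

def minority_proba(sample_weight=None):
    """Hypothetical helper: fit a forest and return P(class 1) for one sample."""
    # bootstrap=False so that every tree sees the full (weighted) training set
    clf = RandomForestClassifier(n_estimators=10, bootstrap=False, random_state=0)
    clf.fit(X, y, sample_weight=sample_weight)
    return clf.predict_proba(X[:1])[0, 1]

p_unweighted = minority_proba()
p_weighted = minority_proba(w)
print(p_unweighted, p_weighted)
```

On releases where the leaves store weighted class totals, the two numbers should come out near 1/13 and 1/2 respectively. On real data the effect is less clean, since the weights also change where the trees split.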

I am planning to use isotonic regression to calibrate my predictions, but it
would be nice to have a less skewed input to the calibration algorithm.
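
A hedged sketch of the calibration step described above, assuming a recent scikit-learn release. The synthetic data from make_classification, the 50/50 split, and the 12:1 weights are all assumptions made for illustration. The forest is fitted on one half of the data and IsotonicRegression is fitted on the forest's scores for the other half.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the unbalanced data set (roughly 1:12).
X, y = make_classification(n_samples=2600, n_features=10,
                           weights=[12 / 13, 1 / 13], random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Fit the forest with weights that up-weight the minority class.
w_fit = np.where(y_fit == 1, 12.0, 1.0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_fit, y_fit, sample_weight=w_fit)

# Learn a monotone map from raw forest scores to observed class frequencies
# on data the forest has not seen.
raw = forest.predict_proba(X_cal)[:, 1]
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw, y_cal)
calibrated = iso.predict(raw)
```

In practice the fitted map would then be applied to scores for new data, for example iso.predict(forest.predict_proba(X_new)[:, 1]), where X_new is a placeholder name.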


Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Weighted and Balanced Random Forests

2013-02-08 Thread Jeff Elmore
I've been wrestling with this same issue in the regression case.

I realize it's not as straightforward to balance continuous data as it is
for discrete output classes.

But I wonder if this list has any thoughts about how it might be approached.

The data I'm predicting is normally distributed and, particularly when
sample sizes are small, the tails tend to be neglected and poorly predicted.

Thoughts?
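
One possible direction, sketched here as an assumption and not as something proposed in this thread: bin the continuous target and weight each sample inversely to how crowded its bin is, so the sparse tails count for more. The helper name inverse_density_weights and the choice of 10 bins are hypothetical.

```python
import numpy as np

def inverse_density_weights(y, n_bins=10):
    """Hypothetical helper: weight samples inversely to their target bin's size."""
    y = np.asarray(y, dtype=float)
    edges = np.histogram_bin_edges(y, bins=n_bins)
    # Interior edges map every target to a bin index in 0..n_bins-1.
    bin_idx = np.digitize(y, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)
    w = 1.0 / counts[bin_idx]
    return w * len(y) / w.sum()  # rescale so the weights sum to n_samples

rng = np.random.default_rng(0)
y = rng.normal(size=500)          # normally distributed target, as described
w = inverse_density_weights(y)
```

The resulting array could be passed as sample_weight to RandomForestRegressor.fit. Whether this actually improves predictions in the tails would need to be checked on held-out data.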




Re: [Scikit-learn-general] Weighted and Balanced Random Forests

2013-02-07 Thread Gilles Louppe
Hello,

You might achieve what you want by using sample weights when fitting
your forest (see the 'sample_weight' parameter). There is also a
'balance_weights' function in the preprocessing module that basically
generates sample weights for you, such that classes become balanced.

https://github.com/glouppe/scikit-learn/blob/master/sklearn/preprocessing.py#L1221

(This should appear in the reference, I'll fix that)
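
As an illustration of what "sample weights such that classes become balanced" can look like, here is a small NumPy sketch. The helper balanced_sample_weight is hypothetical and is not the library's balance_weights. Its formula, n_samples / (n_classes * class_count), is one common convention and is assumed here.

```python
import numpy as np

def balanced_sample_weight(y):
    """Hypothetical helper: weight each sample inversely to its class frequency."""
    y = np.asarray(y)
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    # n_samples / (n_classes * class_count): every class gets the same total weight
    per_class = len(y) / (len(classes) * counts)
    return per_class[inverse]

y = np.array([1] + [0] * 12)      # 1 minority sample, 12 majority samples
w = balanced_sample_weight(y)
print(w[y == 1].sum(), w[y == 0].sum())
```

The array w can then be passed to the forest as forest.fit(X, y, sample_weight=w). Recent scikit-learn releases also accept class_weight="balanced" directly on RandomForestClassifier.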

Hope this helps,

Gilles

On 8 February 2013 00:44, Manish Amde manish...@gmail.com wrote:
 Fellow sklearners,

 I am working on a classification problem with an unbalanced data set and
 have been successful using SVM classifiers with the class_weight option.

 I have also tried Random Forests and am getting decent ROC performance, but
 I am hoping to get a performance improvement by using Weighted or Balanced
 Random Forests as suggested in this paper.
 http://www.stat.berkeley.edu/tech-reports/666.pdf

 I don't see any implementation of these options but I might be mistaken so I
 wanted to ask the community. Also, I am willing to write code and contribute
 back if this will be useful to other folks.

 I have also thought about balancing the data by up/down-sampling the
 minority/majority class (with or without replacement) and even SMOTE, but
 couldn't find implementations of those in the scikit-learn library yet. The
 modified Random Forests seem to outperform these methods according to the
 paper, hence I am interested in trying those first.

 -Manish
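
A rough sketch of the random down-sampling option mentioned above, using plain NumPy. The helper name downsample_majority and the toy data are hypothetical. SMOTE and the Weighted/Balanced Random Forests from the paper are not shown.

```python
import numpy as np

def downsample_majority(X, y, random_state=0):
    """Hypothetical helper: keep as many samples per class as the rarest class has."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep row indices per class, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid returning the samples grouped by class
    return X[keep], y[keep]

X = np.arange(26, dtype=float).reshape(13, 2)
y = np.array([1] + [0] * 12)
X_bal, y_bal = downsample_majority(X, y)
```

Up-sampling the minority class would follow the same pattern with replace=True and n_keep set to counts.max().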



Re: [Scikit-learn-general] Weighted and Balanced Random Forests

2013-02-07 Thread Manish Amde
Thanks Gilles. This definitely helps. I am glad I asked. :-)

-Manish
