Re: [Scikit-learn-general] [ANN] scikit-learn 0.16.0 is out!
Hurray, great work everybody!

2015-03-27 19:51 GMT+01:00 Gael Varoquaux gael.varoqu...@normalesup.org:
Works for me. Could you try refreshing your browser cache (Ctrl+Shift+R on some browsers)?
Gaël

On Fri, Mar 27, 2015 at 06:23:06PM, Jason Sanchez wrote:
Update: for me, the stable documentation works, but the 0.16 documentation does not.
Works: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
Does not work: http://scikit-learn.org/0.16/auto_examples/cluster/plot_cluster_comparison.html

I have seen the updated images in both 0.16 and 0.15; the 0.16 algorithms show lower running times than in 0.15.
Wei

On Fri, Mar 27, 2015 at 1:14 PM, Jason Sanchez jason.sanchez.m...@statefarm.com wrote:
The documentation for the release does not seem to include any of the images. Perhaps this is just showing on my end. Example:
0.16: http://scikit-learn.org/0.16/auto_examples/cluster/plot_cluster_comparison.html
0.15: http://scikit-learn.org/0.15/auto_examples/cluster/plot_cluster_comparison.html

--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux

-- Peter Prettenhofer

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] Welcome new core contributors
gogogo team!!

2014-10-13 9:19 GMT+02:00 Arnaud Joly a.j...@ulg.ac.be:
Congratulations!
Arnaud

On 13 Oct 2014, at 03:13, Kyle Kastner kastnerk...@gmail.com wrote:
Thanks everyone! There are some nice new extensions planned for that algorithm (randomized SVD!) once I get a moment to submit the proper PR. I am happy to be able to contribute to such an awesome group :)

On Sun, Oct 12, 2014 at 3:55 PM, abhishek abhish...@gmail.com wrote:
Congrats Kyle! I was waiting for this eagerly.

On Oct 12, 2014 9:31 PM, Robert Layton robertlay...@gmail.com wrote:
Congrats!

On 13 October 2014 05:42, Manoj Kumar manojkumarsivaraj...@gmail.com wrote:
Thanks Gaël, it's a pleasure. Looking forward to learning and contributing more.

On Sun, Oct 12, 2014 at 5:24 PM, Gael Varoquaux gael.varoqu...@normalesup.org wrote:
I am happy to welcome new core contributors to scikit-learn:
- Alexander Fabisch (@AlexanderFabisch)
- Kyle Kastner (@kastnerkyle)
- Manoj Kumar (@MechCoder)
- Noel Dawe (@ndawe)
Thank you all for your hard work on scikit-learn, and welcome to the team!
Gaël

--
Godspeed, Manoj Kumar, Mech Undergrad
http://manojbits.wordpress.com

-- Peter Prettenhofer
Re: [Scikit-learn-general] Sparse Gradient Boosting Fully Corrective Gradient Boosting
A key advantage of using RuleFit [1] -- striking that they didn't cite it, by the way -- is that if you add the original features, your model can (a) better incorporate additive effects and (b) extrapolate; the latter is a limitation of any tree-based method like GBRT or RF.

[1] http://statweb.stanford.edu/~jhf/R-RuleFit.html

2014-09-22 20:48 GMT+02:00 Olivier Grisel olivier.gri...@ensta.org:
2014-09-21 10:46 GMT+02:00 Mathieu Blondel math...@mblondel.org:
On Sun, Sep 21, 2014 at 1:55 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
On a related note, here is an implementation of Logistic Regression applied to one-hot features obtained from the leaf membership info of a GBRT model:
http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb#Using-the-boosted-trees-to-extract-features-for-a-Logistic-Regression-model
This is inspired by this paper from Facebook: https://www.facebook.com/publications/329190253909587/. It's easy to implement and seems to work quite well.

What is the advantage of this method over using GBRT directly?

A significant improvement in F1-score for the positive/minority class and in ROC AUC on this dataset (Adult Census binarized income prediction with integer encoding of the categorical variables). Apparently the Facebook ads team reported the same kind of improvement on their own data.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
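The leaf-features trick Olivier describes can be sketched with public scikit-learn APIs: fit a GBRT, read out each sample's leaf per tree with `apply`, one-hot encode those leaf ids, and fit a logistic regression on top. The synthetic dataset and all parameter values below are illustrative assumptions, not taken from the notebook:

```python
# Sketch of GBRT leaf membership -> one-hot features -> LogisticRegression.
# Dataset and hyperparameters are illustrative, not from the thread.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(n_estimators=50, random_state=0)
gbrt.fit(X_train, y_train)

# leaf id of each sample in each tree; for binary problems the trailing
# class axis of apply() has size 1, so we drop it
leaves_train = gbrt.apply(X_train)[:, :, 0]
leaves_test = gbrt.apply(X_test)[:, :, 0]

# one-hot encode leaf membership and fit a linear model on top
enc = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves_train), y_train)
acc = lr.score(enc.transform(leaves_test), y_test)
```

The linear model re-weights the regions carved out by the trees, which is where the reported F1/AUC gains come from.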
Re: [Scikit-learn-general] Sparse Gradient Boosting Fully Corrective Gradient Boosting
The only reference I know is the Regularized Greedy Forest paper by Johnson and Zhang [1]; I haven't read the primary source (by Zhang as well).

[1] http://arxiv.org/abs/1109.0887

2014-09-16 15:15 GMT+02:00 Mathieu Blondel math...@mblondel.org:
Could you give a reference for gradient boosting with fully corrective updates? Since the philosophy of gradient boosting is to fit each tree against the residuals (or negative gradient) so far, I am wondering how such a fully corrective update would work...
Mathieu

On Tue, Sep 16, 2014 at 9:16 AM, c TAKES ctakesli...@gmail.com wrote:
Is anyone working on making Gradient Boosting Regressor work with sparse matrices? Or is anyone working on adding an option for fully corrective gradient boosting, i.e. all trees in the ensemble are re-weighted at each iteration? These are things I would like to see and may be able to help with if no one is currently working on them.

-- Peter Prettenhofer
Re: [Scikit-learn-general] Bug in OneClassSVM
Hi Luca,

it segfaults?! Can you confirm that it also segfaults if you use the default arguments? There is no plot, so I cannot say anything about the strange decision boundaries. For my part, I've never used anything other than an RBF kernel for a one-class SVM; the RBF kernel has the nice property that all data points lie on the surface of a hypersphere, so the minimum enclosing ball is just the hyperplane that separates those points from the origin with the maximum distance to the origin.

2014-09-15 10:58 GMT+02:00 Luca Puggini lucapug...@gmail.com:
Hi, I am having some problems with the OneClassSVM function. Here you can see my file and the output: http://justpaste.it/h3pw I am sorry, but I cannot share the data used. I have also experienced other problems, like strange decision boundaries. Can someone tell me if I am doing something wrong, or if there is a problem in the function?
Thanks, Luca

-- Peter Prettenhofer
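Since Luca's data cannot be shared, a minimal synthetic reproduction with the default RBF kernel is the natural next debugging step. The data and parameter values below are assumptions for illustration only:

```python
# Minimal OneClassSVM sanity check on synthetic data (not Luca's data):
# a Gaussian inlier cloud plus a batch of clearly distant points.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                            # inliers around the origin
X_far = rng.uniform(low=5.0, high=6.0, size=(20, 2))   # points far from the cloud

clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(X_train)

pred_train = clf.predict(X_train)  # +1 = inlier, -1 = outlier
pred_far = clf.predict(X_far)
```

If a setup like this runs cleanly, the segfault is more likely triggered by the specific arguments or data (e.g. NaNs or a degenerate kernel matrix) than by the estimator itself.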
Re: [Scikit-learn-general] outlier measure random forest
+1 -- looks like a very handy 3-liner :)

2014-09-08 16:14 GMT+02:00 Gilles Louppe g.lou...@gmail.com:
Hi Luca,

This may not be the fastest implementation, but random forest proximities can be computed quite straightforwardly in Python given our 'apply' function. See for instance:
https://github.com/glouppe/phd-thesis/blob/master/scripts/ch4_proximity.py#L12

From a personal point of view, I never use them, but since this is quite standard in other random forest implementations, it could be a nice little contribution. I don't know where it should go in scikit-learn, though, since it very much looks like a pairwise metric. What do other tree growers think?

Cheers, Gilles

On 8 September 2014 11:05, Luca Puggini lucapug...@gmail.com wrote:
Hi, for personal reasons I am writing a function to compute the outlier measure from random forests:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#outliers
With a little more work I can include the function in the sklearn random forest class. Is the community interested? Should I do it? I think this would be useful. The function is already available in Matlab:
http://www.mathworks.co.uk/help/stats/compacttreebagger-class.html
Let me know.
Best, Luca

-- Peter Prettenhofer
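The "handy 3-liner" Gilles refers to amounts to comparing leaf memberships across trees; a hedged sketch (dataset and forest size are illustrative assumptions):

```python
# Random forest proximities via the public `apply` API, as discussed above.
# proximity[i, j] = fraction of trees in which samples i and j share a leaf.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# leaf index of every sample in every tree; shape (n_samples, n_trees)
leaves = forest.apply(X)

# broadcast-compare leaf ids pairwise, then average over trees
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)
```

Note this materializes an (n_samples, n_samples, n_trees) boolean array, so for large datasets a chunked loop over tree columns would be needed.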
Re: [Scikit-learn-general] Libsvm, probabilities and weights
Thanks Mathieu, I agree -- a calibration module would be good to have anyway. I filed an issue on libsvm's GitHub account [1].

[1] https://github.com/cjlin1/libsvm/issues/13

2014-08-13 3:00 GMT+02:00 Mathieu Blondel math...@mblondel.org:
sample_weight support in scikit-learn comes from a libsvm patch:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances
So it would seem probability calibration was omitted from this patch :-( When our calibration module is ready, we could handle the calibration post-processing ourselves in pure Python. Could you report an issue?
Mathieu

On Wed, Aug 13, 2014 at 3:33 AM, Peter Prettenhofer peter.prettenho...@gmail.com wrote:
SVC doesn't take class/sample weights into account when calibrating probabilities -- this seems like a bug to me...
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/libsvm/svm.cpp#L1895
best, Peter

-- Peter Prettenhofer
Re: [Scikit-learn-general] Sparse data, SGD, and intercept_decay
The way I implemented it, the learning rate for the intercept is 0.01 times the learning rate for the other features. The value of 0.01 is something I set empirically: I adopted it from Léon Bottou's sgd project and experimented with different values. I found that lower intercept learning rates help a bit, but the concrete value is not too important, so I decided to use a fixed value. I think the decay value might in fact be a function of the number of non-zero values per feature. If you have a dataset with both sparse and dense features, then intercept decay should be turned off -- alternatively, you can scale the dense features to decrease their magnitude.

2014-07-30 11:42 GMT+02:00 Danny Sullivan dsulliv...@hotmail.com:
I found that for sparse data, the scikit implementation of SGD uses an intercept_decay variable set to 0.01 (SPARSE_INTERCEPT_DECAY) to avoid intercept oscillation. Shouldn't this be determined by the learning_rate instead? I'm asking because it adds a layer of tuning that the user doesn't have control over.
Danny

-- Peter Prettenhofer
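Peter's second workaround (shrink the dense features' magnitude) can be sketched as follows; the synthetic data is an assumption, and the choice of MaxAbsScaler (which preserves sparsity) is mine, not from the thread:

```python
# Mixed sparse/dense features for SGD: scale the large-magnitude dense
# column down without destroying sparsity. Data is synthetic.
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.RandomState(0)

# mostly sparse indicator-style features plus one dense, large-magnitude column
X_sparse = sp.random(300, 50, density=0.05, format="csr", random_state=rng)
dense_col = sp.csr_matrix(rng.uniform(0.0, 1000.0, size=(300, 1)))
X = sp.hstack([X_sparse, dense_col], format="csr")
y = rng.randint(0, 2, size=300)

# MaxAbsScaler divides each column by its max absolute value, so the
# dense column lands in [0, 1] and zeros stay zeros
X_scaled = MaxAbsScaler().fit_transform(X)
clf = SGDClassifier(random_state=0).fit(X_scaled, y)
```

After scaling, no single feature dominates the gradient updates, which is the same effect the intercept decay is trying to achieve for the (dense, always-active) intercept.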
Re: [Scikit-learn-general] Confidence score for each prediction from regressor
Hi Yogesh,

one of the few regressors in sklearn that supports this is GaussianProcess, but that won't scale to your problem. An alternative is to use a GradientBoostingRegressor with quantile loss to generate prediction intervals (see [1]) -- only for the keen; I once used it, unsuccessfully, in a Kaggle competition. It's not a confidence score, though -- it can only tell you whether a prediction falls within a band. Maybe one can derive a confidence score from Random Forests... I remember reading something along those lines in this survey [2].

best, Peter

[1] http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
[2] http://research.microsoft.com/apps/pubs/default.aspx?id=12

2014-07-22 19:52 GMT+02:00 Yogesh Pandit yogesh...@gmail.com:
Hello, I am working with regressors (sklearn.ensemble). The shape of my test data is (1121280, 452). I am wondering how I can associate a confidence score with the prediction for each sample from my test data. Any suggestions would be helpful.
Thank you, -Yogesh

-- Peter Prettenhofer
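The quantile-loss approach in [1] boils down to fitting one model per quantile; a hedged sketch on synthetic data (all values illustrative):

```python
# Prediction intervals via GradientBoostingRegressor with quantile loss,
# as in the example linked above. Data and parameters are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# one model per quantile; together they bound a (roughly) 90% prediction interval
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05,
                                  random_state=0).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95,
                                  random_state=0).fit(X, y)

X_new = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
band = np.c_[lower.predict(X_new), upper.predict(X_new)]  # [lower, upper] per row
```

The band width varies with the local noise level, which is what distinguishes this from a constant-width error bar, but as Peter notes it is an interval, not a per-sample confidence score.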
Re: [Scikit-learn-general] Confidence score for each prediction from regressor
I might be wrong, but it seems Mathieu is working on something similar for Ridge here: https://github.com/scikit-learn/scikit-learn/pull/3417

-- Peter Prettenhofer
Re: [Scikit-learn-general] scikit-learn 0.15.0 is out \o/
great work guys - thanks!

2014-07-15 13:18 GMT+02:00 Satrajit Ghosh sa...@mit.edu:
congrats all!
cheers, satra

On Tue, Jul 15, 2014 at 7:13 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
http://scikit-learn.org/stable/whats_new.html
Plenty of wheel packages on PyPI, and people rejoice :)
Thanks to all for your contributions!
I know the website is half incorrect (especially the 0.14/ directory, which has the 0.15 content). I screwed up again with rsync and symlinks. I am rebuilding a clean doc at the moment.
Best,
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
Re: [Scikit-learn-general] Getting decision tree regressor to predict using median not mean, of final subset
Hi James,

if you look at the LAD loss function in the gradient_boosting module, you can find an example of how to do it. Basically, you need to update the values array in the Tree extension type. Tree.apply(X_train) gives you the leaf that each training instance falls into.

HTH, Peter

Am 23.06.2014 13:48 schrieb James McMurray jamesmc...@gmail.com:
Hi, I want the decision tree regressor to predict using the median of the resulting subset from the tree, rather than the mean. Is there a simple way to do this? I looked at the code, but in sklearn/tree/tree.py the only relevant line is:

proba = self.tree_.predict(X)

where the prediction is already done (presumably in the Cython code). I don't have experience with Cython, so I'm not sure how to modify _tree.pyx to do this.
Many thanks, James McMurray
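Rather than patching the Cython values array in place, the same effect can be had from pure Python with `apply`: compute the median of the training targets in each leaf, then look it up at prediction time. A hedged sketch (dataset and tree depth are illustrative assumptions):

```python
# Median-per-leaf prediction for a regression tree, using the public
# `apply` API instead of modifying _tree.pyx. Dataset is illustrative.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# leaf id of every training sample, then the median target per leaf
train_leaves = reg.apply(X)
leaf_median = {leaf: np.median(y[train_leaves == leaf])
               for leaf in np.unique(train_leaves)}

def predict_median(tree, X_new):
    """Predict with the median (not the mean) of each sample's leaf."""
    return np.array([leaf_median[leaf] for leaf in tree.apply(X_new)])

pred = predict_median(reg, X)
```

The tree is still *grown* with the usual mean-based (MSE) criterion; only the leaf values are replaced at prediction time, which is exactly what the LAD loss in the gradient_boosting module does after each stage.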
Re: [Scikit-learn-general] My talk was approved for EuroScipy'14
congrats Gilles -- looking forward to your talk -- you should definitely make a blog post from your material (and benchmarks)!

2014-05-22 8:50 GMT+02:00 Vlad Niculae zephy...@gmail.com:
This is great news, congratulations Gilles!
Cheers, Vlad

On May 22, 2014 8:15 AM, Gilles Louppe g.lou...@gmail.com wrote:
Hi folks,
Just to let you know, my talk "Accelerating Random Forests in Scikit-Learn" was approved for EuroScipy'14. Details can be found at https://www.euroscipy.org/2014/schedule/presentation/9/. My slides are far from ready, but my intention is to present our team's efforts on the tree and ensemble modules, including along the way some of the lessons we have learned. In particular, I would like to thank @pprett, @arjoly, @larsmans, @ogrisel and @jnothman, who have contributed a lot these last months to improve these modules. Thanks guys!
Cheers, Gilles

-- Peter Prettenhofer
Re: [Scikit-learn-general] RandomForestClassifier w/ IPython.parallel
Hi Alessandro,

you might want to look into this presentation by Olivier: https://speakerdeck.com/ogrisel/growing-randomized-trees-in-the-cloud-1 -- it should be pretty much what you need. Code is here: https://github.com/pydata/pyrallel.

best, Peter

2014-02-07 23:28 GMT+01:00 Alessandro Gagliardi alessandro.gaglia...@glassdoor.com:
Hi All, I want to fit a large sklearn.ensemble.RandomForestClassifier (with maybe dozens or hundreds of trees and 100,000 samples). My desktop won't handle this, so I want to try using StarCluster. RandomForestClassifier seems to parallelize easily, but I don't know how I would split it across many IPython.parallel engines (if that's even possible). (Or maybe I should forgo IPython.parallel and use MPI?) Any help would be greatly appreciated.
Thanks, Alessandro Gagliardi | Glassdoor | alessan...@glassdoor.com

-- Peter Prettenhofer
Re: [Scikit-learn-general] joblib dump compression
Awesome - thanks guys! @Gael: I'll look into the single-file storage and submit a PR.

2014-02-02 Olivier Grisel olivier.gri...@ensta.org:
I recently contributed a fix to numpy master (to be part of numpy 1.9.0) to use the nditer API to stream buffers to non-'file' file objects: https://github.com/numpy/numpy/pull/4077
That should make it possible to refactor joblib to stream pickled data to GzipFile instances, or to use the zlib.compressobj API to do a single-file compressed joblib.dump without a memory copy. I had ongoing work to fix that issue, tracked at https://github.com/joblib/joblib/issues/66, but I had to stop to work on getting the threading backend into sklearn first. I plan to resume work on joblib/joblib#66 soonish (after Strata and the sklearn 0.15 release). There is also this PR, which is probably related (although I have not reviewed it in detail yet): https://github.com/joblib/joblib/pull/115
--
Olivier

-- Peter Prettenhofer
[Scikit-learn-general] joblib dump compression
Hi list,

sorry, but I didn't find a dedicated joblib mailing list, and since most of the joblib contributors hang around here I thought I'd give it a shot.

I'm using joblib to dump scikit-learn RF models. When using compression, is the output always guaranteed to be stored in a single file? I looked at the source and it seems to be this way, but there might be a corner case if the size of the object is too large?

thanks,
Peter

--
Peter Prettenhofer
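For reference, with current joblib the compressed case can be exercised directly: ``joblib.dump`` returns the list of file names it wrote, so the single-file behavior is easy to check (the file name and the stand-in "model" below are illustrative):

```python
import os
import tempfile

import joblib
import numpy as np

# A stand-in for a fitted RF model: a dict holding a large numpy array.
model = {"coef": np.arange(100_000, dtype=np.float64)}

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
written = joblib.dump(model, path, compress=3)  # zlib compression, level 3

print(len(written))  # number of files written; with compression, one file

restored = joblib.load(path)
```

In early joblib versions large arrays could spill into companion ``.npy`` files; inspecting the returned list is the quickest way to see what a given version actually does.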
Re: [Scikit-learn-general] Scikit-Learn for android
The structure of most learned models is pretty simple (e.g. linear models or decision tree ensembles). A linear classifier for text classification can simply be converted into a Python dictionary where the keys are terms and the values are the coefficients (``coef_``) of the linear classifier; using sparse regularization (L1) helps a lot to keep memory requirements low. Decision trees can be translated into a series of if-then-else statements that can be eval'ed (if you are brave).

best,
Peter

2014/1/20 Joel Nothman joel.noth...@gmail.com:

Do you have any specific use case in mind for running scikit-learn on Android? Maybe an interesting and more useful project instead would be to implement PMML (Predictive Model Markup Language) exporters.

Yes, I thought in this direction too (although last time I looked at PMML I got scared off). Most of the time you just want a model that can be trained offline and deployed on Android. I'm sure there are cases where an Android app will want to perform learning online, but it might be more sensible for the statistics to be collected on the Android device and pushed to a server for modelling.

On 20 January 2014 11:37, Vlad Niculae zephy...@gmail.com wrote:

I don't think Weka (at least the interesting parts of it) could run on Android either. I don't really foresee the whole Scipy stack running on Android; maybe one day when all dependencies are rewritten in PyPy and are faster and still 100% compatible... One thing that would be possible (but I don't know whether it would be useful for any appliers) would be to implement a prediction-only library, so you could develop models on your PC or in the cloud, download the pickled estimator and deploy it. However, I think people who need to do this end up writing a whole custom predictor, as it'd be more efficient. Do you have any specific use case in mind for running scikit-learn on Android?
Maybe an interesting and more useful project instead would be to implement PMML (Predictive Model Markup Language) exporters.

My 2c,
Vlad

On Mon Jan 20 00:24:16 2014, Olivier Grisel wrote:

2014/1/20 Tejas Nikumbh tejasniku...@gmail.com:
Hi guys, is there a way we can utilise scikit-learn in Android-based projects?

AFAIK, no.

If not, does this sound like a good idea for a project [possibly a GSoC project]? What might be the hurdles associated?

Trying to build scipy and its Fortran build and runtime dependencies on Android is going to be fun :)

--
Peter Prettenhofer
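Peter's dictionary-export idea from the top of the thread can be sketched as follows. The toy corpus and the ``predict_offline`` helper are illustrative, but ``coef_`` and ``vocabulary_`` are real scikit-learn attributes; only the dict and the intercept would need to ship to the device:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["good movie", "bad movie", "great film", "awful film", "good film"]
labels = [1, 0, 1, 0, 1]

vec = CountVectorizer()
clf = LogisticRegression(C=10.0).fit(vec.fit_transform(docs), labels)

# Export: term -> coefficient, dropping exact zeros. With an L1 penalty
# most coefficients would be exactly zero, shrinking the dict further.
weights = {term: float(clf.coef_[0, idx])
           for term, idx in vec.vocabulary_.items()
           if clf.coef_[0, idx] != 0.0}
intercept = float(clf.intercept_[0])

def predict_offline(text):
    # Pure-Python scoring: no numpy/scipy needed on the device.
    score = intercept + sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return 1 if score > 0 else 0
```

Because the scoring is a plain dot product over token counts, the offline predictions match ``clf.predict`` exactly whenever the device-side tokenization matches the vectorizer's.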
Re: [Scikit-learn-general] Releasing joblib 0.8a
Actually, I'd propose to turn off multiprocessing at prediction time; this might backfire quite easily.

2013/12/20 Olivier Grisel olivier.gri...@ensta.org:

2013/12/20 Vlad Niculae zephy...@gmail.com:
Works exactly as you described on my machine (which doesn't mean much because it's relatively close to yours, but I am just too enthusiastic about this not to reply! \o/). Memory usage is as expected. I see a speedup in train time but a slight slowdown in test time (1.7 vs 1.0); is that expected, or probably an artefact?

Threading is not (yet) used at test time, as the Cython code backing the predict method would need to be refactored to release the GIL to make threading efficient. So the performance decrease you observe might be caused by the new automated memmapping feature that dumps large arrays to share memory with the worker processes when the multiprocessing backend is used. Currently the threshold to trigger the automated memmapping is set to arrays of 1MB or larger. Maybe this is too small and we should trigger it only for arrays larger than 100MB, for instance. How big is the data array in your case? Is this the covertype benchmark?

--
Olivier

--
Peter Prettenhofer
Re: [Scikit-learn-general] Defining a custom correlation kernel for GaussianProcess in the form K(x, x')
Hi Ralf,

unfortunately, I cannot answer your question, but it would indeed be very valuable to allow custom correlation functions.

best,
Peter

2013/12/9 Ralf Gunter ralfgun...@gmail.com:

Hi all,
We're trying to use a custom correlation kernel with GP in the usual form K(x, x'). However, looking at the built-in correlation models (and how they're used by gaussian_process.py), it seems sklearn only takes models of the form K(theta, dx). There may very well be a reformulation of our K that depends only on (x - x'), but if so it would probably be highly non-trivial, as it depends on e.g. modified spherical Bessel functions evaluated at a scaled product of the xs. Is there any way to have the GP module take our kernel without modifying the GP code? I apologize if this has been asked/answered before; some searching on Google only led me to models that also depend only on (x - x').
Thanks!

--
Peter Prettenhofer
Re: [Scikit-learn-general] Decision tree nodes labels
Hi Caleb,

you need to extract the path from the decision tree structure ``DecisionTreeClassifier.tree_``. Take a look at the attributes ``children_left`` and ``children_right``; these encode the parent-child relationship. Extracting the path is very similar to finding the leaf node; you just need to keep track of the choices you made along the way. Just modify ``sklearn.tree._tree.Tree.apply`` [1] accordingly.

best,
Peter

[1] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1907

2013/12/8 Caleb cloverev...@yahoo.com:

Hi everyone,
Given an instance (x_1, x_2, ..., x_n), I want to know what about it makes the decision tree assign it to a certain class, i.e. something like "x_1 > a, x_3 < b, ... => x is of class C". I notice that .apply can return the id of the leaf node that the instance falls in, but can I get the path from the root node down to this leaf node? Any idea?
- Caleb
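The traversal Peter describes can also be sketched in pure Python on top of the public ``tree_`` arrays, without touching the Cython ``apply``. The dataset and helper name below are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = clf.tree_

def decision_path(x):
    """Walk root -> leaf; return (leaf_id, [(feature, threshold, went_left)])."""
    node, path = 0, []
    while t.children_left[node] != -1:  # -1 marks a leaf in both child arrays
        feature, threshold = t.feature[node], t.threshold[node]
        went_left = x[feature] <= threshold  # sklearn sends "<= threshold" left
        path.append((int(feature), float(threshold), bool(went_left)))
        node = t.children_left[node] if went_left else t.children_right[node]
    return node, path

leaf, path = decision_path(X[0])
print(leaf == clf.apply(X[:1])[0])  # True: same leaf as the built-in apply
```

Each ``(feature, threshold, went_left)`` triple is exactly one "x_i <= a" or "x_i > a" condition on the path, so the rule Caleb wants falls out of ``path`` directly.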
Re: [Scikit-learn-general] Spark-backed implementations of scikit-learn estimators
Great news; looking forward to the outcome of the sprint!

2013/12/4 Olivier Grisel olivier.gri...@ensta.org:

I meant San Francisco...

--
Olivier

--
Peter Prettenhofer
[Scikit-learn-general] Array memory layout and slicing
Hi all,

I'm currently modifying our tree code so that it runs on both Fortran- and C-contiguous arrays. After some benchmarking I became aware of the following numpy behavior, which was contrary to what I was expecting::

    >>> X = ...  # some feature matrix
    >>> X = np.asfortranarray(X)
    >>> X.flags.f_contiguous
    True
    >>> # so far so good
    >>> X_train = X[:1000]
    >>> X_train.flags.f_contiguous
    False
    >>> X_train.flags.c_contiguous
    False
    >>> # damn - seems like a view is neither C nor Fortran contiguous
    >>> X_train = X_train.copy()  # let's materialize the view
    >>> X_train.flags.f_contiguous
    False
    >>> X_train.flags.c_contiguous
    True

In the tree code, I check whether an array is contiguous; if not, I call ``np.asarray`` and set the ``order`` according to ``flags.f_contiguous`` or ``flags.c_contiguous``. However, in the case of views that does not work. How would you handle this case?

thanks,
Peter

--
Peter Prettenhofer
Re: [Scikit-learn-general] Array memory layout and slicing
2013/11/26 Olivier Grisel olivier.gri...@ensta.org:

2013/11/26 Peter Prettenhofer peter.prettenho...@gmail.com:
[question about f_contiguous views quoted above snipped]

Only if you slice the rows of a Fortran-aligned 2D array; this is expected. If you slice the rows of a C-contiguous 2D array or the columns of an F-contiguous 2D array, it stays contiguous.

Actually, now that I think about it, it totally makes sense -- next time I'll think before I write ;-) thanks guys!

::

    >>> import numpy as np
    >>> a_c = np.arange(12).reshape(3, 4)
    >>> a_f = np.asfortranarray(a_c)
    >>> a_c.flags
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False
    >>> a_f.flags
      C_CONTIGUOUS : False
      F_CONTIGUOUS : True
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False
    >>> a_c[1:].flags
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False
    >>> a_f[:, 1:].flags
      C_CONTIGUOUS : False
      F_CONTIGUOUS : True
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

--
Peter Prettenhofer
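The case Peter ran into, and the fix his tree code needs, can be condensed as follows (array shape illustrative): a row slice of an F-ordered 2D array carries neither contiguity flag, and ``np.asarray`` with an explicit ``order`` materializes a contiguous copy only when one is needed:

```python
import numpy as np

X = np.asfortranarray(np.arange(12.0).reshape(3, 4))
view = X[:2]  # row slice of an F-ordered 2D array

# The view is neither C- nor F-contiguous: the row stride still reflects
# the 3-row parent, so no contiguous layout matches.
print(view.flags.f_contiguous, view.flags.c_contiguous)  # False False

# Requesting an explicit order handles the view case: copy iff necessary.
X_train = np.asarray(view, order="F")
print(X_train.flags.f_contiguous)  # True
```

So instead of branching on the (possibly both-False) flags of the input, the caller can simply state the order the Cython code requires.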
Re: [Scikit-learn-general] GradientBoostingRegressor with huber-loss and subsampling
Hi Johannes,

the bug was fixed recently; please use master until the 0.15 release is out.

Best,
Peter

Am 19.11.2013 16:33 schrieb hannithebunny hannithebu...@hotmail.de:

Hi,
in previous versions of scikit-learn I used GradientBoostingRegressor with the parameters loss='huber' and subsample=0.8. After updating sklearn to version 0.14.1, I can use the 'huber' loss function only if subsample=1.0. For e.g. subsample=0.8 the error message below is displayed::

    >>> reg = GradientBoostingRegressor(loss='huber', subsample=0.8)
    >>> reg.fit(X, y)
    Traceback (most recent call last):
      File "C:\Users\xxx\GradientTreeRegressor.py", line 109, in <module>
        reg.fit(X, y)
      File "C:\Python27\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1126, in fit
        return super(GradientBoostingRegressor, self).fit(X, y)
      File "C:\Python27\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 609, in fit
        y_pred[~sample_mask])
      File "C:\Python27\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 253, in __call__
        gamma = self.gamma
    AttributeError: 'HuberLossFunction' object has no attribute 'gamma'

Any help? Thanks and best regards,
Johannes
Re: [Scikit-learn-general] Benchmarking non-negative least squares solvers, work in progress
SGDClassifier adopted the parameter names of ElasticNet (which has been around in sklearn for longer) for consistency reasons. I agree that we should strive for concise and intuitive parameter names such as ``l1_ratio``. Naming in sklearn is actually quite unfortunate, since the popular R package glmnet uses ``alpha`` for the ``l1_ratio``...

2013/11/8 Thomas Unterthiner thomas.unterthi...@gmx.net:

Just my $0.02 as a user: I was also confused/put off by `alpha` and `l1_ratio` when I first explored SGDClassifier. I found those names to be pretty inconsistent, plus I tend to call my regularization parameters `lambda` and use `alpha` for learning rates. I'm sure other people associate yet other meanings with alpha, or use other names for the regularization parameter. `l1_reg`/`l2_reg` would be much better, more concise names; it would be nice if those could be used throughout sklearn.
Cheers,
Thomas

On 2013-11-08 09:20, Vlad Niculae wrote:

Re: the discussion we had at PyCon.fr, I noticed that the internal elastic net coordinate descent functions are parametrized with `l1_reg` and `l2_reg`, but the exposed classes and functions have `alpha` and `l1_ratio`. Only yesterday there was somebody on IRC who couldn't match Ridge with ElasticNet because of this parametrization.

On Fri, Nov 8, 2013 at 9:02 AM, Olivier Grisel olivier.gri...@ensta.org wrote:

About the LBFGS-B residuals (non-)issue: I was probably confused by the overlapping curves on the plot and misinterpreted the location of the PG-l1 and PG-l2 curves.

--
Olivier

--
Peter Prettenhofer
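For anyone bitten by the parametrization Vlad mentions: the exposed ``alpha``/``l1_ratio`` pair and an ``l1_reg``/``l2_reg`` pair are related by a simple bijection (glossing over the 1/2 factor on the l2 term and the per-sample scaling the internal solver applies). A small illustrative converter:

```python
def to_regs(alpha, l1_ratio):
    """(alpha, l1_ratio) -> (l1_reg, l2_reg): split one strength by the mix ratio."""
    return alpha * l1_ratio, alpha * (1.0 - l1_ratio)

def to_alpha_ratio(l1_reg, l2_reg):
    """(l1_reg, l2_reg) -> (alpha, l1_ratio): total strength and l1 fraction."""
    alpha = l1_reg + l2_reg
    return alpha, (l1_reg / alpha if alpha else 0.0)

print(to_regs(1.0, 0.5))         # (0.5, 0.5)
print(to_alpha_ratio(0.5, 0.5))  # (1.0, 0.5)
# l1_ratio=0.0 is pure l2 (Ridge-like), l1_ratio=1.0 is pure l1 (Lasso-like)
```

This also makes the glmnet clash concrete: glmnet's ``alpha`` plays the role of ``l1_ratio`` here, while its ``lambda`` plays the role of sklearn's ``alpha``.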
Re: [Scikit-learn-general] LambdaMART implementation and gbm comparison
Hi Jacques,

very exciting; this was on my wish list for quite a while. Maybe we should start by creating a PR upfront so that we can discuss things there; better than using the mailing list (quite a lot of traffic already). The most important part of adding LambdaMART to sklearn is fleshing out an API for learning-to-rank problems (i.e. we need to group samples by query id); based on past experience this will take a while ;-). We should sync with Mathieu, Olivier, and Fabian; if I remember correctly, we discussed this a while ago.

I've been reading through the GBM code lately to look at their best-first tree-building heuristic (again), so we can definitely share experience there; the source code is sometimes a bit verbose... We should definitely take a look at RankLib; it seems to be doing pretty well here [1]. Otherwise, I too bench against gbm, since it is IMHO the reference implementation of GBRT and a pretty good one at that. IMHO part of the success of certain ML methods stems from the availability of high-quality implementations; gbm definitely counts as one, libsvm/liblinear too.

[1] http://www.kaggle.com/c/expedia-personalized-sort/forums/t/6228/my-approach

best,
Peter

PS: Lucas Eustaquio pointed me to a Python LambdaMART implementation that uses sklearn.tree.DecisionTreeRegressor: https://github.com/discobot/LambdaMart/blob/acb8329ab63a45d2bcb43055fa54f14b8c6725c1/mart.py

2013/11/6 Jacques Kvam jwk...@gmail.com:

Hello scikit-learn,

I recently wrote up an implementation of the LambdaMART algorithm on top of the existing gradient boosting code (thanks for the great base of code to work with, btw). It currently only supports NDCG, but it would be easy to generalize. That's kind of beside the point, however. Before I even think about putting together a PR, I wanted to compare it against the gbm package. I'm aware of Java implementations like jforest and RankLib, but gbm's interface seems closest to sklearn's, so that's what I want to use.

Unfortunately, whenever I try to use NDCG it segfaults on me, or I get an error in split.default, depending on where I specify the group variable. I realize this isn't an R list, but I was hoping someone could shed some light for me. I'm using the supervised MQ2007 and MQ2008 datasets from https://research.microsoft.com/en-us/um/beijing/projects/letor//letor4download.aspx and my test code is here: https://gist.github.com/jwkvam/7332448. I simply use Python to transform the given train.txt file into a csv so I can load it in R. I'm using gbm 2.1 and I've tried R 2.15.3 and 3.0.2. Alternatively, can I easily transform my gbm.fit() call to use the gbm() interface? Sorry, I'm kind of a newbie when it comes to R.

I saw there's also this standing issue, but it doesn't look like there's been a lot of movement on it: https://code.google.com/p/gradientboostedmodels/issues/detail?id=28q=pairwise

Thanks,
Jacques
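Since NDCG is the one metric Jacques's implementation supports so far, a small reference implementation is handy for cross-checking against gbm or RankLib. This sketch uses the gain/discount convention common in LETOR-style evaluation (gain 2^rel - 1, discount log2(rank + 1)); helper names are illustrative:

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked relevance list, truncated at k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # ranks 1..k discount by log2(rank + 1), i.e. log2 of 2, 3, ...
    return float(np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2))))

def ndcg_at_k(y_true, y_score, k=10):
    """NDCG@k for one query: DCG of the score-induced ranking over the ideal DCG."""
    order = np.argsort(y_score)[::-1]           # ranking induced by the scores
    dcg = dcg_at_k(np.asarray(y_true)[order], k)
    ideal = dcg_at_k(np.sort(y_true)[::-1], k)  # best possible ordering
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0], [0.9, 0.7, 0.4, 0.1], k=4))  # 1.0 (perfect ranking)
```

A full learning-to-rank score would average this per-query value over all query ids, which is exactly where the grouping API Peter mentions comes in.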
Re: [Scikit-learn-general] release time
Given that snow will arrive late, I too should be able to get some stuff done. I want to get #2570 to MRG within one week so that we have plenty of time to review and tweak. Furthermore, I want to have a look at supporting different dtypes for SGD. @Olivier: I will team up with you on reviewing MARS.

best,
Peter

2013/11/6 Lars Buitinck larsm...@gmail.com:

2013/11/6 Olivier Grisel olivier.gri...@ensta.org:

I can help prepare the release by going through the open issues and pull requests on GitHub and making a summary next week. All three PRs highlighted by Gilles seem very important to me. I started reading the ESLII chapter on MARS to help with the review of the PR (I got interrupted by 2 conferences but will resume soon :). As for the timing of the release, I have no strong opinion. Let's target the end of the year for a start and decide later if we need to shift the release date to January.

I have time in the second half of December.

--
Peter Prettenhofer
Re: [Scikit-learn-general] SGDRegressor.sparsify() = ValueError: dimension mismatch
Hi Eustache,

that's quite a bug; thanks for reporting. I fixed it and added a sparsify test to test_common.py, pushed directly to master.

thanks,
Peter

2013/11/4 Eustache DIEMERT eusta...@diemert.fr:

Hi List,

I'm currently working on some performance documentation [1] and I wanted to micro-benchmark the dense vs. sparse coefficients case. I created a self-contained script and wanted to bench it using line_profiler, but it seems that after the call to `sparsify()` my SGDRegressor can't predict anymore (it crashes with a dimension mismatch error). Here is a gist to reproduce it: [2]. The weird thing is that the coef_ attribute changes shape after the call to sparsify: (30,) -> (1, 30), where 30 equals n_features in my case. Any idea or explanation welcome!

[1] https://github.com/scikit-learn/scikit-learn/pull/2488
[2] https://gist.github.com/oddskool/7300982

PS: The stack trace::

    Traceback (most recent call last):
      File "/usr/local/bin/kernprof.py", line 233, in <module>
        sys.exit(main(sys.argv))
      File "/usr/local/bin/kernprof.py", line 221, in main
        execfile(script_file, ns, ns)
      File "sparsity_benchmark.py", line 52, in <module>
        score(y_test, clf.predict(X_test), 'sparse model')
      File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 903, in predict
        return self.decision_function(X)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 888, in decision_function
        scores = safe_sparse_dot(X, self.coef_) + self.intercept_
      File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.py", line 190, in safe_sparse_dot
        ret = a * b
      File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 311, in __rmul__
        return (self.transpose() * tr).transpose()
      File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 278, in __mul__
        raise ValueError('dimension mismatch')
    ValueError: dimension mismatch

--
Peter Prettenhofer
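For reference, the intended behavior of ``sparsify()``, the one the fix restores, is that ``coef_`` becomes a scipy sparse matrix while predictions are unchanged. A quick check on illustrative toy data:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 30)
y = X @ rng.rand(30)

reg = SGDRegressor(penalty="l1", random_state=0).fit(X, y)
dense_pred = reg.predict(X)

reg.sparsify()  # converts coef_ to a scipy sparse (CSR) matrix in place
sparse_pred = reg.predict(X)

print(sp.issparse(reg.coef_))  # True
```

With an L1 penalty many entries of ``coef_`` are exactly zero, so the sparse representation is also the memory win the benchmark in [1] is after.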
Re: [Scikit-learn-general] C integer types: the missing manual
Hi Lars,

thanks heaps! You should post this to the Planet SciPy RSS feed; I'm sure many people share(d) my confusion about the topic.

best,
Peter

2013/10/23 Lars Buitinck larsm...@gmail.com:

Dear all,
I promised some time ago to write a guideline for using C integer types in Cython code. Here's a start; it is currently on the wiki instead of in a PR because of its rough state.
https://github.com/scikit-learn/scikit-learn/wiki/C-integer-types:-the-missing-manual
Regards,
Lars

--
Peter Prettenhofer
Re: [Scikit-learn-general] C integer types: the missing manual
On the website it says: "To ask for your feed to be added to the planet, email Gael Varoquaux".

2013/10/23 Lars Buitinck larsm...@gmail.com:

2013/10/23 Peter Prettenhofer peter.prettenho...@gmail.com:
You should post this to the Planet SciPy RSS feed - I'm sure many people share(d) my confusion about the topic.

How does that work?

--
Peter Prettenhofer
Re: [Scikit-learn-general] GradientBoostingRegressor with LogisticRegression
Hi Attila, please use the following adaptor::

    def __init__(self, est):
        self.est = est

    def predict(self, X):
        return self.est.predict_proba(X)

    def fit(self, X, y):
        self.est.fit(X, y)

The one in the stackoverflow question returns an array of shape (n_samples,) but it should rather be (n_samples, n_classes). PS: I still need to fix the init issue but any solution will most likely make the GBRT slower at prediction time (especially for single instance prediction). best, Peter

2013/10/22 Attila Balogh attila.bal...@gmail.com Hi all, first of all thanks to all the developers for working on scikit-learn, it is a wonderful library. I have been struggling for a while now with the following problem: trying to use GBR with LR as a BaseEstimator, I'm getting the following error::

    File main.py, line 110, in main
      score = np.mean(cross_validation.cross_val_score(rd, X, y, cv=4, scoring='roc_auc'))
    File C:\Python27\lib\site-packages\sklearn\cross_validation.py, line 1152, in cross_val_score
      for train, test in cv)
    File C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py, line 517, in __call__
      self.dispatch(function, args, kwargs)
    File C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py, line 312, in dispatch
      job = ImmediateApply(func, args, kwargs)
    File C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py, line 136, in __init__
      self.results = func(*args, **kwargs)
    File C:\Python27\lib\site-packages\sklearn\cross_validation.py, line 1060, in _cross_val_score
      estimator.fit(X_train, y_train, **fit_params)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 890, in fit
      return super(GradientBoostingClassifier, self).fit(X, y)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 613, in fit
      random_state)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 486, in _fit_stage
      sample_mask, self.learning_rate, k=k)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 172, in update_terminal_regions
      y_pred[:, k])
    IndexError: too many indices

I have found a similar problem on stackoverflow (http://stackoverflow.com/questions/17454139/gradientboostingclassifier-with-a-baseestimator-in-scikit-learn) and tried to implement the adaptor but it didn't help, the error remained the same. Does anyone have any ideas how to resolve this? Cheers; Attila

-- Peter Prettenhofer
Re: [Scikit-learn-general] GradientBoostingRegressor with LogisticRegression
Right, I thought you were using the multi-class loss function. Please send me a testcase so that I can investigate the issue. thanks, Peter

2013/10/22 Attila Balogh attila.bal...@gmail.com Hi Peter, thanks for your answer. I have tried this before also, and the problem is that in this case I get ValueError: operands could not be broadcast together with shapes (74) (148), because the y array is raveled and it has shape (74,2). Do you need a self-contained testcase which reproduces this error? Cheers; Attila

-- Peter Prettenhofer
Re: [Scikit-learn-general] GradientBoostingRegressor with LogisticRegression
Ok, below is the adaptor that will work. The code requires that the output of predict is 2d. Thanks for the test-case. best, Peter ::

    class Adaptor(object):
        def __init__(self, est):
            self.est = est

        def predict(self, X):
            return self.est.predict_proba(X)[:, np.newaxis]

        def fit(self, X, y):
            self.est.fit(X, y)

2013/10/22 Peter Prettenhofer peter.prettenho...@gmail.com Right, I thought you were using the multi-class loss function. Please send me a testcase so that I can investigate the issue. thanks, Peter

-- Peter Prettenhofer
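The adaptor pattern above can be exercised without scikit-learn installed. The sketch below is illustrative only: ``DummyProba`` is a hypothetical stand-in for LogisticRegression, and ``InitAdaptor`` mirrors the Adaptor from this thread. The key point is that the init estimator's ``predict`` must return a 2-d, (n_samples, n_classes)-shaped result.

```python
# Minimal sketch of the adaptor idea from this thread, using a stand-in
# estimator instead of sklearn's LogisticRegression so it runs standalone.

class DummyProba:
    """Hypothetical stand-in for a probabilistic classifier."""
    def fit(self, X, y):
        # memorize the positive-class rate; enough for the shape demo
        self.p = sum(y) / float(len(y))
        return self

    def predict_proba(self, X):
        # (n_samples, n_classes) list-of-lists, like sklearn would return
        return [[1.0 - self.p, self.p] for _ in X]

class InitAdaptor:
    """Wraps an estimator so its `predict` output is 2-d."""
    def __init__(self, est):
        self.est = est

    def fit(self, X, y):
        self.est.fit(X, y)
        return self

    def predict(self, X):
        return self.est.predict_proba(X)

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
adaptor = InitAdaptor(DummyProba()).fit(X, y)
pred = adaptor.predict(X)
print(len(pred), len(pred[0]))  # 4 2
```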
Re: [Scikit-learn-general] linear_model.SGDClassifier(): ValueError: ndarray is not C-contiguous when calling partial_fit()
great - thanks Lars - will prepare a PR

2013/10/9 Lars Buitinck larsm...@gmail.com 2013/10/8 Peter Prettenhofer peter.prettenho...@gmail.com: that's a bug - I'll open a ticket for it. A quick fix: call partial_fit instead of fit just before the ``for`` loop. Peter, is this due to an optimization that turns coef_ into a Fortran-ordered array? If so, I don't think we need it any longer with NumPy 1.7 and the new sklearn.extmath.fast_dot::

    In [1]: X = np.random.randn(1, 200)
    In [2]: Y = np.random.randn(200, 70)
    In [3]: %timeit np.dot(X, Y)
    100 loops, best of 3: 16.5 ms per loop
    In [4]: Yf = asfortranarray(Y)
    In [5]: %timeit np.dot(X, Yf)
    100 loops, best of 3: 16.7 ms per loop
    In [6]: numpy.__version__
    Out[6]: '1.7.1'

-- Peter Prettenhofer
Re: [Scikit-learn-general] linear_model.SGDClassifier(): ValueError: ndarray is not C-contiguous when calling partial_fit()
Hi Tom, that's a bug - I'll open a ticket for it. A quick fix: call partial_fit instead of fit just before the ``for`` loop. - Peter

2013/10/4 Tom Kenter tom.ken...@uva.nl Dear all, I am trying to run a linear_model.SGDClassifier() and have it update after every example it classifies. My code works for a small feature file (10 features), but when I give it a bigger feature file (some 8 features, but very sparse) it keeps giving me errors straight away, the first time partial_fit() is called. This is what I do in pseudocode::

    X, y = load_svmlight_file(train_file)
    classifier = linear_model.SGDClassifier()
    classifier.fit(X, y)
    for every test_line in test file:
        test_X, test_y = getFeatures(test_line)  # This gives me a Python list for X
                                                 # and an integer label for y
        print "prediction: %f" % classifier.predict([test_X])
        classifier.partial_fit(csr_matrix([test_X]), csr_matrix([Y_GroundTruth]),
                               classes=np.unique(y))

The error I keep getting for the partial_fit() line is::

    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 487, in partial_fit
      coef_init=None, intercept_init=None)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 371, in _partial_fit
      sample_weight=sample_weight, n_iter=n_iter)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 451, in _fit_multiclass
      for i in range(len(self.classes_)))
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py, line 517, in __call__
      self.dispatch(function, args, kwargs)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py, line 312, in dispatch
      job = ImmediateApply(func, args, kwargs)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py, line 136, in __init__
      self.results = func(*args, **kwargs)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 284, in fit_binary
      est.power_t, est.t_, intercept_decay)
    File sgd_fast.pyx, line 327, in sklearn.linear_model.sgd_fast.plain_sgd (sklearn/linear_model/sgd_fast.c:7568)
    ValueError: ndarray is not C-contiguous

I also tried feeding partial_fit() Python arrays, or numpy arrays (which are C-contiguous (order='C') by default, I thought), but this gives the same result. The classes attribute is not the problem I think. The same error appears if I leave it out or if I give the right classes in hard code. I do notice that when I print the flags of the coef_ array of the classifier, it says::

    C_CONTIGUOUS : False
    F_CONTIGUOUS : True
    OWNDATA : True
    WRITEABLE : True
    ALIGNED : True
    UPDATEIFCOPY : False

I am sure I am doing something wrong, but really, I don't see what... Any help appreciated! Cheers, Tom

-- Peter Prettenhofer
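The flags Tom prints are the root cause in miniature: coef_ has been laid out in Fortran (column-major) order, while the Cython SGD routine insists on C order. A minimal numpy demonstration of the mismatch and the generic fix (``np.ascontiguousarray``), independent of sklearn internals:

```python
import numpy as np

# Reproduce the flag pattern from the error report: a Fortran-ordered
# 2-d array is F_CONTIGUOUS but not C_CONTIGUOUS.
coef = np.asfortranarray(np.zeros((3, 5)))
print(coef.flags['C_CONTIGUOUS'], coef.flags['F_CONTIGUOUS'])  # False True

# Forcing a C-ordered copy restores the layout the Cython code expects.
coef_c = np.ascontiguousarray(coef)
print(coef_c.flags['C_CONTIGUOUS'])  # True
```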
Re: [Scikit-learn-general] Right place for a time-series focused algorithm?
2013/9/26 Kyle Kastner kastnerk...@gmail.com I had not thought about use inside a Pipeline - though now that you mention it, that seems like the ideal use case for an algorithm like this. Is this the PR you mentioned? https://github.com/scikit-learn/scikit-learn/pull/1454 As far as lagged features transformer - are we talking about rolling statistics? Something similar to pandas rolling_mean, rolling_apply, etc.? I have poorly reimplemented that using ```stride_tricks``` more times than I probably should have... well... I was mostly thinking of fx val at lag_1, fx at lag_2, ... so feature values at previous time steps. I will work up a gist for SAX in the next few days, and post it here. There is a nice demo of turning time-series into bitmaps which I rather like. If I linked the right issue above, I will try to hop in there and catch up on the changes. Resampling in the pipeline also opens the door for very interesting things from a time-series perspective... Kyle On Thu, Sep 26, 2013 at 6:10 AM, Olivier Grisel olivier.gri...@ensta.org wrote: 2013/9/25 Peter Prettenhofer peter.prettenho...@gmail.com: [...] I would start by implementing lagged features transformer as a gist or as an example script to experiment how it would (or not) fit with the current scikit-learn API. We might have a problem though: the current Pipeline tool does not support changing the number of samples in the data, which would probably be required for TS forecasting stuff.
We have a similar issue for resampling transformers (for instance for dealing with class imbalance). We should probably make the Pipeline more flexible first to be able to properly address TS tasks. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
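A lagged-features transformer of the kind discussed here ("feature values at previous time steps") can be sketched in a few lines. The function name and API below are hypothetical, not an existing sklearn interface. Note how the output has fewer rows than the input series - exactly the change in n_samples that the current Pipeline cannot express.

```python
def lagged_features(series, lags):
    """Turn a 1-d series into rows [x[t-lag_1], x[t-lag_2], ...] with target x[t]."""
    max_lag = max(lags)
    X, y = [], []
    for t in range(max_lag, len(series)):
        X.append([series[t - lag] for lag in lags])
        y.append(series[t])
    return X, y

series = [1, 2, 3, 4, 5, 6]
X, y = lagged_features(series, lags=[1, 2])
print(X)  # [[2, 1], [3, 2], [4, 3], [5, 4]]
print(y)  # [3, 4, 5, 6]
```

Six input values yield only four training rows, so a pipeline step wrapping this would have to shrink both X and y in transit.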
Re: [Scikit-learn-general] Right place for a time-series focused algorithm?
Hi Kyle, personally, I'd love to see SAX in sklearn or any other python library that I could easily use with sklearn. We don't have any time-series specific functionality yet (eg. lagged features transformer). So if we choose to add time-series functionality we should also consider the basics. Let's hear what the others say about this. PS: I'd not put it into decomposition but rather feature_extraction.tseries or something along those lines. best, Peter 2013/9/25 Kyle Kastner kastnerk...@gmail.com I have recently been working with time-series data extensively and looking at different ways to model, classify, and predict different types of time-series. One algorithm I have been playing with is called SAX (http://www.cs.ucr.edu/~eamonn/SAX.htm). It is a very straightforward algorithm (basically windowed mean with no overlap, then quantize into M levels), and I have implemented a rough version using numpy. Despite its simplicity, it is shown as being an effective data dependent transform, similar in some ways to the DWT. I think this algorithm would be a nice tie-in to sklearn, which could allow for more of sklearn's algorithms to be used on time-series type data. Also, the algorithm makes very strong claims about indexing massive datasets, finding similarities and outliers, which are all things I am planning to explore in the future. I know that FastICA is under decomposition, and is often seen in a time-series context - would symbolic aggregation fall into the decomposition camp as well? Is sklearn even the right place for this? Kyle
-- Peter Prettenhofer
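The SAX transform Kyle describes (windowed mean with no overlap, then quantize into M levels) is short enough to sketch in plain Python. This is a rough sketch, not the canonical implementation from the linked page; the breakpoints used here are the quartiles of the standard normal distribution, the standard choice for a 4-symbol alphabet after z-normalization.

```python
def sax(series, n_segments, alphabet="abcd"):
    """Symbolic Aggregate approXimation: z-normalize, PAA, then quantize."""
    n = len(series)
    # 1) z-normalize the series
    mean = sum(series) / float(n)
    std = (sum((x - mean) ** 2 for x in series) / n) ** 0.5 or 1.0
    z = [(x - mean) / std for x in series]
    # 2) Piecewise Aggregate Approximation: mean of equal-width windows
    seg = n // n_segments
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(n_segments)]
    # 3) quantize against N(0,1) quartile breakpoints (4 equiprobable bins)
    breakpoints = [-0.6745, 0.0, 0.6745]
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

print(sax([0, 0, 0, 0, 10, 10, 10, 10], 2))  # ad
print(sax(list(range(16)), 4))               # abcd
```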
Re: [Scikit-learn-general] Representing classifiers outside of Python
We don't have a PMML interface yet [1] - so you need to write custom code to extract the internal state of each individual classifier. What do you mean by performance critical (1ms, 1ms)? Do you make predictions per sample or can you buffer samples and make predictions for batches? In general, what kills performance is the overhead of python function calls - it's usually way larger than the actual prediction (which usually happens in C-land). [1] http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language

2013/9/23 Fred Baba fred.b...@gmail.com I'd like to use classifiers trained via sklearn in a real-time, performance critical application. How do I access the internal representation of trained classifiers? For linear classifiers/regressions, I can simply store the coefficients and generate the linear combination myself. For tree regressions, I can use sklearn.tree.export_graphviz. Ideally there would be an export facility for all classifiers (particularly for examining the structure of generated models). Is there a general way to do this?

-- Peter Prettenhofer
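To make Fred's point about linear models concrete: "storing the coefficients and generating the linear combination myself" is just a dot product plus the intercept. The numbers below are made up for illustration; in practice they would come from ``clf.coef_`` and ``clf.intercept_``.

```python
# Recompute a linear classifier's decision function outside sklearn.
def decision_function(coef, intercept, x):
    # dot(coef, x) + intercept, written out in pure Python
    return sum(c * v for c, v in zip(coef, x)) + intercept

coef = [0.5, -1.25, 2.0]   # hypothetical clf.coef_[0]
intercept = 0.1            # hypothetical clf.intercept_[0]
x = [1.0, 2.0, 0.5]

score = decision_function(coef, intercept, x)
label = 1 if score > 0 else 0
print(round(score, 6), label)  # -0.9 0
```

For batch prediction the per-call Python overhead Peter mentions dominates, which is a reason to vectorize this over many samples at once rather than call it per sample.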
Re: [Scikit-learn-general] Selective multiclass
This is strange indeed - since you said you're doing text classification I suppose X is sparse? Which format (csr, csc) and dtype (float64, float32) are you using? The coef matrix is allocated before the sub-processes are forked, so you will need (n_jobs + 1) * 12 GB just for the coefs. The SystemError is quite strange though... I would expect a MemoryError... Lars, do you have any thoughts on this? best, Peter

On 13.08.2013 22:10, A 4rk@gmail.com wrote: I have 64G of memory, so I do not think memory is the issue in this case. If the features are dense, the n_classes many coefficients of n_features are 12gb (if I haven't miss-calculated). - Correct, it occupies about 12.5G If they are for some reason all replicated for all cores, you would get into trouble. - Note that the same is the case with n_jobs=2,3,4; just to clarify, even without using all cores, if structures are replicated per core, the max available should be enough in this case at least (n_jobs=2,3,4), correct?
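The 12 GB figure and the (n_jobs + 1) replication can be sanity-checked with simple arithmetic. The class/feature counts below are hypothetical (the thread does not state them); any pair whose product is around 1.6e9 float64 values gives roughly 12 GiB.

```python
# Memory of a dense float64 coefficient matrix (n_classes x n_features), in GiB.
def coef_memory_gib(n_classes, n_features, itemsize=8):
    return n_classes * n_features * itemsize / 1024.0 ** 3

single = coef_memory_gib(1600, 1000000)   # hypothetical problem size
print(round(single, 2))  # 11.92

# With n_jobs=4, the parent plus 4 forked workers could each hold a copy:
n_jobs = 4
print(round((n_jobs + 1) * single, 1))  # 59.6
```

With 64 GB of RAM, n_jobs=4 would already be at the edge under this model, which fits the out-of-memory symptoms discussed above.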
Re: [Scikit-learn-general] PyStruct 0.1 released
Congrats Andy - looking forward to tinkering with it! On 11.08.2013 19:57, Andreas Mueller amuel...@ais.uni-bonn.de wrote: Hey everybody. I just wanted to spam the ML again and say I just released PyStruct 0.1. It contains structured support vector machines, structured perceptrons and models for multi-label prediction, graph labeling and sequence prediction. There are some examples on the website: http://pystruct.github.io/auto_examples/index.html You can now install it from the cheeseshop: pip install pystruct That should also give you ad3 and pyqpbo. You can then run the tests with nosetests pystruct Thanks to all the people who helped me make that happen :) Feedback, also on installation troubles, is very welcome! Cheers, Andy
Re: [Scikit-learn-general] Pystruct website and mailing list
2013/7/12 Andreas Mueller amuel...@ais.uni-bonn.de On 07/12/2013 01:26 AM, Robert Layton wrote: Structured prediction in sklearn was one of the outcomes from the survey. Would it be a better idea to send people to pystruct, rather than implement it here? I think so. We decided that structured prediction was out of scope for sklearn, right? I tried a simple approach for encoding the inputs - which is basically tuples of nd-arrays for each instance - but I'm not sure that will really scale. I might need custom classes to encode the input. Also, the project moves way faster than sklearn does currently. Rob Zinkov asked me when pystruct will be included in scikit-learn. My answer was: never ;) I think its much better to have it as a separate project - this way you can iron out the API much faster Of course you can try to convince me otherwise once pystruct is more mature, but I think the difference in target group and input format is quite big. Also, the project has a ton of requirements - we are working to make this more manageable but having cvxopt as a hard requirement is probably necessary. About naming it scikit-struct: is there any requirement to become a scikit? Also: is there much benefit - pandas seems to be doing quite well without the brand ;) totally agree Cheers, Andy

-- Peter Prettenhofer
Start your free trial of AppDynamics Pro today! http://pubads.g.doubleclick.net/gampad/clk?id=48808831iu=/4140/ostg.clktrk___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] Question about using sample weights to fit an svm
Hi Anne, I would also expect that using uniform weights should result in the same solution as no weights -- but maybe there is an interaction with the C parameter... for this we would need to know more about the internals of libsvm and how it handles sample weights - try scaling C by ``len(y_train)`` and see what you get :-) PS: if you use the linear svm implemented by SGDClassifier(loss='hinge') you would also get this effect that uniform weights scale the regularization parameter. best, Peter 2013/7/12 Anne Dwyer anne.p.dw...@gmail.com I have been using the sonar data set (I believe this is a sample data set used in many demonstrations of machine learning). It is a two-class data set with 60 features and 208 training examples. I have a question about using sample weights in fitting the SVM model. When I fit the model using scaled data, I get a test error of 10.3%. When I fit the model using a sample weight vector of 1/N, I get a test error of 37%. Here is the code:

w = np.ones(len(y_train))
clf = svm.SVC(kernel='rbf', C=10, gamma=.01)
clf.fit(x_tr_scaled, y_train)
score_scaled_tr = clf.score(x_tr_scaled, y_train)
score_scaled_test = clf.score(x_te_scaled, y_test)
w = w / sum(w)
clf1 = svm.SVC(kernel='rbf', C=10, gamma=.01, probability=True)
clf1.fit(x_tr_scaled, y_train, sample_weight=w)
print "Training score with sample weights is", clf1.score(x_tr, y_train)
print "Score with sample weights is", clf1.score(x_te_scaled, y_test)

What am I doing wrong here? Also, when I tried this command: Pr = predict_proba(x_tr_scaled) I get the error that predict_proba is an undefined name. However, I got it from this link: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC Any help would be appreciated. Anne Dwyer -- Peter Prettenhofer
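Peter's hunch about an interaction with C can be made concrete. In the weighted SVM objective each sample's hinge loss is penalized by C * w_i, so uniform weights of 1/N act exactly like shrinking C to C/N. A minimal sketch of that objective (hand-rolled for illustration; `svm_objective` and its arguments are not sklearn/libsvm API):

```python
# Sketch of a weighted SVM primal objective: per-sample penalty is C * w_i.
# With uniform weights w_i = 1/N this reduces to unit weights and C/N,
# which is why w = np.ones(N)/N does not reproduce the unweighted fit.
def svm_objective(theta, X, y, C, w):
    reg = 0.5 * sum(t * t for t in theta)  # L2 regularizer
    hinge = sum(
        wi * max(0.0, 1.0 - yi * sum(t * xi for t, xi in zip(theta, x)))
        for x, yi, wi in zip(X, y, w)
    )  # weighted hinge losses
    return reg + C * hinge
```

Under this formulation, `svm_objective(theta, X, y, 10, [1/N]*N)` equals `svm_objective(theta, X, y, 10/N, [1]*N)` for any `theta`, so the 1/N-weighted model is effectively a far more strongly regularized SVM.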
Re: [Scikit-learn-general] Question about using sample weights to fit an svm
2013/7/12 Peter Prettenhofer peter.prettenho...@gmail.com [...] 2013/7/12 Anne Dwyer anne.p.dw...@gmail.com [...] Also, when I tried this command: Pr = predict_proba(x_tr_scaled) I get the error that predict_proba is an undefined name. However, I got it from this link: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC You forgot the object:

Pr = clf1.predict_proba(x_tr_scaled)

Any help would be appreciated. Anne Dwyer -- Peter Prettenhofer
Re: [Scikit-learn-general] Question about using sample weights to fit an svm
try float(len(y_train)) - seems like the division is done on ints, so C ends up as 0... On 13.07.2013 00:10, Anne Dwyer anne.p.dw...@gmail.com wrote: Peter, Thanks for your answers. When I scale C by len(y_train), I get the following error: ValueError: C <= 0 Anne Dwyer On Fri, Jul 12, 2013 at 3:34 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: [...]
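The ValueError is a Python 2 integer-division artifact: 10 / 208 truncates to 0 before SVC ever sees it. A quick sketch (using // to reproduce Python 2's behaviour under Python 3; the numbers match Anne's C=10 and 208 samples):

```python
# In Python 2, `/` on two ints truncates, so C = 10 / len(y_train) was 0
# and SVC rejected it; float(len(y_train)) restores true division.
C, n_samples = 10, 208
py2_division = C // n_samples        # what 10 / 208 evaluated to in Python 2
fixed = C / float(n_samples)         # the suggested fix: a small positive C
```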
Re: [Scikit-learn-general] Paris Sprint location
I plan on merging some of the GBRT PRs and praising Gilles' new decision tree implementation. 2013/7/11 Lars Buitinck l.j.buiti...@uva.nl 2013/7/11 Mathieu Blondel math...@mblondel.org: What is everyone planning to work on? Just curious :) Py3 was my aim, but that seems to be almost tackled, so I guess I'll concentrate on getting my proposed scorer API into master. I might want to try my hand at implementing quadratic features in FeatureHasher. -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam -- Peter Prettenhofer
Re: [Scikit-learn-general] Extremely poor SVM performance
What is actually quite interesting is that the worst model has an AUC of 0.29, which is actually an AUC of 0.71 if you invert the predictions. 2013/7/8 Olivier Grisel olivier.gri...@ensta.org Alternatively you can use `score_func=f1_score` in 0.13 to look for models that trade off precision and recall on unbalanced datasets. -- Olivier -- This SF.net email is sponsored by Windows: Build for Windows Store. http://p.sf.net/sfu/windows-dev2dev ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -- Peter Prettenhofer
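The inversion trick works because flipping a score ranking turns every incorrectly ordered positive/negative pair into a correctly ordered one: with no tied scores, AUC(-s) = 1 - AUC(s). A tiny pairwise AUC to see this (illustrative; not sklearn's roc_auc_score):

```python
# AUC as the fraction of positive/negative pairs ranked correctly.
# Negating the scores reverses every pair, so the AUC flips to 1 - AUC
# (assuming no tied scores).
def pairwise_auc(y_true, scores):
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / float(len(pos) * len(neg))
```

So a classifier scoring 0.29 is not useless at all: it ranks the classes consistently, just backwards.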
Re: [Scikit-learn-general] RandomForests - where do we select a subset of features during fitting?
Hi Ian, 2013/7/7 Ian Ozsvald i...@ianozsvald.com Hi all. I'm following the RandomForest code (in dev, from a one-week-old checkout). As I understand it (and similar to the previous post - I have some RF usage experience but nothing fundamental), RF uses a weighted sample of examples to learn *and* a random subset of features when building its decision trees. Correct - although weighted samples are optional - usually RF takes a bootstrap sample, and this is implemented via sample_weights (e.g. a sample that is picked twice for the bootstrap has weight 2.0). Does the scikit-learn implementation use a random subset of features? I've followed the code in forest.py and I can't find where the choice might be made. I haven't looked at the C code for the DecisionTree. It's in the implementation of DecisionTree - see sklearn/tree/_tree.pyx - look for the for loop over ``features``. I'm interested to learn the lower bound of the number of random features that can be chosen. Could you elaborate on that? I'm also curious to understand where we can restrict the depth of the RandomForest classifier. All I can see is that in forest.py the constructor takes but ignores the max_depth argument:

class RandomForestClassifier(ForestClassifier):
    ...
    def __init__(self, n_estimators=10, criterion="gini", max_depth=None, ...):
        super(RandomForestClassifier, self).__init__(
            base_estimator=DecisionTreeClassifier(), ...)

base.py._make_estimator just clones the existing base_estimator. Am I missing something? After cloning it calls ``set_params`` with ``estimator_params`` - ``'max_depth'`` is one of those. best, Peter Thanks for listening, Ian. -- Ian Ozsvald (A.I. researcher) i...@ianozsvald.com http://IanOzsvald.com http://MorConsulting.com/ http://Annotate.IO http://SocialTiesApp.com/ http://TheScreencastingHandbook.com http://FivePoundApp.com/ http://twitter.com/IanOzsvald http://ShowMeDo.com -- Peter Prettenhofer
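To make the feature-subsetting step concrete: at each candidate split the tree draws a random subset of max_features feature indices and only evaluates those; the lower bound Ian asks about is a single feature per split (max_features=1). A sketch of the idea (illustrative only; the real loop lives in sklearn/tree/_tree.pyx, and `candidate_features` is a made-up helper):

```python
import random

# At each split, sample max_features distinct feature indices to evaluate;
# max_features can go as low as 1, giving maximally randomized splits.
def candidate_features(n_features, max_features, rng):
    return rng.sample(range(n_features), max_features)

rng = random.Random(0)
subset = candidate_features(60, 8, rng)  # e.g. ~sqrt(60) features per split
```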
Re: [Scikit-learn-general] Questions for plot_forest_iris.py and AdaBoost
2013/7/7 Ian Ozsvald i...@ianozsvald.com Hi all. I have a couple of questions about the demo image for the AdaBoost classifier in the dev branch: http://scikit-learn.org/dev/auto_examples/ensemble/plot_forest_iris.html I've worked through the underlying code and I understand what's being plotted, but I think the AdaBoost example (final column) is in error. I figured checking my reasoning made sense before filing a bug report (I have some possible patches too). The first column is for a DecisionTree (with no limit on tree depth); the plot makes sense. The second and third columns are for a RandomForest and an ExtraTrees classifier (with DecisionTrees with no depth limit). The plots for columns 2 and 3 are made by iterating over the 30 classifiers and plotting each decision surface with an alpha of 0.1. The fourth column is for an AdaBoost classifier using a DecisionTree with no limit on max depth. The plots in this column don't look right - the red regions clearly encompass where the yellow dots are drawn (this is particularly obvious in the bottom-right plot). The problem is that the weights for the ensemble of classifiers in AdaBoost aren't taken into account; I believe the alpha value for the plot should use these weights. This raises another problem, but let me check first - does my logic (weights being required for the plot to make sense) sound ok? I think you are correct - we should definitely fix that - let's create an issue for that. Checking clf.score (and calling clf.predict in the yellow regions) shows that the underlying classifications are correct (in the yellow regions with AdaBoost the yellow class is chosen). I'm pretty confident it is just the display that's in error. I guess possibly the display is meant to force the user to question why the classifications look wrong and to reason about the weights in AdaBoost, but I'm probably overthinking this! Regards, Ian. -- Ian Ozsvald (A.I. researcher) i...@ianozsvald.com http://IanOzsvald.com http://MorConsulting.com/ http://Annotate.IO http://SocialTiesApp.com/ http://TheScreencastingHandbook.com http://FivePoundApp.com/ http://twitter.com/IanOzsvald http://ShowMeDo.com -- Peter Prettenhofer
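A sketch of the proposed fix: instead of a constant alpha of 0.1, scale each estimator's overlay by its AdaBoost weight. In sklearn the fitted weights are available as `estimator_weights_`; the values below are made up for illustration:

```python
# Hypothetical per-estimator AdaBoost weights (stand-in for
# clf.estimator_weights_); normalizing them yields plot alphas that
# reflect each tree's actual contribution to the weighted vote.
estimator_weights = [2.0, 1.5, 0.5, 1.0]
total = sum(estimator_weights)
alphas = [w / total for w in estimator_weights]  # normalized to sum to 1
```

With constant alpha, a low-weight tree that paints a large red region dominates the picture even though it barely influences the combined prediction, which matches the mismatch Ian describes.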
Re: [Scikit-learn-general] Questions for plot_forest_iris.py and AdaBoost
Issue is here: https://github.com/scikit-learn/scikit-learn/issues/2133 2013/7/7 Peter Prettenhofer peter.prettenho...@gmail.com [...] -- Peter Prettenhofer
Re: [Scikit-learn-general] Meaning of l1_ratio in SGDRegressor
Andy, can you comment on this? Seems like the l1_ratio is indeed not correct - the code is a bit confusing since we rename rho -> l1_ratio -> rho again... We should open an issue for that. 2013/7/2 Mark Levy mark.l...@mendeley.com Hi there, In the docstring of SGDRegressor it says l1_ratio=0 corresponds to the L2 penalty and l1_ratio=1 to L1. But looking at the implementation, self.l1_ratio is passed as the value of the rho argument to plain_sgd(), and there I see:

if penalty_type == L2:
    rho = 1.0
elif penalty_type == L1:
    rho = 0.0

Is there some confusion here, aside from in my head? Thanks! Mark -- Peter Prettenhofer
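For reference, the penalty the SGDRegressor docstring describes is the elastic net; written out, l1_ratio=1 is pure L1 and l1_ratio=0 pure L2, which is why the opposite rho mapping in plain_sgd looks inverted. A sketch of the documented formula (not the Cython implementation):

```python
# Elastic net penalty as documented: l1_ratio interpolates between
# pure L2 (l1_ratio=0) and pure L1 (l1_ratio=1).
def elastic_net_penalty(w, l1_ratio):
    l1 = sum(abs(x) for x in w)   # L1 term
    sq = sum(x * x for x in w)    # squared L2 term
    return l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * sq
```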
Re: [Scikit-learn-general] Adding Sparse Autoencoder to Scikit
I strongly recommend reading Jake's blog entries on Cython (memoryviews in particular) [1] and Wes' blog [2],[3]. Another great resource is the ball_tree.pyx code in /sklearn/neighbors/ball_tree.pyx. When you compile the pyx file to C using cython, you should use the flag -a - it will generate an html file that shows what C code has been generated for the corresponding Cython statements. best, Peter [1] http://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/ [2] http://wesmckinney.com/blog/?p=215 [3] http://wesmckinney.com/blog/?p=215 2013/6/26 Robert Layton robertlay...@gmail.com The basics of Cython are, and I'm not kidding here, quite easy to learn. Steps: 1) Rename the .py file to .pyx 2) Put int in front of all declarations that will be integers, float in front of things that are floats. (If you know Java/C/C++ etc., this will feel really natural) 3) Compile with cython - *cython filename.pyx* 4) Done. After that, it gets slightly more complicated -- i.e. importing properly and using cdef etc. I can never remember the method for numpy arrays, but Google helps with that. Good luck! On 26 June 2013 03:27, Issam issamo...@gmail.com wrote: Very helpful information! Thanks @Olivier! I'll do my best! -- Public key at: http://pgp.mit.edu/ Search for this email address and select the key from 2011-08-19 (key id: 54BA8735) -- Peter Prettenhofer
Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding
You already use one-hot encoding in your example (preprocessing.OneHotEncoder). 2013/6/21 Maheshakya Wijewardena pmaheshak...@gmail.com Can anyone give me a sample algorithm for one-hot encoding used in scikit-learn? On Thu, Jun 20, 2013 at 8:37 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: You can try an ordinal encoding instead - just map each categorical value to an integer so that you end up with 8 numerical features - if you use enough trees and grow them deep, it may work. 2013/6/20 Maheshakya Wijewardena pmaheshak...@gmail.com And yes Gilles, it is the Amazon challenge :D On Thu, Jun 20, 2013 at 8:21 PM, Maheshakya Wijewardena pmaheshak...@gmail.com wrote: The shape of X after encoding is (32769, 16600). Seems as if that is too big to be converted into a dense matrix. Can Random Forest handle this number of features? On Thu, Jun 20, 2013 at 7:31 PM, Olivier Grisel olivier.gri...@ensta.org wrote: 2013/6/20 Lars Buitinck l.j.buiti...@uva.nl: 2013/6/20 Olivier Grisel olivier.gri...@ensta.org: Actually twice as much, even on a 32-bit platform (float size is always 64 bits). The decision tree code always uses 32-bit floats: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38 but you have to cast your data to `dtype=np.float32` in Fortran layout ahead of time to avoid the memory copy. OneHot produces np.float, though, which is float64. Alright, but you could convert it to np.float32 before calling toarray. But anyway, this kind of sparsity level is unsuitable for random forests anyways, I think. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding
Hi, it seems like your sparse matrix is too large to be converted to a dense matrix. What shape does X have? How many categorical variables do you have (before applying the OneHotEncoder)?
Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding
You can try an ordinal encoding instead - just map each categorical value to an integer so that you end up with 8 numerical features - if you use enough trees and grow them deep, it may work. 2013/6/20 Maheshakya Wijewardena pmaheshak...@gmail.com [...] -- Peter Prettenhofer
Re: [Scikit-learn-general] test failed after installaing scikit
Could it be that the folder you're in (~/scikit-learn) contains the scikit-learn sources? 2013/6/6 linxpwww linxp...@163.com All, in my Ubuntu (uname -a): Linux ubuntu 3.2.0-29-generic-pae #46-Ubuntu SMP Fri Jul 27 17:25:43 UTC 2012 i686 i686 i386 GNU/Linux, after installing scikit-learn from the source package following https://pypi.python.org/pypi/scikit-learn/ , running 'nosetests --exe sklearn' gives the following error:

root@ubuntu:~/scikit-learn# nosetests --exe sklearn
E
==
ERROR: Failure: ImportError (No module named _check_build
___
Contents of /root/scikit-learn/sklearn/__check_build:
_check_build.pyx  setup.pyc  __init__.py  _check_build.c  __init__.pyc  setup.py
___
It seems that scikit-learn has not been built correctly. If you have installed scikit-learn from source, please do not forget to build the package before using it: run `python setup.py install` or `make` in the source directory. If you have used an installer, please check that it is suited for your Python version, your operating system and your platform.)
--
Traceback (most recent call last):
  File /usr/lib/python2.7/dist-packages/nose/loader.py, line 390, in loadTestsFromName
    addr.filename, addr.module)
  File /usr/lib/python2.7/dist-packages/nose/importer.py, line 39, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File /usr/lib/python2.7/dist-packages/nose/importer.py, line 86, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File /root/scikit-learn/sklearn/__init__.py, line 31, in module
    from . import __check_build
  File /root/scikit-learn/sklearn/__check_build/__init__.py, line 46, in module
    raise_build_error(e)
  File /root/scikit-learn/sklearn/__check_build/__init__.py, line 41, in raise_build_error
    %s % (e, local_dir, ''.join(dir_content).strip(), msg))
ImportError: No module named _check_build
[...]
--
Ran 1 test in 0.001s
FAILED (errors=1)

There were no errors during building and installing - could you help me? Thanks, Aaron -- How ServiceNow helps IT people transform IT departments: 1. A cloud service to automate IT design, transition and operations 2. Dashboards that offer high-level views of enterprise services 3. A single system of record for all IT processes http://p.sf.net/sfu/servicenow-d2d-j ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -- Peter Prettenhofer
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Christian, I believe more in my results than in my expertise - and so should you :-) ** I think you misunderstood me: I did not claim that one-hot encoded categorical features give better results than ordinal encoded ones - I just claimed that ordinal encoding works as well as one-hot encoded features given that you have deep enough trees. But I have to warn you: I cannot support my claim with (sufficient) data. So at the end of the day, it's always best to run an experiment and test it on your problem at hand. Anyway, I cannot really see your problem (or what you did wrong): according to your description it seems that the specific encoding (one-hot vs. ordinal) has no influence on the effectiveness of the model (no significant difference)? This is in line with observations by others. Andy raised a very important point though: if you optimized your hyperparameters (tree depth, min split size, ...) on the ordinal encoding and then tested those hyperparameters on a one-hot encoding, you are giving an advantage to the ordinal encoding. HTH, Peter ** that being said, I'm still quite skeptical when it comes to my results 2013/6/4 Christian Jauvin cjau...@gmail.com Many thanks to all for your help and detailed answers, I really appreciate it. So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form. So from the same dataset I mentioned earlier, I picked another subset of 5 features, this time all with small cardinality (5, 5, 6, 11 and 12), and all purely categorical (i.e. clearly not ordered). The one-hot encoding should clearly help with such a configuration. But again, what I observe when I pit the fully one-hot encoded RF (21000 x 39) against the ordinal-encoded one (21000 x 5) is that they're behaving almost the same, in terms of accuracy and AUC, with 10-fold cross-validation.
In fact, the ordinal version even seems to perform very slightly better, although I don't think it's significant. I really believe in your expertise more than in my results, so what could I be doing wrong? On 3 June 2013 04:56, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/03/2013 09:15 AM, Peter Prettenhofer wrote: Our decision tree implementation only supports numerical splits; i.e. it tests ``val <= threshold``. Categorical features need to be encoded properly. I recommend one-hot encoding for features with small cardinality (e.g. < 50) and ordinal encoding (simply assign each category an integer value) for features with large cardinality. This seems to be the opposite of what the kaggle tutorial suggests, right? They suggest ordinal encoding for small cardinality, but don't suggest any other way. Your and Gilles' feedback makes me think we should tell the kaggle people to change their tutorial. -- Get 100% visibility into Java/.NET code with AppDynamics Lite It's a free troubleshooting tool designed for production Get down to code-level detail for bottlenecks, with 2% overhead. Download for free and get started troubleshooting in minutes. http://p.sf.net/sfu/appdyn_d2d_ap2 -- Peter Prettenhofer
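The thread's takeaway - just try both encodings on your own data - can be sketched as a toy experiment. Everything below (cardinalities, sample size, the synthetic label) is invented for illustration and is not Christian's actual dataset:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 500
# three purely categorical features with small cardinality, ordinal-encoded
X_ord = rng.randint(0, 5, size=(n, 3))
# synthetic label that depends on a single category of the first feature
y = (X_ord[:, 0] == 2).astype(int)

# same data, one-hot encoded (sparse indicator matrix)
X_hot = OneHotEncoder().fit_transform(X_ord)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
acc_ord = cross_val_score(rf, X_ord, y, cv=5).mean()
acc_hot = cross_val_score(rf, X_hot, y, cv=5).mean()
print(acc_ord, acc_hot)
```

With deep enough trees the ordinal version can isolate category 2 with two threshold splits, which is why the two scores tend to come out very close, matching Christian's observation.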
Re: [Scikit-learn-general] ROC for OneClassSVM
Hi Carlos, take a look at the species distribution example [1]. Summary: use ``OneClassSVM.decision_function`` - you don't necessarily need probabilities for ROC/AUC - confidence values are fine. best, Peter [1] http://scikit-learn.org/stable/auto_examples/applications/plot_species_distribution_modeling.html#example-applications-plot-species-distribution-modeling-py 2013/5/7 ctme...@unizar.es OK, thank you. I will do it that way. Carlos Quoting scikit-learn-general-requ...@lists.sourceforge.net: Today's Topics: 1. Re: ROC for OneClassSVM (Andreas Mueller) -- Message: 1 Date: Mon, 06 May 2013 12:33:03 +0200 From: Andreas Mueller amuel...@ais.uni-bonn.de Subject: Re: [Scikit-learn-general] ROC for OneClassSVM To: scikit-learn-general@lists.sourceforge.net On 05/06/2013 12:27 PM, ctme...@unizar.es wrote: Hello, I would like to use OneClassSVM for novelty detection. I have some 'normal' data for fitting the classifier. Then I have 'normal' and 'abnormal' data for testing the performance. I would like to use the area under the ROC curve as the figure of merit of the detector. The function roc_curve needs the predicted probability. I have read that the probability can be obtained if the classifier is fitted with the parameter probability=True. However, I get an error when I try to pass this parameter. I am using version 0.10 of sklearn. For instance:

import sklearn
import sklearn.metrics
import scipy
import sklearn.svm
X = scipy.random.randn(100, 2)
X_train = scipy.r_[X + 2, X - 2]
clf = sklearn.svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1, probability=True)

Then I get an error. I have also tried

clf = sklearn.svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train, probability=True)

but that is again an error. Is that option available for OneClassSVM? If not, how could I draw the ROC?
Could I sweep a threshold on the distance to the hyperplane given by clf.decision_function? Yes, I think this is what you should do. Hth, Andy -- Learn Graph Databases - Download FREE O'Reilly Book Graph Databases is the definitive new guide to graph databases and their applications. This 200-page book is written by three acclaimed leaders in the field. The early access version is available now. Download your free book today! http://p.sf.net/sfu/neotech_d2d_may -- Peter Prettenhofer
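Sweeping a threshold on ``decision_function`` is exactly what the ROC machinery does for you. A minimal sketch with a current scikit-learn (the 'normal'/'abnormal' data and the hyperparameters below are invented for illustration, not Carlos's setup):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                      # 'normal' data only, for fitting
X_normal = rng.randn(100, 2)                     # held-out 'normal' test data
X_abnormal = rng.uniform(-6, 6, size=(100, 2))   # 'abnormal' test data

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)

X_test = np.vstack([X_normal, X_abnormal])
y_true = np.r_[np.ones(100), np.zeros(100)]      # 1 = normal, 0 = abnormal

# signed distance to the decision boundary -- no probabilities needed
scores = clf.decision_function(X_test).ravel()
auc = roc_auc_score(y_true, scores)
print(auc)
```

``roc_curve(y_true, scores)`` would likewise accept the confidence values directly to draw the curve itself.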
Re: [Scikit-learn-general] Better prediction probabilities with SVM
2013/5/7 Lars Buitinck l.j.buiti...@uva.nl 2013/5/7 Peter Prettenhofer peter.prettenho...@gmail.com: Do you need probabilities? You could just use the signed distance to each OVA hyperplane (via ``clf.decision_function()``) to rank the classes. Maybe the Platt scaling screws up here... The more I find out about Platt scaling in LibSVM, the more I'm inclined to stay away from it. You could also look at Mathieu's lightning project https://github.com/mblondel/lightning - it features multinomial logistic regression which might give better calibrated probabilities than Platt scaling... Or our own LogisticRegression. It cuts some corners, but sometimes it's good enough. Right, it should give you the same ordering as ``decision_function`` (just normalized). -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam -- Peter Prettenhofer
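Lars's closing remark - that LogisticRegression's probabilities induce the same class ranking as its ``decision_function`` - is easy to verify, since ``predict_proba`` is a row-wise monotone transform (softmax/sigmoid plus normalization) of the raw scores. A quick sanity check on iris with a current scikit-learn (not from the original thread):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

df = clf.decision_function(X)   # raw per-class scores
proba = clf.predict_proba(X)    # normalized probabilities

# the class ranking induced by the two is identical, row by row
same = np.all(np.argsort(df, axis=1) == np.argsort(proba, axis=1))
print(same)
```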
Re: [Scikit-learn-general] GSoC 2013 : Multinomial Logistic Regression
2013/5/2 Mathieu Blondel math...@mblondel.org On Thu, May 2, 2013 at 5:21 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: this looks pretty awesome - especially the dataset abstraction is pretty neat - would be great if we could merge this into scikit-learn. Merging the dataset abstraction would be nice. We could port some of scikit-learn's code to it, including SGD and mini-batch k-means. The neural network PR by Lars could also benefit from it. totally agree - I can raise this issue and work on it at the sprint - shouldn't take too long - we would need to port SGD first anyway. BTW, do you think we should keep the weight vector abstraction which is in scikit-learn? The idea behind the abstraction was to implement averaged SGD/Perceptron easily - I didn't finish the PR though... So I guess the answer is: no. btw: what kind of truncated gradient algorithm does lightning use for L1-penalized SGD? As far as I can see it's not the one that's currently used in SGDClassifier... It's the regular truncated SGD by John Langford, which is identical to the method described in the FOBOS paper. Compared to the one in scikit-learn, it is more theoretically correct. The one in scikit-learn obtains sparser weight vectors in practice but has no theoretical justification (it's a heuristic). My goal was to compare coordinate descent with regular truncated/projected SGD, so I didn't implement this heuristic. ok - probably better to use this one (or the projection-based method by Duchi) - on the other hand, the Tsuruoka et al. method served me quite well in the past. thx, Peter Mathieu -- Introducing AppDynamics Lite, a free troubleshooting tool for Java/.NET Get 100% visibility into your production application - at no cost. Code-level diagnostics for performance bottlenecks with 2% overhead Download for free and get started troubleshooting in minutes. http://p.sf.net/sfu/appdyn_d2d_ap1 -- Peter Prettenhofer
Re: [Scikit-learn-general] Effects of shifting and scaling on Gradient Descent
learning toolkit. Gradient descent is a general class of optimization algorithms. Gaël -- Try New Relic Now We'll Send You this Cool Shirt New Relic is the only SaaS-based application performance monitoring service that delivers powerful full stack analytics. Optimize and monitor your browser, app, servers with just a few lines of code. Try New Relic and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr -- Peter Prettenhofer
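As a concrete illustration of the thread's topic - why shifting and scaling matter for gradient descent - features on wildly different scales make the stochastic gradient updates dominated by the large-scale columns. A sketch with invented scale factors, assuming a current scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X = X * np.logspace(0, 4, 10)   # give the features wildly different scales

raw = SGDClassifier(max_iter=1000, random_state=0)
scaled = make_pipeline(StandardScaler(),
                       SGDClassifier(max_iter=1000, random_state=0))

acc_raw = cross_val_score(raw, X, y, cv=5).mean()
acc_scaled = cross_val_score(scaled, X, y, cv=5).mean()
print(acc_raw, acc_scaled)
```

Standardizing inside a pipeline (rather than once on the full data) also keeps the scaling statistics from leaking across cross-validation folds.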
Re: [Scikit-learn-general] Distributed RandomForests
Hi Youssef, please make sure that you use the latest version of sklearn (>= 0.13) - we did some enhancements to the sub-sampling procedure lately. Looking at the RandomForest code - it seems that n_jobs=-1 should not be the issue for the parallel training of the trees, since ``n_jobs = min(cpu_count(), self.n_estimators)``, which should be just 3 in your case; however, it will use cpu_count() processes to sort the feature values - so the bottleneck might be here. Please try to set the n_jobs parameter to a smaller constant (e.g. 4) and check if it works better. Having said that: 1E8 samples is pretty large - the largest dataset that I've used so far was merely 1E6, but I've heard that people have used it for larger datasets too (probably not 1E8 though). Running the code on a cluster using IPython parallel should not be too hard - RF is a pretty simple algorithm - you could either patch the existing code to use IPython parallel instead of joblib.Parallel (see forest.py) or simply write your own RF code which directly uses ``DecisionTreeClassifier``. Also, you can likely skip bootstrapping - it doesn't help much IMHO and can make the implementation a bit more involved - AFAIK the MSR guys didn't use bootstrapping for their Kinect RF system... When it comes to other implementations you could look at rt-rank [1], which is a parallel implementation of both GBRT and RF; and WiseRF [2], which is compatible with sklearn but you have to obtain a license (free trial and academic version AFAIK). HTH, Peter [1] https://sites.google.com/site/rtranking/ [2] http://about.wise.io/ On 25.04.2013 03:22, Youssef Barhomi youssef.barh...@gmail.com wrote: Hello, I am trying to reproduce the results of this paper: http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with different kinds of data (monkey depth maps instead of humans). So I am generating my depth features and training and classifying data with a random forest with quite similar parameters to the paper.
I would like to use sklearn.ensemble.RandomForestClassifier with 1E8 samples with 500 features. Since it seems to be a large dataset of feature vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6 samples), and the last one seemed to be slower than O(n_samples*n_features*log(n_samples)) according to this: http://scikit-learn.org/stable/modules/tree.html#complexity Since 1E6 samples are taking a long time and I don't know when they will be done, I would like better ways to estimate the ETA or find a way to speed up the training. Also, I am watching my memory usage and I don't seem to be swapping (29GB/48GB being used right now). The other thing is that I requested n_jobs = -1 so it could use all cores of my machine (24 cores), but looking at my CPU usage, it doesn't seem to be using any of them... So, do you guys have any ideas on: - would 1E8 samples be doable with your implementation of random forests (3 trees, 20 levels deep)? - running this code on a cluster using different IPython engines? or would that require a lot of work? - PCA for dimensionality reduction? (on the paper, they haven't used any dim reduction, so I am trying to avoid that) - other implementations that I could use for large datasets? PS: I am very new to this library but I am already impressed!! It's one of the cleanest and probably most intuitive machine learning libraries out there, with pretty impressive documentation and tutorials. Pretty amazing work!!
Thank you very much, Youssef

### Here is a code snippet:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
import time
import numpy as np

n_samples = 1000
n_features = 500
X, y = make_classification(n_samples, n_features, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1)
clf = RandomForestClassifier(max_depth=20, n_estimators=3, criterion='entropy', n_jobs=-1, verbose=10)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
tic = time.time()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print 'Time taken:', time.time() - tic, 'seconds'

-- Youssef Barhomi, MSc, MEng. Research Software Engineer at the CLPS department Brown University T: +1 (617) 797 9929 | GMT -5:00
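Peter's suggestion to write your own RF code directly on top of ``DecisionTreeClassifier`` (skipping bootstrapping, so each tree fit is an independent function call that could be shipped to a separate IPython parallel engine) might look roughly like this. The dataset and the ``fit_tree`` helper are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# a small stand-in for the full 1E8 x 500 dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def fit_tree(seed):
    # no bootstrapping: randomness comes only from feature sub-sampling
    tree = DecisionTreeClassifier(max_depth=20, criterion="entropy",
                                  max_features="sqrt", random_state=seed)
    return tree.fit(X, y)

# locally this is a list comprehension; on a cluster each call could be
# submitted to a different engine (e.g. via IPython parallel's map)
trees = [fit_tree(seed) for seed in range(3)]   # 3 trees, as in the paper

# majority vote over the individual trees
votes = np.mean([t.predict(X) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
train_acc = (y_pred == y).mean()
print(train_acc)
```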
Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data
I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forests in particular are not a good fit for sparse datasets - if the average number of non-zero values for each feature is low, then your partitions will be relatively small - any subsequent splits will make the partitions even smaller, thus you cannot grow your trees deep since you will run out of samples. This means that your tree in fact uses just a tiny fraction of the available features (compared to a deep tree) - unless you have a few pretty strong features or you train lots of trees, this won't work out. This is probably also the reason why most of the decision tree work in natural language processing is done using boosted decision trees of depth one. If your features are boolean, then such a model is in fact pretty similar to a simple logistic regression model. I have the impression that Random Forest in particular is a poor evidence accumulator (pooling evidence from lots of weak features) - linear models and boosted trees are much better here. best, Peter 2013/4/24 Brian Holt bdho...@gmail.com At the moment your three options are 1) get more memory 2) do feature selection - 400k features on 200k samples seems to me to contain a lot of redundant information or irrelevant features 3) submit a PR to support sparse matrices - this is going to be a lot of work and I doubt it's worth it. All the best Brian On Apr 24, 2013 5:14 AM, Calvin Morrison mutanttur...@gmail.com wrote: get more memory? On 23 April 2013 17:06, Alex Kopp ark...@cornell.edu wrote: Hi, I am looking to build a random forest regression model with a pretty large amount of sparse data. I noticed that I cannot fit the random forest model with a sparse matrix. Unfortunately, a dense matrix is too large to fit in memory. What are my options?
For reference, I have just over 400k features and just over 200k training examples. -- Peter Prettenhofer
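Brian's option 2), feature selection, can be sketched as follows. The sparse count data below is invented, standing in for the real 200k x 400k matrix; ``chi2`` assumes non-negative features and a classification target (for Alex's regression problem one would use e.g. ``f_regression`` instead):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
# sparse non-negative count data standing in for the real matrix
X = sparse.csr_matrix(rng.poisson(0.05, size=(1000, 2000)).astype(np.float64))
y = rng.randint(0, 2, size=1000)

# keep only the k features most associated with the target
selector = SelectKBest(chi2, k=100)
X_small = selector.fit_transform(X, y)
print(X.shape, "->", X_small.shape)
```

The selector works on the sparse matrix directly, so the reduced data can stay sparse until it is small enough to densify for the forest.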
Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data
2013/4/24 Olivier Grisel olivier.gri...@ensta.org 2013/4/24 Peter Prettenhofer peter.prettenho...@gmail.com: I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forests in particular are not a good fit for sparse datasets - if the average number of non-zero values for each feature is low, then your partitions will be relatively small - any subsequent splits will make the partitions even smaller, thus you cannot grow your trees deep since you will run out of samples. This means that your tree in fact uses just a tiny fraction of the available features (compared to a deep tree) - unless you have a few pretty strong features or you train lots of trees, this won't work out. This is probably also the reason why most of the decision tree work in natural language processing is done using boosted decision trees of depth one. If your features are boolean, then such a model is in fact pretty similar to a simple logistic regression model. I have the impression that Random Forest in particular is a poor evidence accumulator (pooling evidence from lots of weak features) - linear models and boosted trees are much better here. Very interesting consideration. Any reference paper to study this in more detail (both theory and empirical validation)? actually, no - just gut feeling based on how decision trees / RF work (hard non-intersecting partitions) - I will try to dig something up - would definitely like to hear any critics/remarks on my view though. Also do you have a good paper that demonstrates state-of-the-art results with boosted stumps for NLP?
I haven't seen any use of boosted stumps in NLP for a while - but maybe I didn't pay close attention - what comes to my mind is some work by Xavier Carreras on NER for CoNLL 2002 (see [1] for an overview of the shared task - actually, a number of participants used boosting/trees). Joseph Turian used boosting in his thesis on parsing [2]. [1] http://acl.ldc.upenn.edu/W/W02/W02-2024.pdf [2] http://cs.nyu.edu/web/Research/Theses/turian_joseph.pdf -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data
Have you tried tuning the hyper-parameters of the SGDRegressor? You really need to tune the learning rate for SGDRegressor (SGDClassifier has a pretty decent default). E.g. set up a grid search w/ a constant learning rate and try different values of eta0 ([0.1, 0.01, 0.001, 0.0001]). You can also set verbose=3 to see the loss after each epoch, which you can use to check the convergence. 2013/4/24 Alex Kopp ark...@cornell.edu Thanks, guys. Perhaps I should explain what I am trying to do and then open it up for suggestions. I have 203k training examples, each with 457k features. The features are composed of one-hot encoded categorical values as well as stemmed, TF-IDF weighted unigrams and bigrams (NLP). As you can probably guess, the overwhelming majority of the features are the unigrams and bigrams. In the end, I am looking to build a regression model. I have tried a grid search on SGDRegressor, but have not had any promising results (~0.00 or even negative R^2 values). I would appreciate ideas/suggestions. Thanks ps, if it matters, I have 8 cores and 52gb ram at my disposal. On Wed, Apr 24, 2013 at 5:32 AM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: 2013/4/24 Olivier Grisel olivier.gri...@ensta.org 2013/4/24 Peter Prettenhofer peter.prettenho...@gmail.com: I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forests in particular are not a good fit for sparse datasets - if the average number of non-zero values for each feature is low, then your partitions will be relatively small - any subsequent splits will make the partitions even smaller, thus you cannot grow your trees deep since you will run out of samples.
This means that your tree in fact uses just a tiny fraction of the available features (compared to a deep tree) - unless you have a few pretty strong features or you train lots of trees, this won't work out. This is probably also the reason why most of the decision tree work in natural language processing is done using boosted decision trees of depth one. If your features are boolean, then such a model is in fact pretty similar to a simple logistic regression model. I have the impression that Random Forest in particular is a poor evidence accumulator (pooling evidence from lots of weak features) - linear models and boosted trees are much better here. Very interesting consideration. Any reference paper to study this in more detail (both theory and empirical validation)? actually, no - just gut feeling based on how decision trees / RF work (hard non-intersecting partitions) - I will try to dig something up - would definitely like to hear any critics/remarks on my view though. Also do you have a good paper that demonstrates state-of-the-art results with boosted stumps for NLP? I haven't seen any use of boosted stumps in NLP for a while - but maybe I didn't pay close attention - what comes to my mind is some work by Xavier Carreras on NER for CoNLL 2002 (see [1] for an overview of the shared task - actually, a number of participants used boosting/trees). Joseph Turian used boosting in his thesis on parsing [2]. [1] http://acl.ldc.upenn.edu/W/W02/W02-2024.pdf [2] http://cs.nyu.edu/web/Research/Theses/turian_joseph.pdf -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
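Peter's grid-search suggestion in concrete form, on synthetic data (with a current scikit-learn the import lives in ``sklearn.model_selection``, not the 2013-era ``grid_search`` module; the target is standardized first, since SGD is sensitive to the scale of y as well):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
y = (y - y.mean()) / y.std()   # SGD is sensitive to the target scale too

# constant learning rate, sweeping eta0 as suggested in the thread
param_grid = {"eta0": [0.1, 0.01, 0.001, 0.0001]}
sgd = SGDRegressor(learning_rate="constant", max_iter=1000, random_state=0)
search = GridSearchCV(sgd, param_grid, scoring="r2", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Setting ``verbose`` on the estimator, as Peter notes, prints the loss per epoch so divergence at too-large eta0 values is immediately visible.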
Re: [Scikit-learn-general] Our own Olivier Grisel giving a scipy keynote
That's great - congratulations Olivier! Definitely no pressure ;-) 2013/4/17 Ronnie Ghose ronnie.gh...@gmail.com wow :O congrats On Tue, Apr 16, 2013 at 7:17 PM, Mathieu Blondel math...@mblondel.org wrote: Very well-deserved. Congrats! On Wed, Apr 17, 2013 at 4:48 AM, Gael Varoquaux gael.varoqu...@normalesup.org wrote: I have been somewhat living under a rock lately, so I am not sure that it has been around this mailing list: @ogrisel is giving a keynote at scipy this year. http://conference.scipy.org/scipy2013/keynotes.php Hurray! Congratulations Olivier -- Precog is a next-generation analytics platform capable of advanced analytics on semi-structured data. The platform includes APIs for building apps and a phenomenal toolset for data science. Developers can use our toolset for easy data analysis and visualization. Get a free account! http://www2.precog.com/precogplatform/slashdotnewsletter -- Peter Prettenhofer
Re: [Scikit-learn-general] Sparse Matrix Formats
2013/4/15 Vlad Niculae zephy...@gmail.com:

It really depends on each estimator; there is no single format that is better every time. It's the same as with dense arrays and C versus Fortran ordering. I did a quick check on the supervised methods: the coordinate descent methods (ElasticNet, Lasso) use CSC format for sparse data and Fortran ordering for dense data. All others (SGD, LinearSVC, SVC, naive Bayes, Ridge) assume CSR format for sparse and C ordering for dense. Unfortunately I can't give an example off the top of my head, but I think that between SVC, LinearSVC and SGDClassifier, two of them must disagree on this. The best way to know is to thoroughly check the docs of the objects you're working with. If nothing is said there, go to the source code; the first couple of lines will often clue you in. Algorithms that have been optimized for a specific format will usually convert the data to that format before starting, via ``utils.check_arrays``: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L127

Cheers, Vlad

On Mon, Apr 15, 2013 at 4:00 AM, Philipp Singer kill...@gmail.com wrote: AFAIK scikit-learn works with CSR matrices internally, since many mathematical operations are only implemented for CSR matrices.

On 14.04.2013 20:01, Alex Kopp wrote: Is there a sparse matrix format that is most efficient for sklearn? (COO vs CSR vs LIL) Thanks
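Vlad's point - same data, different internal layout - can be illustrated with scipy directly (a minimal sketch):

```python
import numpy as np
from scipy import sparse

X = np.array([[0.0, 1.0, 0.0],
              [2.0, 0.0, 3.0]])

# CSR: fast row slicing / row-wise dot products (SGD, LinearSVC, ...)
X_csr = sparse.csr_matrix(X)
# CSC: fast column access (coordinate descent: Lasso, ElasticNet)
X_csc = sparse.csc_matrix(X)

# Conversion between the two is cheap relative to fitting, which is
# why estimators simply convert to their preferred format up front.
print(X_csr.format, X_csc.format)  # csr csc
print((X_csr.toarray() == X_csc.toarray()).all())  # True
```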
Re: [Scikit-learn-general] [Broken] scikit-learn/scikit-learn#1530 (master - af674ac)
Seems like Travis has trouble fetching from mldata (again) - can I ignore it, or should I trigger the Travis build again and hope it works out?

thx, Peter

2013/4/9 Travis CI notificati...@travis-ci.org: The build was broken. Repository: scikit-learn/scikit-learn. Build #1530: https://travis-ci.org/scikit-learn/scikit-learn/builds/6182009 Changeset: https://github.com/scikit-learn/scikit-learn/compare/382f74c9600f...af674acc878b Commit: af674ac (master). Message: get rid of ``rho`` in sgd documentation - has been replaced by ``l1_ratio``. Author: Peter Prettenhofer. Duration: 4 minutes and 47 seconds.

-- Peter Prettenhofer
Re: [Scikit-learn-general] [Broken] scikit-learn/scikit-learn#1530 (master - af674ac)
Ok - thanks!

2013/4/9 Olivier Grisel olivier.gri...@ensta.org:

2013/4/9 Peter Prettenhofer peter.prettenho...@gmail.com: Seems like Travis has trouble fetching from mldata (again) - can I ignore it or should I trigger the Travis build again and hope it works out?

You can ignore it. The problem is actually not that Travis has trouble fetching from mldata. The problem is that running the doctests on Travis ignores the fixture [1] that should be enabled by the setup.cfg file [2]. This fixture (which installs a mock urllib2.urlopen function to avoid using the network) has always worked on all the workstations I have used, and works on jenkins as well. Something in the Travis environment prevents it from running, though. No idea what.

[1] https://github.com/scikit-learn/scikit-learn/blob/master/doc/datasets/mldata_fixture.py
[2] https://github.com/scikit-learn/scikit-learn/blob/master/setup.cfg#L16

-- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel
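The fixture idea - swapping ``urlopen`` for a canned local response so doctests never hit the network - can be sketched with the stdlib (the payload below is made up; the real fixture serves fake mldata.org data instead):

```python
from unittest import mock
import io
import urllib.request

# Hypothetical replacement: always return a canned in-memory "response".
def fake_urlopen(url, *args, **kwargs):
    return io.BytesIO(b"canned response for %s" % url.encode())

# While the patch is active, no real network access happens.
with mock.patch.object(urllib.request, "urlopen", fake_urlopen):
    body = urllib.request.urlopen("http://mldata.org/some-dataset").read()

print(body)  # b'canned response for http://mldata.org/some-dataset'
```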
Re: [Scikit-learn-general] Scikit-learn-general Digest, Vol 39, Issue 13
Hi, I haven't used libFM (Factorization Machines) myself, but I've heard that others have used them quite successfully. Corey (Lynch) created Cython bindings for libFM: https://github.com/coreylynch/pyLibFM

best, Peter

2013/4/8 Andreas Mueller amuel...@ais.uni-bonn.de: Factorization machines is a 2010 paper with 20 citations. I think that is a clear no.

-- Peter Prettenhofer
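For context, a degree-2 factorization machine (what libFM fits) models pairwise feature interactions through latent vectors; a minimal numpy sketch of its prediction function, using the standard O(k·n) reformulation:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 FM: w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j."""
    linear = w0 + w @ x
    # O(k*n) identity: sum_{i<j} <V_i, V_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ]
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

rng = np.random.RandomState(0)
x = rng.rand(5)
w0, w, V = 0.1, rng.rand(5), rng.rand(5, 3)  # 3 latent factors

# Sanity check against the naive O(n^2) pairwise sum
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(5) for j in range(i + 1, 5))
print(np.isclose(fm_predict(x, w0, w, V), naive))  # True
```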
Re: [Scikit-learn-general] SO question for the tree growers
I posted a brief description of the algorithm. The method that we implement is briefly described in ESLII. Gilles is the expert here; he can give more details on the issue.

2013/4/4 Olivier Grisel olivier.gri...@ensta.org: The variable importance in scikit-learn's implementation of random forests is based on the proportion of samples that are classified by the feature at some point in the evaluation of one of the decision trees. http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation This method seems different from the OOB-based method of Breiman 2001 (section 10): http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf Is there any reference for the method implemented in scikit-learn? Here is the original Stack Overflow question: http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined/15811003?noredirect=1#comment22487062_15811003

-- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
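Whatever the exact formula, the importances under discussion are exposed on fitted forests via the ``feature_importances_`` attribute; a quick sketch on synthetic data (the data itself is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

print(importances.argmax())  # 0: the informative feature dominates
print(round(float(importances.sum()), 6))  # 1.0: importances are normalized
```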
Re: [Scikit-learn-general] OOB score in gradient boosting models
Hi Yanir, thanks for raising this issue. I implemented this feature without much thought; furthermore, I haven't used OOB estimates in my own work yet. I need to think more deeply about the issue - I will come back to you. You propose to update ``y_pred`` only for the in-bag samples, correct?

best, Peter

2013/3/22 Andreas Mueller amuel...@ais.uni-bonn.de: Hi Yanir. I was not aware that GradientBoosting had OOB scores. Is that even possible / sensible? It definitely does not do what it promises :-/ Peter, any thoughts? Cheers, Andy

On 03/22/2013 11:39 AM, Yanir Seroussi wrote: Hi, I'm new to the mailing list, so I apologise if this has been asked before. I want to use the oob_score_ in GradientBoostingRegressor to determine the optimal number of iterations without relying on an external validation set, so I set the subsample parameter to 0.5 and trained the model. However, I've noticed that oob_score_ improves in a similar manner to the in-bag scores (train_score_). That is, it goes down very fast and keeps improving regardless of the number of iterations. Digging through the code in ensemble/gradient_boosting.py, it seems like the cause is that oob_score_[i] includes previous trees that were trained on the OOB instances of the i-th sample. Isn't the OOB score supposed to be calculated for each OOB instance using only trees where this instance wasn't used for training (as done for random forests)?

Cheers, Yanir
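The broken ``oob_score_`` attribute discussed here was later reworked; recent scikit-learn releases instead expose per-iteration OOB improvements as ``oob_improvement_`` whenever ``subsample < 1``. A sketch of Yanir's use case - picking the number of iterations without a validation set - under that assumption:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=10.0,
                       random_state=0)

est = GradientBoostingRegressor(n_estimators=200, subsample=0.5,
                                random_state=0).fit(X, y)

# Cumulative OOB improvement; its argmax is a cheap estimate of the best
# number of boosting iterations.
cum_oob = np.cumsum(est.oob_improvement_)
best_n = int(np.argmax(cum_oob)) + 1
print(1 <= best_n <= 200)  # True
```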
Re: [Scikit-learn-general] OOB score in gradient boosting models
I've opened an issue for this: https://github.com/scikit-learn/scikit-learn/issues/1802

2013/3/22 Andreas Mueller amuel...@ais.uni-bonn.de: We should open an issue in the issue tracker.

-- Peter Prettenhofer
Re: [Scikit-learn-general] OOB score in gradient boosting models
2013/3/22 Yanir Seroussi yanir.serou...@gmail.com: Thanks for the quick response. Good to see that I'm not imagining things :-) Before posting this question, I had a look at Friedman's paper, ESLII and the R gbm documentation, but I couldn't find a clear description of how OOB estimates are calculated. I think it makes sense to have a separate y_oob_pred. I'll probably try fixing it locally over the weekend (unless you beat me to it). I'll let you know how it goes.

If you manage to fix it, a PR would be much appreciated! Please keep me posted about your progress.

thanks, peter

Cheers, Yanir

-- Peter Prettenhofer
Re: [Scikit-learn-general] Why Gaussian Naive Bayes is not working as a base classifier?
Issam, currently GaussianNB does not support sample weights, and thus it cannot be used with AdaBoost. In Weka, if a classifier does not support sample weights, they fall back to re-sampling the data set. We could implement this strategy as well, but it would not be very efficient given the data structures that we use internally (i.e. numpy arrays).

best, Peter

2013/3/7 Issam issamo...@gmail.com: Evening Dear Developers! I'm peculiarly getting an error while using AdaBoostClassifier with GaussianNB() as a base estimator. These are my commands:

In [65]: gnb = GaussianNB()
In [66]: bdt = AdaBoostClassifier(gnb, n_estimators=100)
In [67]: bdt.fit(X, y)

I get the following error after executing In [67]: TypeError: fit() got an unexpected keyword argument 'sample_weight' Any reason why I might be getting this? PS: I frequently use AdaBoost with naive Bayes as a base classifier in WEKA, hence the concern :) Thank you very much! Best regards, --Issam Laradji

-- Peter Prettenhofer
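The Weka-style fallback Peter describes - re-sampling the training set according to the boosting weights instead of passing ``sample_weight`` - can be sketched in a few lines (a hypothetical workaround, not scikit-learn API; the data is made up):

```python
import numpy as np

def resample_by_weight(X, y, sample_weight, random_state=0):
    """Draw a bootstrap sample where each row's draw probability is its weight."""
    rng = np.random.RandomState(random_state)
    p = np.asarray(sample_weight, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx]

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 1])
w = np.array([0.1, 0.1, 0.1, 0.1, 5.0])  # last sample heavily weighted

# Rows with larger weight are drawn more often in expectation, so an
# unweighted learner fit on (Xr, yr) approximates a weighted fit.
Xr, yr = resample_by_weight(X, y, w)
print(Xr.shape)  # (5, 2)
```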
Re: [Scikit-learn-general] one class svm probability
libsvm does not support probability outputs for one-class SVM. One-class SVM is an algorithm for support estimation (not proper density estimation) - i.e. you get a decision on whether P(X) > t, where the threshold t is somewhat concealed in the nu parameter.

2013/3/5 Lars Buitinck l.j.buiti...@uva.nl: 2013/3/5 Bill Power bill.power...@gmail.com: Investigating previous versions, I saw that probability was available in version 0.9 with the predict_proba and predict_log_proba functions http://scikit-learn.org/0.9/modules/generated/sklearn.svm.OneClassSVM.html but it's not here in the stable version http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html

The methods never worked, so they were pruned in a refactoring round.

-- Lars Buitinck Scientific programmer, ILPS University of Amsterdam

-- Peter Prettenhofer
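Although ``predict_proba`` is gone, ``decision_function`` still gives an uncalibrated score (signed distance to the learned boundary) that can rank points by outlierness - a quick sketch on made-up data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                  # inliers around the origin
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])  # one inlier, one clear outlier

clf = OneClassSVM(nu=0.1, gamma=0.5).fit(X_train)

# predict: +1 inside the estimated support, -1 outside
print(clf.predict(X_test))  # [ 1 -1]

# decision_function: a confidence score, but not a probability
scores = clf.decision_function(X_test)
print(scores[0] > scores[1])  # True
```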
Re: [Scikit-learn-general] How to load data into scikits
Hi David, I recommend that you load the data using pandas (``pandas.read_csv``). Scikit-learn does not support categorical features out of the box; you need to encode them as dummy variables (aka one-hot encoding). You can do this either using ``sklearn.feature_extraction.DictVectorizer`` or via ``pandas.get_dummies``.

HTH, Peter

2013/2/27 David Montgomery davidmontgom...@gmail.com: Hi, I have a data structure that looks like this:

1 NewYork 1 6 high
0 LA 3 4 low
...

I am trying to predict probability, where Y is column one. All of the attributes in X are categorical, and I will use a dtree regression. How do I load this data into y and X? Thanks

-- Peter Prettenhofer
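The DictVectorizer route can be sketched on David's two example rows (the column names are made up for illustration; string values get one-hot encoded as "name=value" columns, numeric values pass through unchanged):

```python
from sklearn.feature_extraction import DictVectorizer

rows = [
    {"city": "NewYork", "f1": 1.0, "f2": 6.0, "level": "high"},
    {"city": "LA", "f1": 3.0, "f2": 4.0, "level": "low"},
]
y = [1, 0]  # the first column of the original file

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

print(X.shape)  # (2, 6): city=LA, city=NewYork, f1, f2, level=high, level=low
```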
Re: [Scikit-learn-general] How to load data into scikits
2013/2/27 David Montgomery davidmontgom...@gmail.com: Ok... now I am really confused on how to interpret the tree. So... I am trying to build a probability estimation tree. All of the independent variables are categorical, and I created dummies. What is throwing me off are the <=. I should have a rule that says e.g. if city=LA,NY and TIME=Noon then .20. In the chart I see city=Dubai <= .5000. What does that mean?

city=Dubai <= 0.5 means that if the indicator variable city=Dubai is smaller than 0.5 (i.e. if city=Dubai is 0), then examples get routed down the left child, otherwise they get routed down the right child.

What I am trying to see is a chart that I would usually see in SPSS Answer Tree or SAS etc.

Since both SPSS and SAS are proprietary, I've no clue how those look.

So... how do I interpret city=Dubai <= .5000?

The split node basically asks: is the city feature not Dubai? If so, go down left, else right. In order to generate rules from decision trees you have to look at a whole path (from root to leaf). Currently, there is no way of extracting rules from decision trees - you have to write your own code that analyzes the tree structure.

My aim is to get a node id and to create SQL rules to extract data. Unless I am wrong, it appears that the dtree algo is not designed to extract rules or even assign a rule to a node id. Dtrees in scikits are solely for prediction. Is this a fair statement?

Correct - scikit-learn is mostly a machine learning library; in fact, AFAIK you were the first user to request such a feature.

I will be taking the *.dot file, not to graph, but to somehow parse the file so I can create my rules.

Better to operate on the DecisionTreeRegressor/Classifier.tree_ object. It represents the binary decision tree as a number of parallel arrays; you can find the documentation/code here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38

best, Peter

Thanks

On Wed, Feb 27, 2013 at 11:57 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Looks good to me - save the output to a file (e.g. foobar.dot) and run the following command:

$ dot -Tpdf foobar.dot -o foobar.pdf

When I open the pdf, all labels are displayed correctly. Remember that they are now indicator features, so the thresholds usually look like country=AU <= 0.5. You can find more information here: http://scikit-learn.org/dev/modules/tree.html#classification

2013/2/27 David Montgomery davidmontgom...@gmail.com: Thanks. I used DictVectorizer(). I am now trying to add labels to the tree graph. Below are the labels and the digraph Tree. However, I don't see labels on the tree nodes. Did I not use feature names correctly?

measurements = [
    {'country': 'US', 'city': 'Dubai'},
    {'country': 'US', 'city': 'London'},
    {'country': 'US', 'city': 'San Fransisco'},
    {'country': 'US', 'city': 'Dubai'},
    {'country': 'AU', 'city': 'Mel'},
    {'country': 'AU', 'city': 'Sydney'},
    {'country': 'AU', 'city': 'Mel'},
    {'country': 'AU', 'city': 'Sydney'},
    {'country': 'AU', 'city': 'Mel'},
    {'country': 'AU', 'city': 'Sydney'},
]
y = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
vec = DictVectorizer()
X = vec.fit_transform(measurements)
feature_name = vec.get_feature_names()
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X.todense(), y)
with open("au.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f, feature_names=feature_name)

feature_name = ['city=Dubai', 'city=London', 'city=Mel', 'city=San Fransisco', 'city=Sydney', 'country=AU', 'country=US']

digraph Tree {
0 [label="country=AU <= 0.5000\nerror = 2.1\nsamples = 10\nvalue = [ 0.7]", shape=box] ;
1 [label="city=Dubai <= 0.5000\nerror = 0.75\nsamples = 4\nvalue = [ 0.25]", shape=box] ;
0 -> 1 ;
2 [label="error = 0.0000\nsamples = 2\nvalue = [ 0.]", shape=box] ;
1 -> 2 ;
3 [label="error = 0.5000\nsamples = 2\nvalue = [ 0.5]", shape=box] ;
1 -> 3 ;
4 [label="error = 0.0000\nsamples = 6\nvalue = [ 1.]", shape=box] ;
0 -> 4 ;
}
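Peter's suggestion to analyze the ``tree_`` structure directly can be sketched as a small rule printer walking the parallel arrays (feature names and training data are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

tree = clf.tree_
feature_names = ["city=Dubai", "country=AU"]  # hypothetical dummy columns

def rules(node=0, path=()):
    """Yield one (conditions, value) pair per leaf, root-to-leaf."""
    if tree.children_left[node] == -1:  # -1 marks a leaf node
        yield path, tree.value[node]
        return
    name = feature_names[tree.feature[node]]
    thr = tree.threshold[node]
    yield from rules(tree.children_left[node], path + (f"{name} <= {thr:.2f}",))
    yield from rules(tree.children_right[node], path + (f"{name} > {thr:.2f}",))

# Each printed line is one root-to-leaf rule, ready to turn into SQL.
for conditions, value in rules():
    print(" AND ".join(conditions), "->", value.ravel())
```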
Re: [Scikit-learn-general] exporting/printing boost classifiers weaklearners
Hi, you should look into partial dependence plots [1] - they summarize the effect of certain features on the target response. Currently, our PDPs only support GradientBoostingRegressor/Classifier.

[1] http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence

best, Peter

2013/2/26 jo...@biociphers.org: Hello, I have been looking for a way to export boost classifiers. I know that I could print all the trees, but with 100 estimators that is not a good idea. I was thinking of summarizing the model by printing the weak learners and their weights. Is there an easy way to do that? Thanks for all, Jordi

-- Peter Prettenhofer
Re: [Scikit-learn-general] Packaging large objects
@ark: for 500K features and 3K classes your coef_ matrix will be:

500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB

coef_ is stored as a dense matrix - you might get a considerably smaller matrix if you use sparse regularization (which keeps most coefficients at zero) and convert the coef_ array to a scipy sparse matrix prior to saving the object - this should cut your storage costs by a factor of 10-100.

To check the sparsity of ``coef_`` use::

    sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)

To convert the coef_ array do::

    clf = ...  # your fitted model
    clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

Prediction currently doesn't work (it raises an error) when coef_ is a sparse matrix rather than a numpy array - this is a bug in sklearn that should be fixed - I'll submit a PR for it. In the meanwhile, please convert back to a numpy array or patch the SGDClassifier.decision_function method (adding ``dense_output=True`` when calling ``safe_sparse_dot`` should do the trick).

best, Peter

PS: I strongly recommend using sparse regularization (penalty='l1' or penalty='elasticnet') - this should cut the fraction of non-zero coefficients significantly.

2013/2/22 Ark 4rk@gmail.com: You could cut that in half by converting coef_ and optionally intercept_ to np.float32 (that's not officially supported, but with the current implementation it should work): clf.coef_ = clf.coef_.astype(np.float32) You could also try the HashingVectorizer in sklearn.feature_extraction and see if performance is still acceptable with a small number of features. That also skips storing the vocabulary, which I imagine will be quite large as well.

HashingVectorizer might indeed save some space... will test for an acceptable answer...

(I hope you meant 12000 documents *per class*?)

:( Unfortunately, no, I have 12000 documents in all... at least as a starting point. Initially it is just to collect metrics; as time goes on, more documents per category will be added. Besides, I am also limited on training time, which seems to go over an hour as the number of samples goes up... [My very first attempt was with 200k documents]. Thanks for the suggestions.
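The factor-of-10-100 claim is easy to check on a toy coefficient matrix (a sketch; the 99% zero density is an assumption standing in for an L1-regularized model, and the shape is scaled down from ark's):

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
coef = rng.randn(3000, 5000)              # stand-in for clf.coef_ (classes x features)
coef[rng.rand(*coef.shape) < 0.99] = 0.0  # ~99% zeros, as after L1 regularization

dense_bytes = coef.nbytes
coef_sparse = sparse.csr_matrix(coef)
sparse_bytes = (coef_sparse.data.nbytes
                + coef_sparse.indices.nbytes
                + coef_sparse.indptr.nbytes)

density = coef_sparse.nnz / float(coef.size)
print(density < 0.02)                   # True: only ~1% of entries survive
print(dense_bytes / sparse_bytes > 10)  # True: well over a 10x saving
```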
Re: [Scikit-learn-general] Packaging large objects
I just opened a PR for this issue: https://github.com/scikit-learn/scikit-learn/pull/1702 2013/2/22 Peter Prettenhofer peter.prettenho...@gmail.com: @ark: for 500K features and 3K classes your coef_ matrix will be: 50 * 3000 * 8 / 1024. / 1024. ~= 11GB Coef_ is stored as a dense matrix - you might get a considerable smaller matrix if you use sparse regularization (keeps most coefficients zero) and convert the coef_ array to a scipy sparse matrix prior to saving the object - this should cut your store costs by a factor of 10-100. To check the sparsity of ``coef_`` use:: sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size) To convert the coef_ array do:: clf = ... # your fitted model clf.coef_ = scipy.sparse.csr_matrix(clf.coef_) Prediction doesn't work currently (raises an Error) when coef_ is a sparse matrix rather than an numpy array - this is a bug in sklearn that should be fixed - I'll submit a PR for this. In the meanwhile please convert back to a numpy array or patch the SGDClassifier.decision_function method (adding ``dense_output=True`` when calling ``safe_sparse_dot`` should do the trick). best, Peter PS: I strongly recommend using sparse regularization (using penatly='l1' or penalty='elasticnet') - this should cut your sparsity significantly. 2013/2/22 Ark 4rk@gmail.com: You could cut that in half by converting coef_ and optionally intercept_ to np.float32 (that's not officially supported, but with the current implementation it should work): clf.coef_ = np.astype(clf.coef_, np.float32) You could also try the HashingVectorizer in sklearn.feature_extraction and see if performance is still acceptable with a small number of features. That also skips storing the vocabulary, which I imagine will be quite large as well. HashingVectorizer might indeed save some space...will test for acceptable answer... (I hope you meant 12000 document *per class*?) 
:( Unfortunately, no, I have 12000 documents in all... at least as a starting point. Initially it is just to collect metrics, and as time goes on, more documents per category will be added. Besides, I am also limited on training time, which seems to go over an hour as the number of samples goes up. [My very first attempt was with 200k documents]. Thanks for the suggestions. -- Everyone hates slow websites. So do we. Make your web apps faster with AppDynamics Download AppDynamics Lite for free today: http://p.sf.net/sfu/appdyn_d2d_feb ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -- Peter Prettenhofer
Re: [Scikit-learn-general] Packaging large objects
http://xkcd.com/394/ 2013/2/22 Olivier Grisel olivier.gri...@ensta.org: 2013/2/22 Peter Prettenhofer peter.prettenho...@gmail.com: @ark: for 500K features and 3K classes your coef_ matrix will be: 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB Nitpicking, that will be: 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GiB or: 500000 * 3000 * 8 / 1e9 ~= 12GB But nearly everybody is making the mistake... http://en.wikipedia.org/wiki/Gibibyte -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
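The GiB-vs-GB nitpick is easy to reproduce: the same byte count, divided by binary versus decimal unit factors.

```python
# The 500K-features x 3K-classes arithmetic from the thread, in both units.
n_bytes = 500000 * 3000 * 8      # float64 coefficients, 8 bytes each
gib = n_bytes / 1024.0 ** 3      # binary unit: gibibytes
gb = n_bytes / 1e9               # decimal unit: gigabytes
```

The same 12 billion bytes read as ~11.2 GiB or 12.0 GB depending on the unit, which is exactly the discrepancy Olivier points out.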
Re: [Scikit-learn-general] Random forests: Measuring information gain in multi-output
Hi Lukas, the impurity (in your case entropy) is simply averaged over all outputs - see [1] - the code is written in Cython (a Python dialect that compiles to C). best, Peter [1] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1482 2013/2/4 Ribonous ribonucle...@gmail.com: I think I understand how a random forest classifier works in the univariate case. Unfortunately I haven't found much information about how to implement a random forest classifier in the multi-output case. How does the random forest classifier in sklearn measure the information gain for a given split in the multi-output case? Can anyone point me to references on this? Also, is the random forest implementation written in Python or another language? Thanks, Lukas -- Peter Prettenhofer
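A minimal runnable illustration of the multi-output case: when ``Y`` has two columns, the entropy at each candidate split is computed per output and averaged, and predictions come back with one column per output.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Two output columns -> the impurity is averaged across outputs at each split.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
Y = np.column_stack([(X[:, 0] > 0.5).astype(int),
                     (X[:, 1] > 0.5).astype(int)])

clf = RandomForestClassifier(n_estimators=10, criterion='entropy',
                             random_state=0).fit(X, Y)
pred = clf.predict(X)   # one prediction per output column
```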
Re: [Scikit-learn-general] Using sklearn in Hadoop
Cool example - thanks Nick! 2013/2/4 Robert Kern robert.k...@gmail.com: On Mon, Feb 4, 2013 at 2:50 PM, Nick Pentreath nick.pentre...@gmail.com wrote: @Robert sorry for the delay in responding, I was away on vacation. Here's a link to a gist of a very simple implementation of parallelized SGD using Spark (https://gist.github.com/4707012). It basically replicates the existing Spark logistic regression example, but using sklearn's linear_model module. However, the approach used is iterative parameter mixtures (where the local weight vectors are averaged and the resulting weight vector rebroadcast) as opposed to distributed gradient descent (where the local gradients are aggregated, a gradient step taken on the master, and the weight vector rebroadcast) - see http://faculty.utpa.edu/reillycf/courses/CSCI6175-F11/papers/nips2010mannetal.pdf for some details. Very cool. Thanks! -- Robert Kern -- Peter Prettenhofer
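The core of the iterative-parameter-mixture step is just averaging the workers' weight vectors; the driver then rebroadcasts the mixture for the next pass. A sketch with made-up local weight vectors (the actual per-shard training is omitted):

```python
import numpy as np

# Hypothetical weight vectors from three workers, each trained on one shard.
local_weights = [np.array([0.2, 1.0]),
                 np.array([0.4, 0.8]),
                 np.array([0.3, 0.9])]

# Iterative parameter mixture: average the vectors, rebroadcast the result.
mixed = np.mean(local_weights, axis=0)
```

In the Spark version, `local_weights` would come from a map over partitions and the averaging from a reduce on the driver.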
Re: [Scikit-learn-general] adaptive learning rate?
no - SGDClassifier/SGDRegressor does not support per-feature learning rates. 2013/1/28 Ronnie Ghose ronnie.gh...@gmail.com: Is there an adaptive learning rate per feature in sklearn? E.g. --adaptive: use per-feature adaptive learning rates; this is sensible for highly diverse and variable features from https://github.com/JohnLangford/vowpal_wabbit/wiki/Malicious-URL-example -- Peter Prettenhofer
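For readers wondering what vowpal wabbit's `--adaptive` flag does: it is AdaGrad-style scaling, where each feature's step size shrinks with its own accumulated squared gradient. A hedged numpy sketch on squared loss - this is an illustration, not an sklearn feature:

```python
import numpy as np

# AdaGrad-style per-feature learning rates on a noiseless linear problem.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X.dot(w_true)

w = np.zeros(3)
g_sq = np.zeros(3)                 # per-feature sum of squared gradients
eta = 0.5
for _ in range(3):                 # a few passes over the data
    for xi, yi in zip(X, y):
        grad = (w.dot(xi) - yi) * xi          # squared-loss gradient
        g_sq += grad ** 2
        w -= eta * grad / (np.sqrt(g_sq) + 1e-8)  # per-feature step size

mse = np.mean((X.dot(w) - y) ** 2)
```

Rarely-seen features keep a large effective step size while frequent ones decay, which is why this helps with "highly diverse and variable features".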
Re: [Scikit-learn-general] Using sklearn in Hadoop
Hi Jaganadh, I once used Hadoop to implement grid search / multi-task learning with hadoop streaming. The setup was fairly simple: I put the serialized dataset (joblib dump) on HDFS and created an input file - one line for each parameter setting for grid search. The map script deserialized the dataset from HDFS (in the init of the script), and for each map task (= parameter setting) it trained a model, computed the prediction error, and emitted it. You can find some of the code here [1]. I used Hadoop because I had a Hadoop cluster at my disposal - nowadays I'd use IPython.parallel and starcluster instead - much simpler IMHO. best, Peter [1] https://github.com/pprett/nut/blob/master/nut/structlearn/dumbomapper.py (this is the mapper script; the code which creates the input files and puts everything onto HDFS is in the auxstrategy.py file) 2013/1/23 JAGANADH G jagana...@gmail.com: Hi All, Has anybody tried using sklearn with Hadoop/Dumbo or hadoop streaming? Please share your thoughts and experience. Best regards -- JAGANADH G http://jaganadhg.in ILUGCBE http://ilugcbe.org.in -- Peter Prettenhofer
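The "one line per parameter setting" input file Peter describes can be generated like this - a hedged sketch with made-up grid values; each hadoop-streaming mapper would then parse one line, train with those parameters on the HDFS-cached dataset, and emit the validation error:

```python
import itertools
import json

# Hypothetical parameter grid for the streaming job's input file.
grid = {'C': [0.1, 1.0, 10.0], 'penalty': ['l1', 'l2']}

keys = sorted(grid)
# One JSON line per point of the Cartesian product of the grid.
lines = [json.dumps(dict(zip(keys, vals)))
         for vals in itertools.product(*(grid[k] for k in keys))]
```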
Re: [Scikit-learn-general] ANN: scikit-learn 0.13 released!
according to the help, the error msg shows up when the form creator stopped collecting responses by unchecking "Accepting responses" in the Form menu (under the Tools menu). [1] [1] http://support.google.com/drive/bin/answer.py?hl=en&answer=1715669 2013/1/22 Mathieu Blondel math...@mblondel.org: The link to the survey doesn't work. Mathieu -- Peter Prettenhofer
Re: [Scikit-learn-general] Gradient boosting complexity
2013/1/13 Erik Bernhardsson erikb...@spotify.com: Just a quick question about the gradient boosting in scikit-learn. We have tons of data to regress on (like 100M data points), but the running time of the algorithm is linear in the size of X no matter what subsample is set to. Hi Erik, the problem pertains not to gradient boosting but to our (current) decision tree implementation. We use a bit mask (aka sample_mask) to represent partitions of X. As you said, the algorithm is actually linear in len(X) but only considers rows of X for which sample_mask is True [1] - so ``subsample == 0.5`` should run faster than ``subsample == 1.0``, but it's slower than passing X_subsample = X[np.random.rand(len(X)) < 0.5] directly to the fit method. When the ``sample_mask`` gets too sparse (i.e. too many entries are False), the algorithm spends most of its time checking the sample_mask - not very efficient. Hence, we use a heuristic to make sure that when the sample_mask gets too sparse (see ``min_density`` parameter) we copy X and discard all rows where sample_mask is False [2] - this, however, incurs both memory and runtime costs which have to be amortized. Since trees in gradient boosting are usually shallow, I decided to turn off this heuristic (see [3]) - please try setting ``self.min_density = 0.1`` and test whether you get a performance increase. If ``subsample`` is smaller than ``min_density``, each tree will trigger a copy of X. We (Brian, Gilles, Andy, and me) are not totally happy with our current sample_mask-based tree implementation - personally, I think it can be sped up considerably - but I think removing the sample_mask would require a complete re-write of the tree building procedure. The crux is to represent partitions efficiently while keeping auxiliary data structures (i.e. X_argsorted - a DS that holds, for each feature, the list of examples sorted by ascending feature value) in sync. We have discussed various approaches to get rid of our sample_mask approach in this issue [4].
If you want to leverage the whole dataset (100M) you might want to explore a different approach as well: you could take a subsample (100k) and train a GBRT (e.g. 1000 trees) on that; then you can use this GBRT as a non-linear feature detector and augment each of the 100M examples with 1000 new features given by the output of each tree in the GBRT model. Now you can feed the new dataset into a linear model that scales to such a large dataset (e.g. vowpal wabbit). best, Peter [1] see function _smallest_sample_larger_than; https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1826 [2] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L511 [3] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L563 [4] https://github.com/scikit-learn/scikit-learn/issues/964 (closed - discussion continues in https://github.com/scikit-learn/scikit-learn/issues/1435 ) Right now we just sample say 100k data points and run gradient boosting on it, but it would be nice if we could use a much larger data set. See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L587 for the code - basically instead of subsampling, the algorithm just creates a random binary mask. It would be nice if it were linear in len(X) * subsample, because then we could set subsample to a very small number and use a lot more data points. That should reduce overfitting with no real disadvantages (afaik). I'm new to gradient boosting and I don't know it that well. Is there a fundamental reason why you can't make it linear in len(X) * subsample? Otherwise I might try to put together a patch for it. Thanks!
-- Peter Prettenhofer
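The GBRT-as-feature-detector idea can be sketched with today's scikit-learn API (``apply()`` did not exist when this thread was written, so take this as an assumption-laden illustration, not the thread's original code): fit on a subsample, then use each tree's leaf index as a new categorical feature for every row.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X[:, 0] + 0.1 * rng.randn(200)

# Fit the GBRT on a subsample only (stand-in for the "100k of 100M" idea).
gbrt = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                 random_state=0)
gbrt.fit(X[:100], y[:100])

# apply() returns the leaf id per (sample, tree); one-hot encoding these
# yields sparse features for a scalable linear learner (e.g. vowpal wabbit).
leaves = gbrt.apply(X)
```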
Re: [Scikit-learn-general] Multivariate Adaptive Regression Splines (MARS, aka earth)
2013/1/10 Lars Buitinck l.j.buiti...@uva.nl: 2013/1/10 Jason Rudy ja...@clinicast.net: I'm working on an implementation of MARS [1] that I'd like to share, and it seems like sklearn would be a good place for it. The MARS algorithm is currently available as part of the R package earth and is one of the only reasons I still use R. Would sklearn be a good place for such an algorithm? Are there any guidelines or procedures I should be aware of before contributing? I'd love to see MARS in sklearn - is your implementation currently publicly available? I guess that would fit in scikit-learn, but I'm not an expert on fancy regression analysis. The contributor guidelines can be found here: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md In addition, make sure that (1) you own the code or your employer is ok with you publishing it under BSD license terms, and (2) apparently MARS is a trademark, so call the estimator something else, like EarthRegressor or MARegressionSplines. -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam -- Peter Prettenhofer
Re: [Scikit-learn-general] GridSearchCV does not work with SGDRegressor
great - thanks Andy! 2013/1/8 Andreas Mueller amuel...@ais.uni-bonn.de: On 01/08/2013 09:57 AM, Andreas Mueller wrote: On 01/08/2013 09:49 AM, Ronnie Ghose wrote: yay :) Sorry, I was too fast. that was not the problem :( D'oh. yes it was. Double d'oh. I need to get some coffee, sorry -- Peter Prettenhofer
Re: [Scikit-learn-general] Upgraded jenkins environment for matplotlib testing
thanks! 2012/12/4 Andreas Mueller amuel...@ais.uni-bonn.de: On 04.12.2012 12:35, Olivier Grisel wrote: I have updated the virtualenvs of the jenkins vm to use: - ubuntu LTS matplotlib 0.99.1 on python 2.6 - latest stable matplotlib 1.2.0 on python 2.7 Thanks a lot :) -- Peter Prettenhofer
Re: [Scikit-learn-general] Shape of classes_ varies?
I assume this is because they support multiple outputs; let's keep @gilles posted. 2012/11/29 Doug Coleman doug.cole...@gmail.com: I forgot to include the line where I fit clf1. -- Peter Prettenhofer
[Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn
Some more benchmarks from wise.io: http://continuum.io/blog/wiserf-use-cases-and-benchmarks quite impressive indeed - unfortunately I cannot post any comments on the blog - I wonder if they use some sort of binned split evaluation [1] instead of exact split evaluation (wiseRF has slightly lower accuracy scores). Maybe they want to contribute their code :-D best, Peter [1] http://hunch.net/~large_scale_survey/TreeEnsembles.pdf -- Peter Prettenhofer
Re: [Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn
2012/11/28 Andreas Mueller amuel...@ais.uni-bonn.de: On 28.11.2012 16:46, Mathieu Blondel wrote: On Thu, Nov 29, 2012 at 12:33 AM, Andreas Mueller amuel...@ais.uni-bonn.de wrote: Do you see where the sometimes 100x comes from? Not from what he demonstrates, right? scikit-learn is really bad when n_jobs=10. I would be interested in knowing if the performance gains are mostly coming from the fact that wiseRF is written in C++ or if they had to use algorithmic improvements. Why should C++ be any faster than Cython? amongst others: template metaprogramming - see http://lingpipe-blog.com/2011/07/01/why-is-c-so-fast/ if the input data is float64 you need to take conversion to float32 into account; furthermore sklearn will convert to fortran layout - this will give a huge penalty in memory consumption. Templating number of bins in leafs? Maybe they learned a model to pick good default values for the forest for a dataset ;) in terms of algorithms and split point evaluation: different strategies are more appropriate for different feature types (lots vs. few split points); -- Peter Prettenhofer
Re: [Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn
2012/11/28 Mathieu Blondel math...@mblondel.org: scikit-learn's RF is entirely written in Python (forest.py) so there may still be some slow code paths. Moreover, their parallel implementation is probably written with pthreads or OpenMP, so they bypass the problems that we have with Python's multiprocessing module. I think this overhead is marginal - at the end of the day most time is spent on building the trees, and there is certainly room for improvement there. -- Peter Prettenhofer
Re: [Scikit-learn-general] Problem unpickling 0.11 RF model in 0.12/0.13
Hi Nicolas, unfortunately the two versions are not compatible - we made some modifications (speed enhancements) to the tree module in version 0.12 that break serialization with older versions. The only way to tackle this is to do as Leon proposed: extract the state of the old trees (sklearn.tree.tree.Tree), create new ones, and copy the state over (see sklearn.tree._tree.Tree attributes). I haven't done this myself, to be honest, and I would rather retrain a new RandomForest. Sorry for the inconvenience caused. best, Peter 2012/11/20 Leon Palafox leonoe...@gmail.com: I'm not a developer, but a fast, ugly solution (well, I do not know how fast) would be to do a script that unpacks everything using the old sklearn and repacks it using the new one. Best On Tue, Nov 20, 2012 at 6:23 PM, Fechner, Nikolas nikolas.fech...@novartis.com wrote: Hi all, I've built a random forest model using scikit-learn 0.11 and stored it for subsequent application as a pickled file using sklearn.externals.joblib. Now, I have started looking into a migration scheme to later scikit-learn versions and noticed that it is apparently not possible to unpickle the stored model using scikit-learn 0.12 or 0.13.
This is the error I get (reproducible on Mac and Linux systems):

/Library/Python/2.7/site-packages/scikit_learn-0.12.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/numpy_pickle.pyc in load(filename, mmap_mode)
    416
    417     try:
--> 418         obj = unpickler.load()
    419     finally:
    420         if hasattr(unpickler, 'file_handle'):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load(self)
    856         while 1:
    857             key = read(1)
--> 858             dispatch[key](self)
    859     except _Stop, stopinst:
    860         return stopinst.value

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load_global(self)
   1088         module = self.readline()[:-1]
   1089         name = self.readline()[:-1]
-> 1090         klass = self.find_class(module, name)
   1091         self.append(klass)
   1092     dispatch[GLOBAL] = load_global

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in find_class(self, module, name)
   1124         __import__(module)
   1125         mod = sys.modules[module]
-> 1126         klass = getattr(mod, name)
   1127         return klass
   1128

AttributeError: 'module' object has no attribute '_find_best_split'

Is this something that could be fixed somehow, and more importantly, is it to be expected that it will be an ongoing problem that loading models built with previous versions causes problems? Many thanks in advance for any comments. Cheers, Nikolas -- Leon Palafox, M.Sc PhD Candidate Iba Laboratory +81-3-5841-8436 University of Tokyo Tokyo, Japan.
-- Peter Prettenhofer
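Until pickles are portable across versions, a defensive pattern is to persist the library version next to the model and refuse to load on a mismatch, retraining instead. This is a sketch of a convention, not an official sklearn mechanism; the payload layout is made up.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = (X[:, 0] > 0.5).astype(int)
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Store the training-time version alongside the estimator.
path = os.path.join(tempfile.gettempdir(), 'rf_model.joblib')
joblib.dump({'model': clf, 'sklearn_version': sklearn.__version__}, path)

payload = joblib.load(path)
if payload['sklearn_version'] != sklearn.__version__:
    raise RuntimeError('version mismatch - retrain instead of loading')
model = payload['model']
```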
Re: [Scikit-learn-general] func with args float / double to C f_f32 / f_f64
2012/11/19 denis denis-bz...@t-online.de: Folks, from a python function with args that may be float or double I want to call a corresponding C function f_f32 or f_f64. Is there a better way than cython like cdef extern from ...: int f_f32( float* A, float* B ) int f_f64( double* A, double* B ) ... def func_float_or_double( np.ndarray A, np.ndarray B ): assert A.dtype is B.dtype if A.dtype.name == 'float32': return f_f32( A, B ) elif A.dtype.name == 'float64': return f_f64( A, B ) ... Try calling the C functions with the data buffers of ``A`` and ``B``, casting to the matching pointer type:: def func_float_or_double( np.ndarray A, np.ndarray B ): assert A.dtype == B.dtype if A.dtype.name == 'float32': return f_f32( <float*> A.data, <float*> B.data ) elif A.dtype.name == 'float64': return f_f64( <double*> A.data, <double*> B.data ) (This may be more of a cython question, but you sklearn people must do this often?) thanks, cheers -- denis -- Peter Prettenhofer
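The same dtype-based dispatch can be tested in pure Python without compiling anything; here ``f_f32``/``f_f64`` are hypothetical stand-ins for the external C routines in the question.

```python
import numpy as np

# Hypothetical per-dtype implementations standing in for the C functions.
def f_f32(a, b):
    return np.float32(a.sum() + b.sum())

def f_f64(a, b):
    return np.float64(a.sum() + b.sum())

# Map each supported dtype to its implementation.
_DISPATCH = {np.dtype(np.float32): f_f32, np.dtype(np.float64): f_f64}

def func_float_or_double(a, b):
    assert a.dtype == b.dtype
    return _DISPATCH[a.dtype](a, b)
```

A dict keyed on ``np.dtype`` avoids the if/elif chain and raises a clear ``KeyError`` for unsupported dtypes.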
Re: [Scikit-learn-general] RandomForest benchmark
Olivier, I tested it with the joblib PR - results got a bit worse, see below. best, Peter

arcene            r               py
score    0.2700 (0.03)   0.2633 (0.02)
train    3.9454 (0.09)   4.6661 (0.20)
test     0.2199 (0.00)   0.2985 (0.05)

landsat           r               py
score    0.0255 (0.00)   0.0552 (0.00)
train    2.3184 (0.02)   3.8349 (0.06)
test     0.1129 (0.00)   0.3513 (0.01)

spam              r               py
score    0.0549 (0.00)   0.0664 (0.00)
train    1.6380 (0.01)   2.1307 (0.02)
test     0.0379 (0.00)   0.3311 (0.00)

random_gaussian   r               py
score    0.1449 (0.00)   0.1487 (0.01)
train    0.3371 (0.01)   1.3574 (0.04)
test     0.1502 (0.00)   0.3247 (0.05)

madelon           r               py
score    0.4061 (0.01)   0.3867 (0.02)
train    10.0216 (0.08)  10.4346 (0.08)
test     0.0980 (0.00)   0.3221 (0.02)

2012/11/17 Olivier Grisel olivier.gri...@ensta.org: You can retry by replacing the sklearn/externals/joblib folder with the joblib folder of this branch: https://github.com/joblib/joblib/pull/44 -- Peter Prettenhofer
Re: [Scikit-learn-general] set target_names / importance of features in a trained model
2012/11/12 paul.czodrow...@merckgroup.com: Dear SciKitters, given an array of (n_samples, n_features) - how do I assign target_names in a concluding step? The target_names are stored in a list and, of course, have the same order as the n_features vector. In a next step, I would like to dump out the importance of the most relevant features. How can this be done in scikit-learn? In particular, I have trained a random forest and would like to dump out the leaves of this RF. Hi Paul, this example shows how to access the feature importances computed by a RF:: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py In order to access the ``feature_importances_`` attribute you have to pass the argument ``compute_importances=True``. To access the leaves of a decision tree you need to access the ``tree_`` attribute of a DecisionTree - it is basically a collection of parallel arrays that represent the tree (see sklearn.tree._tree). All indices where ``children_left`` and ``children_right`` are -1 are leaves. best, Peter Cheers & Thanks, Paul
--
Peter Prettenhofer
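Peter's pointers above can be sketched in code. A minimal sketch on synthetic data, assuming a recent scikit-learn where ``feature_importances_`` is available without the older ``compute_importances=True`` argument; the feature names are hypothetical stand-ins for Paul's target_names list:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the (n_samples, n_features) array.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
names = ["f%d" % i for i in range(10)]  # stand-in for the target_names list

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Pair each feature name with its importance, most relevant first.
ranked = sorted(zip(names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)

# Leaves of the first tree: indices in the parallel arrays of
# clf.estimators_[0].tree_ where children_left is -1.
tree = clf.estimators_[0].tree_
leaves = np.where(tree.children_left == -1)[0]
```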
Re: [Scikit-learn-general] Panda / Tree and Random Forest
Didier,

what type is ``feature`` (simply print ``type(feature)``)? Considering your
first email I suspect it's a pandas.DataFrame; scikit-learn estimators
require array-like inputs - so please do
``clf.fit(features.values, labels.values.ravel())`` instead of
``clf.fit(features, labels)``. 15 is quite a lot, but if you just want to fit
5 trees it should run in under 15 seconds (I tested using random data and
binary classification).

best,
Peter

2012/10/24 Didier Vila dv...@capquestco.com:
> Thanks, I will have a look.
>
> Didier Vila, PhD | Risk | CapQuest Group Ltd | Fleet 27 | Rye Close |
> Fleet | Hampshire | GU51 2QQ | Fax: 0871 574 2992 | Email:
> dv...@capquestco.com
>
> -----Original Message-----
> From: Andreas Mueller [mailto:amuel...@ais.uni-bonn.de]
> Sent: 24 October 2012 15:44
> To: scikit-learn-general@lists.sourceforge.net
> Subject: Re: [Scikit-learn-general] Panda / Tree and Random Forest
>
> As an addition, maybe it would be good for you to have a look into the
> tutorial: http://scikit-learn.org/dev/tutorial/basic/tutorial.html
--
Peter Prettenhofer
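The fix above can be sketched as follows. The DataFrame shapes and column names are hypothetical; note also that current scikit-learn versions accept DataFrames directly, so the ``.values`` conversion was mainly needed on the releases current at the time of this thread:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
features = pd.DataFrame(rng.rand(100, 4), columns=list("abcd"))
labels = pd.DataFrame(rng.randint(0, 2, size=(100, 1)), columns=["y"])

clf = RandomForestClassifier(n_estimators=5, random_state=0)
# .values gives the underlying numpy array; ravel() flattens the
# (n_samples, 1) label column to the 1-d shape fit() expects.
clf.fit(features.values, labels.values.ravel())
```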
Re: [Scikit-learn-general] Panda / Tree and Random Forest
2012/10/24 Didier Vila dv...@capquestco.com:
> Peter,
>
> Thanks for the email. I just started to use Pandas this morning. Features
> are integer (binary or 0-1-2-3) or real. Note that my target variable is
> continuous between 0 and 1.

Ok - then that's the problem: for regression problems you have to use
RandomForestRegressor instead of RandomForestClassifier.

best,
Peter

> I just ran your code below and I still have the same issue:
>
> clf.fit(feature.values, label.values.ravel())
>
> Regards,
> Didier
>
> Ps: the initial code worked for 100 samples.
--
Peter Prettenhofer
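Peter's fix - swapping RandomForestClassifier for RandomForestRegressor when the target is continuous - can be sketched on synthetic data; the feature encoding (integer 0-3 columns) and target range mirror Didier's description, but the shapes are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
# Integer-coded features (0-1-2-3), as described in the thread.
X = rng.randint(0, 4, size=(200, 6)).astype(float)
# Continuous target between 0 and 1 -> a regression problem.
y = rng.rand(200)

reg = RandomForestRegressor(n_estimators=5, random_state=0)
reg.fit(X, y)
pred = reg.predict(X[:5])
```

Since the regressor averages training targets in each leaf, the predictions stay within the observed [0, 1] range of the target.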