Re: [Scikit-learn-general] [ANN] scikit-learn 0.16.0 is out!
Hurray, great work everybody!

2015-03-27 19:51 GMT+01:00 Gael Varoquaux gael.varoqu...@normalesup.org:
Works for me. Could you try refreshing your browser cache (Ctrl+Shift+R on some browsers)?
Gaël

On Fri, Mar 27, 2015 at 06:23:06PM, Jason Sanchez wrote:
Update: for me, the stable documentation works, but the 0.16 documentation does not.
Works: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
Does not work: http://scikit-learn.org/0.16/auto_examples/cluster/plot_cluster_comparison.html

I have seen the updated images in both 0.16 and 0.15; the 0.16 algorithms show lower running times than in 0.15.
Wei

On Fri, Mar 27, 2015 at 1:14 PM, Jason Sanchez jason.sanchez.m...@statefarm.com wrote:
The documentation for the release does not seem to include any of the images. Perhaps this is just showing on my end. Example:
0.16: http://scikit-learn.org/0.16/auto_examples/cluster/plot_cluster_comparison.html
0.15: http://scikit-learn.org/0.15/auto_examples/cluster/plot_cluster_comparison.html

--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux

-- Peter Prettenhofer

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] Welcome new core contributors
gogogo team!!

2014-10-13 9:19 GMT+02:00 Arnaud Joly a.j...@ulg.ac.be:
Congratulations!
Arnaud

On 13 Oct 2014, at 03:13, Kyle Kastner kastnerk...@gmail.com wrote:
Thanks everyone! There are some nice new extensions planned for that algorithm (randomized SVD!) once I get a moment to submit the proper PR. I am happy to be able to contribute to such an awesome group :)

On Sun, Oct 12, 2014 at 3:55 PM, abhishek abhish...@gmail.com wrote:
Congrats Kyle! I was waiting for this eagerly.

On Oct 12, 2014 9:31 PM, Robert Layton robertlay...@gmail.com wrote:
Congrats!

On 13 October 2014 05:42, Manoj Kumar manojkumarsivaraj...@gmail.com wrote:
Thanks Gaël, it's a pleasure. Looking forward to learning and contributing more.

On Sun, Oct 12, 2014 at 5:24 PM, Gael Varoquaux gael.varoqu...@normalesup.org wrote:
I am happy to welcome new core contributors to scikit-learn:
- Alexander Fabisch (@AlexanderFabisch)
- Kyle Kastner (@kastnerkyle)
- Manoj Kumar (@MechCoder)
- Noel Dawe (@ndawe)
Thank you all for your hard work on scikit-learn, and welcome to the team!
Gaël

--
Godspeed, Manoj Kumar, Mech Undergrad
http://manojbits.wordpress.com

-- Peter Prettenhofer
Re: [Scikit-learn-general] Sparse Gradient Boosting Fully Corrective Gradient Boosting
A key advantage of using RuleFit [1] -- striking that they didn't cite it, by the way -- is that if you add the original features, your model can (a) better incorporate additive effects and (b) extrapolate; the latter is a limitation of any tree-based method like GBRT or RF.

[1] http://statweb.stanford.edu/~jhf/R-RuleFit.html

2014-09-22 20:48 GMT+02:00 Olivier Grisel olivier.gri...@ensta.org:
2014-09-21 10:46 GMT+02:00 Mathieu Blondel math...@mblondel.org:
On Sun, Sep 21, 2014 at 1:55 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
On a related note, here is an implementation of Logistic Regression applied to one-hot features obtained from the leaf membership info of a GBRT model:
http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb#Using-the-boosted-trees-to-extract-features-for-a-Logistic-Regression-model
This is inspired by this paper from Facebook: https://www.facebook.com/publications/329190253909587/. It's easy to implement and seems to work quite well.

What is the advantage of this method over using GBRT directly?

A significant improvement in F1-score for the positive/minority class and in ROC AUC on this dataset (Adult Census binarized income prediction with integer encoding of the categorical variables). Apparently the Facebook ads team reported the same kind of improvement on their own data.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
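The leaf-features trick Olivier describes can be sketched with public scikit-learn APIs: fit a GBRT, read out each sample's leaf per tree with `apply`, one-hot encode those leaf ids, and fit a logistic regression on top. The synthetic dataset and all parameter values below are illustrative assumptions, not taken from the notebook:

```python
# Sketch of GBRT leaf membership -> one-hot features -> LogisticRegression.
# Dataset and hyperparameters are illustrative, not from the thread.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(n_estimators=50, random_state=0)
gbrt.fit(X_train, y_train)

# leaf id of each sample in each tree; for binary problems the trailing
# class axis of apply() has size 1, so we drop it
leaves_train = gbrt.apply(X_train)[:, :, 0]
leaves_test = gbrt.apply(X_test)[:, :, 0]

# one-hot encode leaf membership and fit a linear model on top
enc = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves_train), y_train)
acc = lr.score(enc.transform(leaves_test), y_test)
```

The linear model re-weights the regions carved out by the trees, which is where the reported F1/AUC gains come from.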
Re: [Scikit-learn-general] Sparse Gradient Boosting Fully Corrective Gradient Boosting
The only reference I know is the Regularized Greedy Forest paper by Johnson and Zhang [1]; I haven't read the primary source (by Zhang as well).

[1] http://arxiv.org/abs/1109.0887

2014-09-16 15:15 GMT+02:00 Mathieu Blondel math...@mblondel.org:
Could you give a reference for gradient boosting with fully corrective updates? Since the philosophy of gradient boosting is to fit each tree against the residuals (or negative gradient) so far, I am wondering how such a fully corrective update would work...
Mathieu

On Tue, Sep 16, 2014 at 9:16 AM, c TAKES ctakesli...@gmail.com wrote:
Is anyone working on making Gradient Boosting Regressor work with sparse matrices? Or is anyone working on adding an option for fully corrective gradient boosting, i.e. all trees in the ensemble are re-weighted at each iteration? These are things I would like to see and may be able to help with if no one is currently working on them.

-- Peter Prettenhofer
Re: [Scikit-learn-general] Bug in OneClassSVM
Hi Luca,

it segfaults?! Can you confirm that it also segfaults if you use the default arguments? There is no plot, so I cannot say anything about the strange decision boundaries. For my part, I've never used anything other than an RBF kernel for a one-class SVM; the RBF kernel has the nice property that all data points lie on the surface of a hypersphere, so the minimum enclosing ball is just the hyperplane that separates those points from the origin with the maximum distance to the origin.

2014-09-15 10:58 GMT+02:00 Luca Puggini lucapug...@gmail.com:
Hi, I am having some problems with the OneClassSVM function. Here you can see my file and the output: http://justpaste.it/h3pw I am sorry, but I cannot share the data used. I have also experienced other problems, like strange decision boundaries. Can someone tell me if I am doing something wrong, or if there is a problem in the function?
Thanks, Luca

-- Peter Prettenhofer
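Since Luca's data cannot be shared, a minimal synthetic reproduction with the default RBF kernel is the natural next debugging step. The data and parameter values below are assumptions for illustration only:

```python
# Minimal OneClassSVM sanity check on synthetic data (not Luca's data):
# a Gaussian inlier cloud plus a batch of clearly distant points.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                            # inliers around the origin
X_far = rng.uniform(low=5.0, high=6.0, size=(20, 2))   # points far from the cloud

clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(X_train)

pred_train = clf.predict(X_train)  # +1 = inlier, -1 = outlier
pred_far = clf.predict(X_far)
```

If a setup like this runs cleanly, the segfault is more likely triggered by the specific arguments or data (e.g. NaNs or a degenerate kernel matrix) than by the estimator itself.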
Re: [Scikit-learn-general] outlier measure random forest
+1 -- looks like a very handy 3-liner :)

2014-09-08 16:14 GMT+02:00 Gilles Louppe g.lou...@gmail.com:
Hi Luca,

This may not be the fastest implementation, but random forest proximities can be computed quite straightforwardly in Python given our 'apply' function. See for instance:
https://github.com/glouppe/phd-thesis/blob/master/scripts/ch4_proximity.py#L12

From a personal point of view, I never use them, but since this is quite standard in other random forest implementations, it could be a nice little contribution. I don't know where it should go in scikit-learn, though, since it very much looks like a pairwise metric. What do other tree growers think?

Cheers, Gilles

On 8 September 2014 11:05, Luca Puggini lucapug...@gmail.com wrote:
Hi, for personal reasons I am writing a function to compute the outlier measure from random forests:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#outliers
With a little more work I can include the function in the sklearn random forest class. Is the community interested? Should I do it? I think this would be useful. The function is already available in Matlab:
http://www.mathworks.co.uk/help/stats/compacttreebagger-class.html
Let me know.
Best, Luca

-- Peter Prettenhofer
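The "handy 3-liner" Gilles refers to amounts to comparing leaf memberships across trees; a hedged sketch (dataset and forest size are illustrative assumptions):

```python
# Random forest proximities via the public `apply` API, as discussed above.
# proximity[i, j] = fraction of trees in which samples i and j share a leaf.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# leaf index of every sample in every tree; shape (n_samples, n_trees)
leaves = forest.apply(X)

# broadcast-compare leaf ids pairwise, then average over trees
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)
```

Note this materializes an (n_samples, n_samples, n_trees) boolean array, so for large datasets a chunked loop over tree columns would be needed.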
Re: [Scikit-learn-general] Libsvm, probabilities and weights
Thanks Mathieu, I agree -- a calibration module would be good to have anyway. I filed an issue on libsvm's GitHub account [1].

[1] https://github.com/cjlin1/libsvm/issues/13

2014-08-13 3:00 GMT+02:00 Mathieu Blondel math...@mblondel.org:
sample_weight support in scikit-learn comes from a libsvm patch:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances
So it would seem probability calibration was omitted from this patch :-( When our calibration module is ready, we could handle the calibration post-processing ourselves in pure Python. Could you report an issue?
Mathieu

On Wed, Aug 13, 2014 at 3:33 AM, Peter Prettenhofer peter.prettenho...@gmail.com wrote:
SVC doesn't take class/sample weights into account when calibrating probabilities -- this seems like a bug to me...
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/libsvm/svm.cpp#L1895
best, Peter

-- Peter Prettenhofer
Re: [Scikit-learn-general] Sparse data, SGD, and intercept_decay
The way I implemented it, the learning rate for the intercept is 0.01 times the learning rate for the other features. The value of 0.01 is something I set empirically: I adopted it from Léon Bottou's sgd project and experimented with different values. I found that lower intercept learning rates help a bit, but the concrete value is not too important, so I decided to use a fixed value. I think the decay value might in fact be a function of the number of non-zero values per feature. If you have a dataset with both sparse and dense features, then intercept decay should be turned off -- alternatively, you can scale the dense features to decrease their magnitude.

2014-07-30 11:42 GMT+02:00 Danny Sullivan dsulliv...@hotmail.com:
I found that for sparse data, the scikit implementation of SGD uses an intercept_decay variable set to 0.01 (SPARSE_INTERCEPT_DECAY) to avoid intercept oscillation. Shouldn't this be determined by the learning_rate instead? I'm asking because it adds a layer of tuning that the user doesn't have control over.
Danny

-- Peter Prettenhofer
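Peter's second workaround (shrink the dense features' magnitude) can be sketched as follows; the synthetic data is an assumption, and the choice of MaxAbsScaler (which preserves sparsity) is mine, not from the thread:

```python
# Mixed sparse/dense features for SGD: scale the large-magnitude dense
# column down without destroying sparsity. Data is synthetic.
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.RandomState(0)

# mostly sparse indicator-style features plus one dense, large-magnitude column
X_sparse = sp.random(300, 50, density=0.05, format="csr", random_state=rng)
dense_col = sp.csr_matrix(rng.uniform(0.0, 1000.0, size=(300, 1)))
X = sp.hstack([X_sparse, dense_col], format="csr")
y = rng.randint(0, 2, size=300)

# MaxAbsScaler divides each column by its max absolute value, so the
# dense column lands in [0, 1] and zeros stay zeros
X_scaled = MaxAbsScaler().fit_transform(X)
clf = SGDClassifier(random_state=0).fit(X_scaled, y)
```

After scaling, no single feature dominates the gradient updates, which is the same effect the intercept decay is trying to achieve for the (dense, always-active) intercept.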
Re: [Scikit-learn-general] Confidence score for each prediction from regressor
Hi Yogesh,

one of the few regressors in sklearn that supports this is GaussianProcess, but that won't scale to your problem. An alternative is to use a GradientBoostingRegressor with quantile loss to generate prediction intervals (see [1]) -- only for the keen; I once used it, unsuccessfully, in a Kaggle competition. It's not a confidence score, though -- it can only tell you whether a prediction falls within a band. Maybe one can derive a confidence score from Random Forests... I remember reading something along those lines in this survey [2].

best, Peter

[1] http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
[2] http://research.microsoft.com/apps/pubs/default.aspx?id=12

2014-07-22 19:52 GMT+02:00 Yogesh Pandit yogesh...@gmail.com:
Hello, I am working with regressors (sklearn.ensemble). The shape of my test data is (1121280, 452). I am wondering how I can associate a confidence score with the prediction for each sample from my test data. Any suggestions would be helpful.
Thank you, -Yogesh

-- Peter Prettenhofer
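The quantile-loss approach in [1] boils down to fitting one model per quantile; a hedged sketch on synthetic data (all values illustrative):

```python
# Prediction intervals via GradientBoostingRegressor with quantile loss,
# as in the example linked above. Data and parameters are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# one model per quantile; together they bound a (roughly) 90% prediction interval
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05,
                                  random_state=0).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95,
                                  random_state=0).fit(X, y)

X_new = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
band = np.c_[lower.predict(X_new), upper.predict(X_new)]  # [lower, upper] per row
```

The band width varies with the local noise level, which is what distinguishes this from a constant-width error bar, but as Peter notes it is an interval, not a per-sample confidence score.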
Re: [Scikit-learn-general] Confidence score for each prediction from regressor
I might be wrong, but it seems Mathieu is working on something similar for Ridge here: https://github.com/scikit-learn/scikit-learn/pull/3417

-- Peter Prettenhofer
Re: [Scikit-learn-general] scikit-learn 0.15.0 is out \o/
great work guys - thanks!

2014-07-15 13:18 GMT+02:00 Satrajit Ghosh sa...@mit.edu:
congrats all!
cheers, satra

On Tue, Jul 15, 2014 at 7:13 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
http://scikit-learn.org/stable/whats_new.html
Plenty of wheel packages on PyPI, and people rejoice :)
Thanks to all for your contributions!
I know the website is half incorrect (especially the 0.14/ directory, which has the 0.15 content). I screwed up again with rsync and symlinks. I am rebuilding a clean doc at the moment.
Best,
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
Re: [Scikit-learn-general] Getting decision tree regressor to predict using median not mean, of final subset
Hi James,

if you look at the LAD loss function in the gradient_boosting module, you can find an example of how to do it. Basically, you need to update the values array in the Tree extension type. Tree.apply(X_train) gives you the leaf that each training instance falls into.

HTH, Peter

Am 23.06.2014 13:48 schrieb James McMurray jamesmc...@gmail.com:
Hi, I want the decision tree regressor to predict using the median of the resulting subset from the tree, rather than the mean. Is there a simple way to do this? I looked at the code, but in sklearn/tree/tree.py the only relevant line is:

proba = self.tree_.predict(X)

where the prediction is already done (presumably in the Cython code). I don't have experience with Cython, so I'm not sure how to modify _tree.pyx to do this.
Many thanks, James McMurray
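Rather than patching the Cython values array in place, the same effect can be had from pure Python with `apply`: compute the median of the training targets in each leaf, then look it up at prediction time. A hedged sketch (dataset and tree depth are illustrative assumptions):

```python
# Median-per-leaf prediction for a regression tree, using the public
# `apply` API instead of modifying _tree.pyx. Dataset is illustrative.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# leaf id of every training sample, then the median target per leaf
train_leaves = reg.apply(X)
leaf_median = {leaf: np.median(y[train_leaves == leaf])
               for leaf in np.unique(train_leaves)}

def predict_median(tree, X_new):
    """Predict with the median (not the mean) of each sample's leaf."""
    return np.array([leaf_median[leaf] for leaf in tree.apply(X_new)])

pred = predict_median(reg, X)
```

The tree is still *grown* with the usual mean-based (MSE) criterion; only the leaf values are replaced at prediction time, which is exactly what the LAD loss in the gradient_boosting module does after each stage.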
Re: [Scikit-learn-general] My talk was approved for EuroScipy'14
congrats Gilles -- looking forward to your talk -- you should definitely make a blog post from your material (and benchmarks)!

2014-05-22 8:50 GMT+02:00 Vlad Niculae zephy...@gmail.com:
This is great news, congratulations Gilles!
Cheers, Vlad

On May 22, 2014 8:15 AM, Gilles Louppe g.lou...@gmail.com wrote:
Hi folks,
Just to let you know, my talk "Accelerating Random Forests in Scikit-Learn" was approved for EuroScipy'14. Details can be found at https://www.euroscipy.org/2014/schedule/presentation/9/. My slides are far from ready, but my intention is to present our team's efforts on the tree and ensemble modules, including along the way some of the lessons we have learned. In particular, I would like to thank @pprett, @arjoly, @larsmans, @ogrisel and @jnothman, who have contributed a lot these last months to improve these modules. Thanks guys!
Cheers, Gilles

-- Peter Prettenhofer
Re: [Scikit-learn-general] RandomForestClassifier w/ IPython.parallel
Hi Alessandro,

you might want to look into this presentation by Olivier: https://speakerdeck.com/ogrisel/growing-randomized-trees-in-the-cloud-1 -- it should be pretty much what you need. Code is here: https://github.com/pydata/pyrallel.

best, Peter

2014-02-07 23:28 GMT+01:00 Alessandro Gagliardi alessandro.gaglia...@glassdoor.com:
Hi All, I want to fit a large sklearn.ensemble.RandomForestClassifier (with maybe dozens or hundreds of trees and 100,000 samples). My desktop won't handle this, so I want to try using StarCluster. RandomForestClassifier seems to parallelize easily, but I don't know how I would split it across many IPython.parallel engines (if that's even possible). (Or maybe I should forgo IPython.parallel and use MPI?) Any help would be greatly appreciated.
Thanks, Alessandro Gagliardi | Glassdoor | alessan...@glassdoor.com

-- Peter Prettenhofer
Re: [Scikit-learn-general] joblib dump compression
Awesome - thanks guys! @Gael: I'll look into the single-file storage and submit a PR.

2014-02-02 Olivier Grisel olivier.gri...@ensta.org:
I recently contributed a fix to numpy master (to be part of numpy 1.9.0) to use the nditer API to stream buffers to non-'file' file objects: https://github.com/numpy/numpy/pull/4077
That should make it possible to refactor joblib to stream pickled data to GzipFile instances, or to use the zlib.compressobj API to do a single-file compressed joblib.dump without a memory copy. I had ongoing work to fix that issue, tracked at https://github.com/joblib/joblib/issues/66, but I had to stop to work on getting the threading backend into sklearn first. I plan to resume work on joblib/joblib#66 soonish (after Strata and the sklearn 0.15 release). There is also this PR, which is probably related (although I have not reviewed it in detail yet): https://github.com/joblib/joblib/pull/115
--
Olivier

-- Peter Prettenhofer
[Scikit-learn-general] joblib dump compression
Hi list,

sorry, but I didn't find a dedicated joblib mailing list, and since most of the joblib contributors hang around here I thought I'd give it a shot.

I'm using joblib to dump scikit-learn RF models. When using compression, is the output always guaranteed to be stored in a single file? I looked at the source and it seems to be this way, but there might be a corner case if the size of the object is too large?

thanks,
Peter

--
Peter Prettenhofer
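For reference, with current joblib the compressed case can be exercised directly: ``joblib.dump`` returns the list of file names it wrote, so the single-file behavior is easy to check (the file name and the stand-in "model" below are illustrative):

```python
import os
import tempfile

import joblib
import numpy as np

# A stand-in for a fitted RF model: a dict holding a large numpy array.
model = {"coef": np.arange(100_000, dtype=np.float64)}

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
written = joblib.dump(model, path, compress=3)  # zlib compression, level 3

print(len(written))  # number of files written; with compression, one file

restored = joblib.load(path)
```

In early joblib versions large arrays could spill into companion ``.npy`` files; inspecting the returned list is the quickest way to see what a given version actually does.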
Re: [Scikit-learn-general] Scikit-Learn for android
The structure of most learned models is pretty simple (e.g. linear models or decision tree ensembles). A linear classifier for text classification can simply be converted into a Python dictionary where the keys are terms and the values are the coefficients (``coef_``) of the linear classifier; using sparse regularization (L1) helps a lot to keep memory requirements low. Decision trees can be translated into a series of if-then-else statements that can be eval'ed (if you are brave).

best,
Peter

2014/1/20 Joel Nothman joel.noth...@gmail.com:

Do you have any specific use case in mind for running scikit-learn on Android? Maybe an interesting and more useful project instead would be to implement PMML (Predictive Model Markup Language) exporters.

Yes, I thought in this direction too (although last time I looked at PMML I got scared off). Most of the time you just want a model that can be trained offline and deployed on Android. I'm sure there are cases where an Android app will want to perform learning online, but it might be more sensible for the statistics to be collected on the Android device and pushed to a server for modelling.

On 20 January 2014 11:37, Vlad Niculae zephy...@gmail.com wrote:

I don't think Weka (at least the interesting parts of it) could run on Android either. I don't really foresee the whole Scipy stack running on Android; maybe one day when all dependencies are rewritten in PyPy and are faster and still 100% compatible... One thing that would be possible (but I don't know whether it would be useful for any appliers) would be to implement a prediction-only library, so you could develop models on your PC or in the cloud, download the pickled estimator and deploy it. However, I think people who need to do this end up writing a whole custom predictor, as it'd be more efficient. Do you have any specific use case in mind for running scikit-learn on Android?
Maybe an interesting and more useful project instead would be to implement PMML (Predictive Model Markup Language) exporters.

My 2c,
Vlad

On Mon Jan 20 00:24:16 2014, Olivier Grisel wrote:

2014/1/20 Tejas Nikumbh tejasniku...@gmail.com:
Hi guys, is there a way we can utilise scikit-learn in Android-based projects?

AFAIK, no.

If not, does this sound like a good idea for a project [possibly a GSoC project]? What might be the hurdles associated?

Trying to build scipy and its Fortran build and runtime dependencies on Android is going to be fun :)

--
Peter Prettenhofer
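Peter's dictionary-export idea from the top of the thread can be sketched as follows. The toy corpus and the ``predict_offline`` helper are illustrative, but ``coef_`` and ``vocabulary_`` are real scikit-learn attributes; only the dict and the intercept would need to ship to the device:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["good movie", "bad movie", "great film", "awful film", "good film"]
labels = [1, 0, 1, 0, 1]

vec = CountVectorizer()
clf = LogisticRegression(C=10.0).fit(vec.fit_transform(docs), labels)

# Export: term -> coefficient, dropping exact zeros. With an L1 penalty
# most coefficients would be exactly zero, shrinking the dict further.
weights = {term: float(clf.coef_[0, idx])
           for term, idx in vec.vocabulary_.items()
           if clf.coef_[0, idx] != 0.0}
intercept = float(clf.intercept_[0])

def predict_offline(text):
    # Pure-Python scoring: no numpy/scipy needed on the device.
    score = intercept + sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return 1 if score > 0 else 0
```

Because the scoring is a plain dot product over token counts, the offline predictions match ``clf.predict`` exactly whenever the device-side tokenization matches the vectorizer's.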
Re: [Scikit-learn-general] Releasing joblib 0.8a
Actually, I'd propose to turn off multiprocessing at prediction time; this might backfire quite easily.

2013/12/20 Olivier Grisel olivier.gri...@ensta.org:

2013/12/20 Vlad Niculae zephy...@gmail.com:
Works exactly as you described on my machine (which doesn't mean much because it's relatively close to yours, but I am just too enthusiastic about this not to reply! \o/). Memory usage is as expected. I see a speedup in train time but a slight slowdown in test time (1.7 vs 1.0); is that expected, or probably an artefact?

Threading is not (yet) used at test time, as the Cython code backing the predict method would need to be refactored to release the GIL to make threading efficient. So the performance decrease you observe might be caused by the new automated memmapping feature that dumps large arrays to share memory with the worker processes when the multiprocessing backend is used. Currently the threshold to trigger the automated memmapping is set to arrays of 1MB or larger. Maybe this is too small and we should trigger it only for arrays larger than 100MB, for instance. How big is the data array in your case? Is this the covertype benchmark?

--
Olivier

--
Peter Prettenhofer
Re: [Scikit-learn-general] Defining a custom correlation kernel for GaussianProcess in the form K(x, x')
Hi Ralf,

unfortunately, I cannot answer your question, but it would indeed be very valuable to allow custom correlation functions.

best,
Peter

2013/12/9 Ralf Gunter ralfgun...@gmail.com:

Hi all,
We're trying to use a custom correlation kernel with GP in the usual form K(x, x'). However, looking at the built-in correlation models (and how they're used by gaussian_process.py), it seems sklearn only takes models of the form K(theta, dx). There may very well be a reformulation of our K that depends only on (x - x'), but if so it would probably be highly non-trivial, as it depends on e.g. modified spherical Bessel functions evaluated at a scaled product of the xs. Is there any way to have the GP module take our kernel without modifying the GP code? I apologize if this has been asked/answered before; some searching on Google only led me to models that also depend only on (x - x').
Thanks!

--
Peter Prettenhofer
Re: [Scikit-learn-general] Decision tree nodes labels
Hi Caleb,

you need to extract the path from the decision tree structure ``DecisionTreeClassifier.tree_``. Take a look at the attributes ``children_left`` and ``children_right``; these encode the parent-child relationship. Extracting the path is very similar to finding the leaf node; you just need to keep track of the choices you made along the way. Just modify ``sklearn.tree._tree.Tree.apply`` [1] accordingly.

best,
Peter

[1] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1907

2013/12/8 Caleb cloverev...@yahoo.com:

Hi everyone,
Given an instance (x_1, x_2, ..., x_n), I want to know what about it makes the decision tree assign it to a certain class, i.e. something like "x_1 > a, x_3 < b, ... => x is of class C". I notice that .apply can return the id of the leaf node that the instance falls in, but can I get the path from the root node down to this leaf node? Any idea?
- Caleb
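The traversal Peter describes can also be sketched in pure Python on top of the public ``tree_`` arrays, without touching the Cython ``apply``. The dataset and helper name below are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = clf.tree_

def decision_path(x):
    """Walk root -> leaf; return (leaf_id, [(feature, threshold, went_left)])."""
    node, path = 0, []
    while t.children_left[node] != -1:  # -1 marks a leaf in both child arrays
        feature, threshold = t.feature[node], t.threshold[node]
        went_left = x[feature] <= threshold  # sklearn sends "<= threshold" left
        path.append((int(feature), float(threshold), bool(went_left)))
        node = t.children_left[node] if went_left else t.children_right[node]
    return node, path

leaf, path = decision_path(X[0])
print(leaf == clf.apply(X[:1])[0])  # True: same leaf as the built-in apply
```

Each ``(feature, threshold, went_left)`` triple is exactly one "x_i <= a" or "x_i > a" condition on the path, so the rule Caleb wants falls out of ``path`` directly.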
Re: [Scikit-learn-general] Spark-backed implementations of scikit-learn estimators
Great news; looking forward to the outcome of the sprint!

2013/12/4 Olivier Grisel olivier.gri...@ensta.org:

I meant San Francisco...

--
Olivier

--
Peter Prettenhofer
[Scikit-learn-general] Array memory layout and slicing
Hi all,

I'm currently modifying our tree code so that it runs on both Fortran- and C-contiguous arrays. After some benchmarking I became aware of the following numpy behavior, which was contrary to what I was expecting::

    >>> X = ...  # some feature matrix
    >>> X = np.asfortranarray(X)
    >>> X.flags.f_contiguous
    True
    >>> # so far so good
    >>> X_train = X[:1000]
    >>> X_train.flags.f_contiguous
    False
    >>> X_train.flags.c_contiguous
    False
    >>> # damn - seems like a view is neither C nor Fortran contiguous
    >>> X_train = X_train.copy()  # let's materialize the view
    >>> X_train.flags.f_contiguous
    False
    >>> X_train.flags.c_contiguous
    True

In the tree code, I check whether an array is contiguous; if not, I call ``np.asarray`` and set the ``order`` according to ``flags.f_contiguous`` or ``flags.c_contiguous``. However, in the case of views that does not work. How would you handle this case?

thanks,
Peter

--
Peter Prettenhofer
Re: [Scikit-learn-general] Array memory layout and slicing
2013/11/26 Olivier Grisel olivier.gri...@ensta.org:

2013/11/26 Peter Prettenhofer peter.prettenho...@gmail.com:
[question about f_contiguous views quoted above snipped]

Only if you slice the rows of a Fortran-aligned 2D array; this is expected. If you slice the rows of a C-contiguous 2D array or the columns of an F-contiguous 2D array, it stays contiguous.

Actually, now that I think about it, it totally makes sense -- next time I'll think before I write ;-) thanks guys!

::

    >>> import numpy as np
    >>> a_c = np.arange(12).reshape(3, 4)
    >>> a_f = np.asfortranarray(a_c)
    >>> a_c.flags
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False
    >>> a_f.flags
      C_CONTIGUOUS : False
      F_CONTIGUOUS : True
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False
    >>> a_c[1:].flags
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False
    >>> a_f[:, 1:].flags
      C_CONTIGUOUS : False
      F_CONTIGUOUS : True
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

--
Peter Prettenhofer
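The case Peter ran into, and the fix his tree code needs, can be condensed as follows (array shape illustrative): a row slice of an F-ordered 2D array carries neither contiguity flag, and ``np.asarray`` with an explicit ``order`` materializes a contiguous copy only when one is needed:

```python
import numpy as np

X = np.asfortranarray(np.arange(12.0).reshape(3, 4))
view = X[:2]  # row slice of an F-ordered 2D array

# The view is neither C- nor F-contiguous: the row stride still reflects
# the 3-row parent, so no contiguous layout matches.
print(view.flags.f_contiguous, view.flags.c_contiguous)  # False False

# Requesting an explicit order handles the view case: copy iff necessary.
X_train = np.asarray(view, order="F")
print(X_train.flags.f_contiguous)  # True
```

So instead of branching on the (possibly both-False) flags of the input, the caller can simply state the order the Cython code requires.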
Re: [Scikit-learn-general] GradientBoostingRegressor with huber-loss and subsampling
Hi Johannes,

the bug was fixed recently; please use master until the 0.15 release is out.

Best,
Peter

Am 19.11.2013 16:33 schrieb hannithebunny hannithebu...@hotmail.de:

Hi,
in previous versions of scikit-learn I used GradientBoostingRegressor with the parameters loss='huber' and subsample=0.8. After updating sklearn to version 0.14.1, I can use the 'huber' loss function only if subsample=1.0. For e.g. subsample=0.8 the error message below is displayed::

    >>> reg = GradientBoostingRegressor(loss='huber', subsample=0.8)
    >>> reg.fit(X, y)
    Traceback (most recent call last):
      File "C:\Users\xxx\GradientTreeRegressor.py", line 109, in <module>
        reg.fit(X, y)
      File "C:\Python27\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1126, in fit
        return super(GradientBoostingRegressor, self).fit(X, y)
      File "C:\Python27\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 609, in fit
        y_pred[~sample_mask])
      File "C:\Python27\Lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 253, in __call__
        gamma = self.gamma
    AttributeError: 'HuberLossFunction' object has no attribute 'gamma'

Any help? Thanks and best regards,
Johannes
Re: [Scikit-learn-general] Benchmarking non-negative least squares solvers, work in progress
SGDClassifier adopted the parameter names of ElasticNet (which has been around in sklearn for longer) for consistency reasons. I agree that we should strive for concise and intuitive parameter names such as ``l1_ratio``. Naming in sklearn is actually quite unfortunate, since the popular R package glmnet uses ``alpha`` for the ``l1_ratio``...

2013/11/8 Thomas Unterthiner thomas.unterthi...@gmx.net:

Just my $0.02 as a user: I was also confused/put off by `alpha` and `l1_ratio` when I first explored SGDClassifier. I found those names to be pretty inconsistent, plus I tend to call my regularization parameters `lambda` and use `alpha` for learning rates. I'm sure other people associate yet other meanings with alpha, or use other names for the regularization parameter. `l1_reg`/`l2_reg` would be much better, more concise names; it would be nice if those could be used throughout sklearn.
Cheers,
Thomas

On 2013-11-08 09:20, Vlad Niculae wrote:

Re: the discussion we had at PyCon.fr, I noticed that the internal elastic net coordinate descent functions are parametrized with `l1_reg` and `l2_reg`, but the exposed classes and functions have `alpha` and `l1_ratio`. Only yesterday there was somebody on IRC who couldn't match Ridge with ElasticNet because of this parametrization.

On Fri, Nov 8, 2013 at 9:02 AM, Olivier Grisel olivier.gri...@ensta.org wrote:

About the LBFGS-B residuals (non-)issue: I was probably confused by the overlapping curves on the plot and misinterpreted the location of the PG-l1 and PG-l2 curves.

--
Olivier

--
Peter Prettenhofer
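For anyone bitten by the parametrization Vlad mentions: the exposed ``alpha``/``l1_ratio`` pair and an ``l1_reg``/``l2_reg`` pair are related by a simple bijection (glossing over the 1/2 factor on the l2 term and the per-sample scaling the internal solver applies). A small illustrative converter:

```python
def to_regs(alpha, l1_ratio):
    """(alpha, l1_ratio) -> (l1_reg, l2_reg): split one strength by the mix ratio."""
    return alpha * l1_ratio, alpha * (1.0 - l1_ratio)

def to_alpha_ratio(l1_reg, l2_reg):
    """(l1_reg, l2_reg) -> (alpha, l1_ratio): total strength and l1 fraction."""
    alpha = l1_reg + l2_reg
    return alpha, (l1_reg / alpha if alpha else 0.0)

print(to_regs(1.0, 0.5))         # (0.5, 0.5)
print(to_alpha_ratio(0.5, 0.5))  # (1.0, 0.5)
# l1_ratio=0.0 is pure l2 (Ridge-like), l1_ratio=1.0 is pure l1 (Lasso-like)
```

This also makes the glmnet clash concrete: glmnet's ``alpha`` plays the role of ``l1_ratio`` here, while its ``lambda`` plays the role of sklearn's ``alpha``.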
Re: [Scikit-learn-general] LambdaMART implementation and gbm comparison
Hi Jacques,

very exciting; this was on my wish list for quite a while. Maybe we should start by creating a PR upfront so that we can discuss things there; better than using the mailing list (quite a lot of traffic already). The most important part of adding LambdaMART to sklearn is fleshing out an API for learning-to-rank problems (i.e. we need to group samples by query id); based on past experience this will take a while ;-). We should sync with Mathieu, Olivier, and Fabian; if I remember correctly, we discussed this a while ago.

I've been reading through the GBM code lately to look at their best-first tree-building heuristic (again), so we can definitely share experience there; the source code is sometimes a bit verbose... We should definitely take a look at RankLib; it seems to be doing pretty well here [1]. Otherwise, I too bench against gbm, since it is IMHO the reference implementation of GBRT and a pretty good one at that. IMHO part of the success of certain ML methods stems from the availability of high-quality implementations; gbm definitely counts as one, libsvm/liblinear too.

[1] http://www.kaggle.com/c/expedia-personalized-sort/forums/t/6228/my-approach

best,
Peter

PS: Lucas Eustaquio pointed me to a Python LambdaMART implementation that uses sklearn.tree.DecisionTreeRegressor: https://github.com/discobot/LambdaMart/blob/acb8329ab63a45d2bcb43055fa54f14b8c6725c1/mart.py

2013/11/6 Jacques Kvam jwk...@gmail.com:

Hello scikit-learn,

I recently wrote up an implementation of the LambdaMART algorithm on top of the existing gradient boosting code (thanks for the great base of code to work with, btw). It currently only supports NDCG, but it would be easy to generalize. That's kind of beside the point, however. Before I even think about putting together a PR, I wanted to compare it against the gbm package. I'm aware of Java implementations like jforest and RankLib, but gbm's interface seems closest to sklearn's, so that's what I want to use.

Unfortunately, whenever I try to use NDCG it segfaults on me, or I get an error in split.default, depending on where I specify the group variable. I realize this isn't an R list, but I was hoping someone could shed some light for me. I'm using the supervised MQ2007 and MQ2008 datasets from https://research.microsoft.com/en-us/um/beijing/projects/letor//letor4download.aspx and my test code is here: https://gist.github.com/jwkvam/7332448. I simply use Python to transform the given train.txt file into a csv so I can load it in R. I'm using gbm 2.1 and I've tried R 2.15.3 and 3.0.2. Alternatively, can I easily transform my gbm.fit() call to use the gbm() interface? Sorry, I'm kind of a newbie when it comes to R.

I saw there's also this standing issue, but it doesn't look like there's been a lot of movement on it: https://code.google.com/p/gradientboostedmodels/issues/detail?id=28q=pairwise

Thanks,
Jacques
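Since NDCG is the one metric Jacques's implementation supports so far, a small reference implementation is handy for cross-checking against gbm or RankLib. This sketch uses the gain/discount convention common in LETOR-style evaluation (gain 2^rel - 1, discount log2(rank + 1)); helper names are illustrative:

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked relevance list, truncated at k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # ranks 1..k discount by log2(rank + 1), i.e. log2 of 2, 3, ...
    return float(np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2))))

def ndcg_at_k(y_true, y_score, k=10):
    """NDCG@k for one query: DCG of the score-induced ranking over the ideal DCG."""
    order = np.argsort(y_score)[::-1]           # ranking induced by the scores
    dcg = dcg_at_k(np.asarray(y_true)[order], k)
    ideal = dcg_at_k(np.sort(y_true)[::-1], k)  # best possible ordering
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0], [0.9, 0.7, 0.4, 0.1], k=4))  # 1.0 (perfect ranking)
```

A full learning-to-rank score would average this per-query value over all query ids, which is exactly where the grouping API Peter mentions comes in.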
Re: [Scikit-learn-general] release time
Given that snow will arrive late, I too should be able to get some stuff done. I want to get #2570 to MRG within one week so that we have plenty of time to review and tweak. Furthermore, I want to have a look at supporting different dtypes for SGD. @Olivier: I will team up with you on reviewing MARS.

best,
Peter

2013/11/6 Lars Buitinck larsm...@gmail.com:

2013/11/6 Olivier Grisel olivier.gri...@ensta.org:

I can help prepare the release by going through the open issues and pull requests on GitHub and making a summary next week. All three PRs highlighted by Gilles seem very important to me. I started reading the ESLII chapter on MARS to help with the review of the PR (I got interrupted by 2 conferences but will resume soon :). As for the timing of the release, I have no strong opinion. Let's target the end of the year for a start and decide later if we need to shift the release date to January.

I have time in the second half of December.

--
Peter Prettenhofer
Re: [Scikit-learn-general] SGDRegressor.sparsify() = ValueError: dimension mismatch
Hi Eustache,

that's quite a bug; thanks for reporting. I fixed it and added a sparsify test to test_common.py, pushed directly to master.

thanks,
Peter

2013/11/4 Eustache DIEMERT eusta...@diemert.fr:

Hi List,

I'm currently working on some performance documentation [1] and I wanted to micro-benchmark the dense vs. sparse coefficients case. I created a self-contained script and wanted to bench it using line_profiler, but it seems that after the call to `sparsify()` my SGDRegressor can't predict anymore (it crashes with a dimension mismatch error). Here is a gist to reproduce it: [2]. The weird thing is that the coef_ attribute changes shape after the call to sparsify: (30,) -> (1, 30), where 30 equals n_features in my case. Any idea or explanation welcome!

[1] https://github.com/scikit-learn/scikit-learn/pull/2488
[2] https://gist.github.com/oddskool/7300982

PS: The stack trace::

    Traceback (most recent call last):
      File "/usr/local/bin/kernprof.py", line 233, in <module>
        sys.exit(main(sys.argv))
      File "/usr/local/bin/kernprof.py", line 221, in main
        execfile(script_file, ns, ns)
      File "sparsity_benchmark.py", line 52, in <module>
        score(y_test, clf.predict(X_test), 'sparse model')
      File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 903, in predict
        return self.decision_function(X)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 888, in decision_function
        scores = safe_sparse_dot(X, self.coef_) + self.intercept_
      File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.py", line 190, in safe_sparse_dot
        ret = a * b
      File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 311, in __rmul__
        return (self.transpose() * tr).transpose()
      File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 278, in __mul__
        raise ValueError('dimension mismatch')
    ValueError: dimension mismatch

--
Peter Prettenhofer
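For reference, the intended behavior of ``sparsify()``, the one the fix restores, is that ``coef_`` becomes a scipy sparse matrix while predictions are unchanged. A quick check on illustrative toy data:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 30)
y = X @ rng.rand(30)

reg = SGDRegressor(penalty="l1", random_state=0).fit(X, y)
dense_pred = reg.predict(X)

reg.sparsify()  # converts coef_ to a scipy sparse (CSR) matrix in place
sparse_pred = reg.predict(X)

print(sp.issparse(reg.coef_))  # True
```

With an L1 penalty many entries of ``coef_`` are exactly zero, so the sparse representation is also the memory win the benchmark in [1] is after.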
Re: [Scikit-learn-general] C integer types: the missing manual
Hi Lars,

thanks heaps! You should post this to the Planet SciPy RSS feed; I'm sure many people share(d) my confusion about the topic.

best,
Peter

2013/10/23 Lars Buitinck larsm...@gmail.com:

Dear all,
I promised some time ago to write a guideline for using C integer types in Cython code. Here's a start; it is currently on the wiki instead of in a PR because of its rough state.
https://github.com/scikit-learn/scikit-learn/wiki/C-integer-types:-the-missing-manual
Regards,
Lars

--
Peter Prettenhofer
Re: [Scikit-learn-general] C integer types: the missing manual
On the website it says: "To ask for your feed to be added to the planet, email Gael Varoquaux".

2013/10/23 Lars Buitinck larsm...@gmail.com:

2013/10/23 Peter Prettenhofer peter.prettenho...@gmail.com:
You should post this to the Planet SciPy RSS feed - I'm sure many people share(d) my confusion about the topic.

How does that work?

--
Peter Prettenhofer
Re: [Scikit-learn-general] GradientBoostingRegressor with LogisticRegression
Hi Attila, please use the following adaptor::

    def __init__(self, est):
        self.est = est

    def predict(self, X):
        return self.est.predict_proba(X)

    def fit(self, X, y):
        self.est.fit(X, y)

The one in the stackoverflow question returns an array of shape (n_samples,) but it should rather be (n_samples, n_classes). PS: I still need to fix the init issue but any solution will most likely make the GBRT slower at prediction time (especially for single instance prediction). best, Peter

2013/10/22 Attila Balogh attila.bal...@gmail.com Hi all, first of all thanks to all the developers for working on scikit-learn, it is a wonderful library. I have been struggling for a while now with the following problem: trying to use GBR with LR as a BaseEstimator, I'm getting the following error::

    File main.py, line 110, in main
      score = np.mean(cross_validation.cross_val_score(rd, X, y, cv=4, scoring='roc_auc'))
    File C:\Python27\lib\site-packages\sklearn\cross_validation.py, line 1152, in cross_val_score
      for train, test in cv)
    File C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py, line 517, in __call__
      self.dispatch(function, args, kwargs)
    File C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py, line 312, in dispatch
      job = ImmediateApply(func, args, kwargs)
    File C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py, line 136, in __init__
      self.results = func(*args, **kwargs)
    File C:\Python27\lib\site-packages\sklearn\cross_validation.py, line 1060, in _cross_val_score
      estimator.fit(X_train, y_train, **fit_params)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 890, in fit
      return super(GradientBoostingClassifier, self).fit(X, y)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 613, in fit
      random_state)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 486, in _fit_stage
      sample_mask, self.learning_rate, k=k)
    File C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py, line 172, in update_terminal_regions
      y_pred[:, k])
    IndexError: too many indices

I have found a similar problem on stackoverflow (http://stackoverflow.com/questions/17454139/gradientboostingclassifier-with-a-baseestimator-in-scikit-learn) and tried to implement the adaptor but it didn't help, the error remained the same. Does anyone have any ideas how to resolve this? Cheers; Attila

-- Peter Prettenhofer
Re: [Scikit-learn-general] GradientBoostingRegressor with LogisticRegression
Right, I thought you were using the multi-class loss function. Please send me a testcase so that I can investigate the issue. thanks, Peter

2013/10/22 Attila Balogh attila.bal...@gmail.com Hi Peter, thanks for your answer. I have tried this before also, and the problem is that in this case I get ValueError: operands could not be broadcast together with shapes (74) (148), because the y array is raveled and it has shape (74,2). Do you need a self-contained testcase which reproduces this error? Cheers; Attila

-- Peter Prettenhofer
Re: [Scikit-learn-general] GradientBoostingRegressor with LogisticRegression
Ok, below is the adaptor that will work. The code requires that the output of predict is 2d. Thanks for the test-case. best, Peter ::

    class Adaptor(object):
        def __init__(self, est):
            self.est = est

        def predict(self, X):
            return self.est.predict_proba(X)[:, np.newaxis]

        def fit(self, X, y):
            self.est.fit(X, y)

2013/10/22 Peter Prettenhofer peter.prettenho...@gmail.com Right, I thought you were using the multi-class loss function. Please send me a testcase so that I can investigate the issue. thanks, Peter

-- Peter Prettenhofer
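The adaptor pattern above can be exercised without scikit-learn installed. The sketch below is illustrative only: ``DummyProba`` is a hypothetical stand-in for LogisticRegression, and ``InitAdaptor`` mirrors the Adaptor from this thread. The key point is that the init estimator's ``predict`` must return a 2-d, (n_samples, n_classes)-shaped result.

```python
# Minimal sketch of the adaptor idea from this thread, using a stand-in
# estimator instead of sklearn's LogisticRegression so it runs standalone.

class DummyProba:
    """Hypothetical stand-in for a probabilistic classifier."""
    def fit(self, X, y):
        # memorize the positive-class rate; enough for the shape demo
        self.p = sum(y) / float(len(y))
        return self

    def predict_proba(self, X):
        # (n_samples, n_classes) list-of-lists, like sklearn would return
        return [[1.0 - self.p, self.p] for _ in X]

class InitAdaptor:
    """Wraps an estimator so its `predict` output is 2-d."""
    def __init__(self, est):
        self.est = est

    def fit(self, X, y):
        self.est.fit(X, y)
        return self

    def predict(self, X):
        return self.est.predict_proba(X)

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
adaptor = InitAdaptor(DummyProba()).fit(X, y)
pred = adaptor.predict(X)
print(len(pred), len(pred[0]))  # 4 2
```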
Re: [Scikit-learn-general] linear_model.SGDClassifier(): ValueError: ndarray is not C-contiguous when calling partial_fit()
great - thanks Lars - will prepare a PR

2013/10/9 Lars Buitinck larsm...@gmail.com 2013/10/8 Peter Prettenhofer peter.prettenho...@gmail.com: that's a bug - I'll open a ticket for it. A quick fix: call partial_fit instead of fit just before the ``for`` loop. Peter, is this due to an optimization that turns coef_ into a Fortran-ordered array? If so, I don't think we need it any longer with NumPy 1.7 and the new sklearn.extmath.fast_dot::

    In [1]: X = np.random.randn(1, 200)
    In [2]: Y = np.random.randn(200, 70)
    In [3]: %timeit np.dot(X, Y)
    100 loops, best of 3: 16.5 ms per loop
    In [4]: Yf = asfortranarray(Y)
    In [5]: %timeit np.dot(X, Yf)
    100 loops, best of 3: 16.7 ms per loop
    In [6]: numpy.__version__
    Out[6]: '1.7.1'

-- Peter Prettenhofer
Re: [Scikit-learn-general] linear_model.SGDClassifier(): ValueError: ndarray is not C-contiguous when calling partial_fit()
Hi Tom, that's a bug - I'll open a ticket for it. A quick fix: call partial_fit instead of fit just before the ``for`` loop. - Peter

2013/10/4 Tom Kenter tom.ken...@uva.nl Dear all, I am trying to run a linear_model.SGDClassifier() and have it update after every example it classifies. My code works for a small feature file (10 features), but when I give it a bigger feature file (some 8 features, but very sparse) it keeps giving me errors straight away, the first time partial_fit() is called. This is what I do in pseudocode::

    X, y = load_svmlight_file(train_file)
    classifier = linear_model.SGDClassifier()
    classifier.fit(X, y)
    for every test_line in test file:
        test_X, test_y = getFeatures(test_line)  # This gives me a Python list for X
                                                 # and an integer label for y
        print "prediction: %f" % classifier.predict([test_X])
        classifier.partial_fit(csr_matrix([test_X]), csr_matrix([Y_GroundTruth]),
                               classes=np.unique(y))

The error I keep getting for the partial_fit() line is::

    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 487, in partial_fit
      coef_init=None, intercept_init=None)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 371, in _partial_fit
      sample_weight=sample_weight, n_iter=n_iter)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 451, in _fit_multiclass
      for i in range(len(self.classes_)))
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py, line 517, in __call__
      self.dispatch(function, args, kwargs)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py, line 312, in dispatch
      job = ImmediateApply(func, args, kwargs)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py, line 136, in __init__
      self.results = func(*args, **kwargs)
    File /datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py, line 284, in fit_binary
      est.power_t, est.t_, intercept_decay)
    File sgd_fast.pyx, line 327, in sklearn.linear_model.sgd_fast.plain_sgd (sklearn/linear_model/sgd_fast.c:7568)
    ValueError: ndarray is not C-contiguous

I also tried feeding partial_fit() Python arrays, or numpy arrays (which are C-contiguous (order='C') by default, I thought), but this gives the same result. The classes attribute is not the problem I think. The same error appears if I leave it out or if I give the right classes in hard code. I do notice that when I print the flags of the coef_ array of the classifier, it says::

    C_CONTIGUOUS : False
    F_CONTIGUOUS : True
    OWNDATA : True
    WRITEABLE : True
    ALIGNED : True
    UPDATEIFCOPY : False

I am sure I am doing something wrong, but really, I don't see what... Any help appreciated! Cheers, Tom

-- Peter Prettenhofer
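The flags Tom prints are the root cause in miniature: coef_ has been laid out in Fortran (column-major) order, while the Cython SGD routine insists on C order. A minimal numpy demonstration of the mismatch and the generic fix (``np.ascontiguousarray``), independent of sklearn internals:

```python
import numpy as np

# Reproduce the flag pattern from the error report: a Fortran-ordered
# 2-d array is F_CONTIGUOUS but not C_CONTIGUOUS.
coef = np.asfortranarray(np.zeros((3, 5)))
print(coef.flags['C_CONTIGUOUS'], coef.flags['F_CONTIGUOUS'])  # False True

# Forcing a C-ordered copy restores the layout the Cython code expects.
coef_c = np.ascontiguousarray(coef)
print(coef_c.flags['C_CONTIGUOUS'])  # True
```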
Re: [Scikit-learn-general] Right place for a time-series focused algorithm?
2013/9/26 Kyle Kastner kastnerk...@gmail.com I had not thought about use inside a Pipeline - though now that you mention it, that seems like the ideal use case for an algorithm like this. Is this the PR you mentioned? https://github.com/scikit-learn/scikit-learn/pull/1454 As far as lagged features transformer - are we talking about rolling statistics? Something similar to pandas rolling_mean, rolling_apply, etc.? I have poorly reimplemented that using ```stride_tricks``` more times than I probably should have... well... I was mostly thinking of fx val at lag_1, fx at lag_2, ... so feature values at previous time steps. I will work up a gist for SAX in the next few days, and post it here. There is a nice demo of turning time-series into bitmaps which I rather like. If I linked the right issue above, I will try to hop in there and catch up on the changes. Resampling in the pipeline also opens the door for very interesting things from a time-series perspective... Kyle On Thu, Sep 26, 2013 at 6:10 AM, Olivier Grisel olivier.gri...@ensta.org wrote: 2013/9/25 Peter Prettenhofer peter.prettenho...@gmail.com: [...] I would start by implementing lagged features transformer as a gist or as an example script to experiment how it would (or not) fit with the current scikit-learn API. We might have a problem though: the current Pipeline tool does not support changing the number of samples in the data, which would probably be required for TS forecasting stuff.
We have a similar issue for resampling transformers (for instance for dealing with class imbalance). We should probably make the Pipeline more flexible first to be able to properly address TS tasks. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
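A lagged-features transformer of the kind discussed here ("feature values at previous time steps") can be sketched in a few lines. The function name and API below are hypothetical, not an existing sklearn interface. Note how the output has fewer rows than the input series - exactly the change in n_samples that the current Pipeline cannot express.

```python
def lagged_features(series, lags):
    """Turn a 1-d series into rows [x[t-lag_1], x[t-lag_2], ...] with target x[t]."""
    max_lag = max(lags)
    X, y = [], []
    for t in range(max_lag, len(series)):
        X.append([series[t - lag] for lag in lags])
        y.append(series[t])
    return X, y

series = [1, 2, 3, 4, 5, 6]
X, y = lagged_features(series, lags=[1, 2])
print(X)  # [[2, 1], [3, 2], [4, 3], [5, 4]]
print(y)  # [3, 4, 5, 6]
```

Six input values yield only four training rows, so a pipeline step wrapping this would have to shrink both X and y in transit.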
Re: [Scikit-learn-general] Right place for a time-series focused algorithm?
Hi Kyle, personally, I'd love to see SAX in sklearn or any other python library that I could easily use with sklearn. We don't have any time-series specific functionality yet (eg. lagged features transformer). So if we choose to add time-series functionality we should also consider the basics. Let's hear what the others say about this. PS: I'd not put it into decomposition but rather feature_extraction.tseries or something along those lines. best, Peter 2013/9/25 Kyle Kastner kastnerk...@gmail.com I have recently been working with time-series data extensively and looking at different ways to model, classify, and predict different types of time-series. One algorithm I have been playing with is called SAX (http://www.cs.ucr.edu/~eamonn/SAX.htm). It is a very straightforward algorithm (basically windowed mean with no overlap, then quantize into M levels), and I have implemented a rough version using numpy. Despite its simplicity, it is shown as being an effective data dependent transform, similar in some ways to the DWT. I think this algorithm would be a nice tie-in to sklearn, which could allow for more of sklearn's algorithms to be used on time-series type data. Also, the algorithm makes very strong claims about indexing massive datasets, finding similarities and outliers, which are all things I am planning to explore in the future. I know that FastICA is under decomposition, and is often seen in a time-series context - would symbolic aggregation fall into the decomposition camp as well? Is sklearn even the right place for this? Kyle
-- Peter Prettenhofer
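The SAX transform Kyle describes (windowed mean with no overlap, then quantize into M levels) is short enough to sketch in plain Python. This is a rough sketch, not the canonical implementation from the linked page; the breakpoints used here are the quartiles of the standard normal distribution, the standard choice for a 4-symbol alphabet after z-normalization.

```python
def sax(series, n_segments, alphabet="abcd"):
    """Symbolic Aggregate approXimation: z-normalize, PAA, then quantize."""
    n = len(series)
    # 1) z-normalize the series
    mean = sum(series) / float(n)
    std = (sum((x - mean) ** 2 for x in series) / n) ** 0.5 or 1.0
    z = [(x - mean) / std for x in series]
    # 2) Piecewise Aggregate Approximation: mean of equal-width windows
    seg = n // n_segments
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(n_segments)]
    # 3) quantize against N(0,1) quartile breakpoints (4 equiprobable bins)
    breakpoints = [-0.6745, 0.0, 0.6745]
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

print(sax([0, 0, 0, 0, 10, 10, 10, 10], 2))  # ad
print(sax(list(range(16)), 4))               # abcd
```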
Re: [Scikit-learn-general] Representing classifiers outside of Python
We don't have a PMML interface yet [1] - so you need to write custom code to extract the internal state of each individual classifier. What do you mean by performance critical (1ms, 1ms)? Do you make predictions per sample or can you buffer samples and make predictions for batches? In general, what kills performance is the overhead of python function calls - it's usually way larger than the actual prediction (which usually happens in C-land). [1] http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language

2013/9/23 Fred Baba fred.b...@gmail.com I'd like to use classifiers trained via sklearn in a real-time, performance critical application. How do I access the internal representation of trained classifiers? For linear classifiers/regressions, I can simply store the coefficients and generate the linear combination myself. For tree regressions, I can use sklearn.tree.export_graphviz. Ideally there would be an export facility for all classifiers (particularly for examining the structure of generated models). Is there a general way to do this?

-- Peter Prettenhofer
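To make Fred's point about linear models concrete: "storing the coefficients and generating the linear combination myself" is just a dot product plus the intercept. The numbers below are made up for illustration; in practice they would come from ``clf.coef_`` and ``clf.intercept_``.

```python
# Recompute a linear classifier's decision function outside sklearn.
def decision_function(coef, intercept, x):
    # dot(coef, x) + intercept, written out in pure Python
    return sum(c * v for c, v in zip(coef, x)) + intercept

coef = [0.5, -1.25, 2.0]   # hypothetical clf.coef_[0]
intercept = 0.1            # hypothetical clf.intercept_[0]
x = [1.0, 2.0, 0.5]

score = decision_function(coef, intercept, x)
label = 1 if score > 0 else 0
print(round(score, 6), label)  # -0.9 0
```

For batch prediction the per-call Python overhead Peter mentions dominates, which is a reason to vectorize this over many samples at once rather than call it per sample.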
Re: [Scikit-learn-general] Selective multiclass
This is strange indeed - since you said you're doing text classification I suppose X is sparse? Which format (csr, csc) and dtype (float64, float32) are you using? The coef matrix is allocated before the sub-processes are forked, so you will need (n_jobs + 1) * 12 GB just for the coefs. The SystemError is quite strange though... I would expect a MemoryError... Lars, do you have any thoughts on this? best, Peter

On 13.08.2013 22:10, A 4rk@gmail.com wrote: I have 64G of memory, so I do not think memory is the issue in this case. If the features are dense, the n_classes many coefficients of n_features are 12gb (if I haven't miss-calculated). - Correct, it occupies about 12.5G If they are for some reason all replicated for all cores, you would get into trouble. - Note that the same is the case with n_jobs=2,3,4; just to clarify, even without using all cores, if structures are replicated per core, the max available should be enough in this case at least (n_jobs=2,3,4), correct?
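The 12 GB figure and the (n_jobs + 1) replication can be sanity-checked with simple arithmetic. The class/feature counts below are hypothetical (the thread does not state them); any pair whose product is around 1.6e9 float64 values gives roughly 12 GiB.

```python
# Memory of a dense float64 coefficient matrix (n_classes x n_features), in GiB.
def coef_memory_gib(n_classes, n_features, itemsize=8):
    return n_classes * n_features * itemsize / 1024.0 ** 3

single = coef_memory_gib(1600, 1000000)   # hypothetical problem size
print(round(single, 2))  # 11.92

# With n_jobs=4, the parent plus 4 forked workers could each hold a copy:
n_jobs = 4
print(round((n_jobs + 1) * single, 1))  # 59.6
```

With 64 GB of RAM, n_jobs=4 would already be at the edge under this model, which fits the out-of-memory symptoms discussed above.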
Re: [Scikit-learn-general] PyStruct 0.1 released
Congrats Andy - looking forward to tinkering with it! On 11.08.2013 19:57, Andreas Mueller amuel...@ais.uni-bonn.de wrote: Hey everybody. I just wanted to spam the ML again and say I just released PyStruct 0.1. It contains structured support vector machines, structured perceptrons and models for multi-label prediction, graph labeling and sequence prediction. There are some examples on the website: http://pystruct.github.io/auto_examples/index.html You can now install it from the cheeseshop: pip install pystruct That should also give you ad3 and pyqpbo. You can then run the tests with nosetests pystruct Thanks to all the people who helped me make that happen :) Feedback, also on installation troubles, is very welcome! Cheers, Andy
Re: [Scikit-learn-general] Pystruct website and mailing list
2013/7/12 Andreas Mueller amuel...@ais.uni-bonn.de On 07/12/2013 01:26 AM, Robert Layton wrote: Structured prediction in sklearn was one of the outcomes from the survey. Would it be a better idea to send people to pystruct, rather than implement it here? I think so. We decided that structured prediction was out of scope for sklearn, right? I tried a simple approach for encoding the inputs - which is basically tuples of nd-arrays for each instance - but I'm not sure that will really scale. I might need custom classes to encode the input. Also, the project moves way faster than sklearn does currently. Rob Zinkov asked me when pystruct will be included in scikit-learn. My answer was: never ;) I think its much better to have it as a separate project - this way you can iron out the API much faster Of course you can try to convince me otherwise once pystruct is more mature, but I think the difference in target group and input format is quite big. Also, the project has a ton of requirements - we are working to make this more manageable but having cvxopt as a hard requirement is probably necessary. About naming it scikit-struct: is there any requirement to become a scikit? Also: is there much benefit - pandas seems to be doing quite well without the brand ;) totally agree Cheers, Andy

-- Peter Prettenhofer
Start your free trial of AppDynamics Pro today! http://pubads.g.doubleclick.net/gampad/clk?id=48808831iu=/4140/ostg.clktrk___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] Question about using sample weights to fit an svm
Hi Anne, I would also expect that using uniform weights should result in the same solution as no weights -- but maybe there is an interaction with the C parameter... for this we would need to know more about the internals of libsvm and how it handles sample weights - try scaling C by ``len(y_train)`` and see what you get :-) PS: if you use the linear svm implemented by SGDClassifier(loss='hinge') you would also get this effect that uniform weights scale the regularization parameter. best, Peter 2013/7/12 Anne Dwyer anne.p.dw...@gmail.com I have been using the sonar data set (I believe this is a sample data set used in many demonstrations of machine learning). It is a two-class data set with 60 features and 208 training examples. I have a question about using sample weights in fitting the SVM model. When I fit the model using scaled data, I get a test error of 10.3%. When I fit the model using a sample weight vector of 1/N, I get a test error of 37%. Here is the code:

w = np.ones(len(y_train))
clf = svm.SVC(kernel='rbf', C=10, gamma=.01)
clf.fit(x_tr_scaled, y_train)
score_scaled_tr = clf.score(x_tr_scaled, y_train)
score_scaled_test = clf.score(x_te_scaled, y_test)
w = w / sum(w)
clf1 = svm.SVC(kernel='rbf', C=10, gamma=.01, probability=True)
clf1.fit(x_tr_scaled, y_train, sample_weight=w)
print "Training score with sample weights is", clf1.score(x_tr, y_train)
print "Score with sample weights is", clf1.score(x_te_scaled, y_test)

What am I doing wrong here? Also, when I tried this command: Pr = predict_proba(x_tr_scaled) I get the error that predict_proba is an undefined name. However, I got it from this link: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC Any help would be appreciated. Anne Dwyer -- Peter Prettenhofer
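Peter's hunch about an interaction with C can be made concrete. In the weighted SVM objective each sample's hinge loss is penalized by C * w_i, so uniform weights of 1/N act exactly like shrinking C to C/N. A minimal sketch of that objective (hand-rolled for illustration; `svm_objective` and its arguments are not sklearn/libsvm API):

```python
# Sketch of a weighted SVM primal objective: per-sample penalty is C * w_i.
# With uniform weights w_i = 1/N this reduces to unit weights and C/N,
# which is why w = np.ones(N)/N does not reproduce the unweighted fit.
def svm_objective(theta, X, y, C, w):
    reg = 0.5 * sum(t * t for t in theta)  # L2 regularizer
    hinge = sum(
        wi * max(0.0, 1.0 - yi * sum(t * xi for t, xi in zip(theta, x)))
        for x, yi, wi in zip(X, y, w)
    )  # weighted hinge losses
    return reg + C * hinge
```

Under this formulation, `svm_objective(theta, X, y, 10, [1/N]*N)` equals `svm_objective(theta, X, y, 10/N, [1]*N)` for any `theta`, so the 1/N-weighted model is effectively a far more strongly regularized SVM.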
Re: [Scikit-learn-general] Question about using sample weights to fit an svm
2013/7/12 Peter Prettenhofer peter.prettenho...@gmail.com [...] 2013/7/12 Anne Dwyer anne.p.dw...@gmail.com [...] Also, when I tried this command: Pr = predict_proba(x_tr_scaled) I get the error that predict_proba is an undefined name. However, I got it from this link: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC You forgot the object:

Pr = clf1.predict_proba(x_tr_scaled)

Any help would be appreciated. Anne Dwyer -- Peter Prettenhofer
Re: [Scikit-learn-general] Question about using sample weights to fit an svm
try float(len(y_train)) - seems like the division is done on ints, so C ends up as 0... On 13.07.2013 00:10, Anne Dwyer anne.p.dw...@gmail.com wrote: Peter, Thanks for your answers. When I scale C by len(y_train), I get the following error: ValueError: C <= 0 Anne Dwyer On Fri, Jul 12, 2013 at 3:34 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: [...]
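The ValueError is a Python 2 integer-division artifact: 10 / 208 truncates to 0 before SVC ever sees it. A quick sketch (using // to reproduce Python 2's behaviour under Python 3; the numbers match Anne's C=10 and 208 samples):

```python
# In Python 2, `/` on two ints truncates, so C = 10 / len(y_train) was 0
# and SVC rejected it; float(len(y_train)) restores true division.
C, n_samples = 10, 208
py2_division = C // n_samples        # what 10 / 208 evaluated to in Python 2
fixed = C / float(n_samples)         # the suggested fix: a small positive C
```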
Re: [Scikit-learn-general] Paris Sprint location
I plan on merging some of the GBRT PRs and praising Gilles' new decision tree implementation. 2013/7/11 Lars Buitinck l.j.buiti...@uva.nl 2013/7/11 Mathieu Blondel math...@mblondel.org: What is everyone planning to work on? Just curious :) Py3 was my aim, but that seems to be almost tackled, so I guess I'll concentrate on getting my proposed scorer API into master. I might want to try my hand at implementing quadratic features in FeatureHasher. -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam -- Peter Prettenhofer
Re: [Scikit-learn-general] Extremely poor SVM performance
What is actually quite interesting is that the worst model has an AUC of 0.29, which is actually an AUC of 0.71 if you invert the predictions. 2013/7/8 Olivier Grisel olivier.gri...@ensta.org Alternatively you can use `score_func=f1_score` in 0.13 to look for models that trade off precision and recall on unbalanced datasets. -- Olivier -- This SF.net email is sponsored by Windows: Build for Windows Store. http://p.sf.net/sfu/windows-dev2dev ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -- Peter Prettenhofer
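The inversion trick works because flipping a score ranking turns every incorrectly ordered positive/negative pair into a correctly ordered one: with no tied scores, AUC(-s) = 1 - AUC(s). A tiny pairwise AUC to see this (illustrative; not sklearn's roc_auc_score):

```python
# AUC as the fraction of positive/negative pairs ranked correctly.
# Negating the scores reverses every pair, so the AUC flips to 1 - AUC
# (assuming no tied scores).
def pairwise_auc(y_true, scores):
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / float(len(pos) * len(neg))
```

So a classifier scoring 0.29 is not useless at all: it ranks the classes consistently, just backwards.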
Re: [Scikit-learn-general] RandomForests - where do we select a subset of features during fitting?
Hi Ian, 2013/7/7 Ian Ozsvald i...@ianozsvald.com Hi all. I'm following the RandomForest code (in dev, from a one-week-old checkout). As I understand it (and similar to the previous post - I have some RF usage experience but nothing fundamental), RF uses a weighted sample of examples to learn *and* a random subset of features when building its decision trees. Correct - although weighted samples are optional - usually RF takes a bootstrap sample, and this is implemented via sample_weights (e.g. a sample that is picked twice for the bootstrap has weight 2.0). Does the scikit-learn implementation use a random subset of features? I've followed the code in forest.py and I can't find where the choice might be made. I haven't looked at the C code for the DecisionTree. It's in the implementation of DecisionTree - see sklearn/tree/_tree.pyx - look for the for loop over ``features``. I'm interested to learn the lower bound of the number of random features that can be chosen. Could you elaborate on that? I'm also curious to understand where we can restrict the depth of the RandomForest classifier. All I can see is that in forest.py the constructor takes but ignores the max_depth argument:

class RandomForestClassifier(ForestClassifier):
    ...
    def __init__(self, n_estimators=10, criterion="gini", max_depth=None, ...):
        super(RandomForestClassifier, self).__init__(
            base_estimator=DecisionTreeClassifier(), ...)

base.py._make_estimator just clones the existing base_estimator. Am I missing something? After cloning it calls ``set_params`` with ``estimator_params`` - ``'max_depth'`` is one of those. best, Peter Thanks for listening, Ian. -- Ian Ozsvald (A.I. researcher) i...@ianozsvald.com http://IanOzsvald.com http://MorConsulting.com/ http://Annotate.IO http://SocialTiesApp.com/ http://TheScreencastingHandbook.com http://FivePoundApp.com/ http://twitter.com/IanOzsvald http://ShowMeDo.com -- Peter Prettenhofer
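To make the feature-subsetting step concrete: at each candidate split the tree draws a random subset of max_features feature indices and only evaluates those; the lower bound Ian asks about is a single feature per split (max_features=1). A sketch of the idea (illustrative only; the real loop lives in sklearn/tree/_tree.pyx, and `candidate_features` is a made-up helper):

```python
import random

# At each split, sample max_features distinct feature indices to evaluate;
# max_features can go as low as 1, giving maximally randomized splits.
def candidate_features(n_features, max_features, rng):
    return rng.sample(range(n_features), max_features)

rng = random.Random(0)
subset = candidate_features(60, 8, rng)  # e.g. ~sqrt(60) features per split
```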
Re: [Scikit-learn-general] Questions for plot_forest_iris.py and AdaBoost
2013/7/7 Ian Ozsvald i...@ianozsvald.com Hi all. I have a couple of questions about the demo image for the AdaBoost classifier in the dev branch: http://scikit-learn.org/dev/auto_examples/ensemble/plot_forest_iris.html I've worked through the underlying code and I understand what's being plotted, but I think the AdaBoost example (final column) is in error. I figured checking my reasoning made sense before filing a bug report (I have some possible patches too). The first column is for a DecisionTree (with no limit on tree depth); the plot makes sense. The second and third columns are for a RandomForest and an ExtraTrees classifier (with DecisionTrees with no depth limit). The plots for columns 2 and 3 are made by iterating over the 30 classifiers and plotting each decision surface with an alpha of 0.1. The fourth column is for an AdaBoost classifier using a DecisionTree with no limit on max depth. The plots in this column don't look right - the red regions clearly encompass where the yellow dots are drawn (this is particularly obvious in the bottom-right plot). The problem is that the weights for the ensemble of classifiers in AdaBoost aren't taken into account; I believe the alpha value for the plot should use these weights. This raises another problem, but let me check first - does my logic (weights being required for the plot to make sense) sound ok? I think you are correct - we should definitely fix that - let's create an issue for that. Checking clf.score (and calling clf.predict in the yellow regions) shows that the underlying classifications are correct (in the yellow regions with AdaBoost the yellow class is chosen). I'm pretty confident it is just the display that's in error. I guess possibly the display is meant to force the user to question why the classifications look wrong and to reason about the weights in AdaBoost, but I'm probably overthinking this! Regards, Ian. -- Ian Ozsvald (A.I. researcher) i...@ianozsvald.com http://IanOzsvald.com http://MorConsulting.com/ http://Annotate.IO http://SocialTiesApp.com/ http://TheScreencastingHandbook.com http://FivePoundApp.com/ http://twitter.com/IanOzsvald http://ShowMeDo.com -- Peter Prettenhofer
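A sketch of the proposed fix: instead of a constant alpha of 0.1, scale each estimator's overlay by its AdaBoost weight. In sklearn the fitted weights are available as `estimator_weights_`; the values below are made up for illustration:

```python
# Hypothetical per-estimator AdaBoost weights (stand-in for
# clf.estimator_weights_); normalizing them yields plot alphas that
# reflect each tree's actual contribution to the weighted vote.
estimator_weights = [2.0, 1.5, 0.5, 1.0]
total = sum(estimator_weights)
alphas = [w / total for w in estimator_weights]  # normalized to sum to 1
```

With constant alpha, a low-weight tree that paints a large red region dominates the picture even though it barely influences the combined prediction, which matches the mismatch Ian describes.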
Re: [Scikit-learn-general] Questions for plot_forest_iris.py and AdaBoost
Issue is here: https://github.com/scikit-learn/scikit-learn/issues/2133 2013/7/7 Peter Prettenhofer peter.prettenho...@gmail.com [...] -- Peter Prettenhofer
Re: [Scikit-learn-general] Meaning of l1_ratio in SGDRegressor
Andy, can you comment on this? Seems like the l1_ratio is indeed not correct - the code is a bit confusing since we rename rho -> l1_ratio -> rho again... We should open an issue for that. 2013/7/2 Mark Levy mark.l...@mendeley.com Hi there, In the docstring of SGDRegressor it says l1_ratio=0 corresponds to the L2 penalty and l1_ratio=1 to L1. But looking at the implementation, self.l1_ratio is passed as the value of the rho argument to plain_sgd(), and there I see:

if penalty_type == L2:
    rho = 1.0
elif penalty_type == L1:
    rho = 0.0

Is there some confusion here, aside from in my head? Thanks! Mark -- Peter Prettenhofer
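For reference, the penalty the SGDRegressor docstring describes is the elastic net; written out, l1_ratio=1 is pure L1 and l1_ratio=0 pure L2, which is why the opposite rho mapping in plain_sgd looks inverted. A sketch of the documented formula (not the Cython implementation):

```python
# Elastic net penalty as documented: l1_ratio interpolates between
# pure L2 (l1_ratio=0) and pure L1 (l1_ratio=1).
def elastic_net_penalty(w, l1_ratio):
    l1 = sum(abs(x) for x in w)   # L1 term
    sq = sum(x * x for x in w)    # squared L2 term
    return l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * sq
```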
Re: [Scikit-learn-general] Adding Sparse Autoencoder to Scikit
I strongly recommend reading Jake's blog entries on Cython (memoryviews in particular) [1] and Wes' blog [2],[3]. Another great resource is the ball_tree.pyx code in /sklearn/neighbors/ball_tree.pyx. When you compile the pyx file to C using cython, you should use the flag -a - it will generate an html file that shows what C code has been generated for the corresponding Cython statements. best, Peter [1] http://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/ [2] http://wesmckinney.com/blog/?p=215 [3] http://wesmckinney.com/blog/?p=215 2013/6/26 Robert Layton robertlay...@gmail.com The basics of Cython are, and I'm not kidding here, quite easy to learn. Steps: 1) Rename the .py file to .pyx 2) Put int in front of all declarations that will be integers, float in front of things that are floats. (If you know Java/C/C++ etc., this will feel really natural) 3) Compile with cython - *cython filename.pyx* 4) Done. After that, it gets slightly more complicated -- i.e. importing properly and using cdef etc. I can never remember the method for numpy arrays, but Google helps with that. Good luck! On 26 June 2013 03:27, Issam issamo...@gmail.com wrote: Very helpful information! Thanks @Olivier! I'll do my best! -- Public key at: http://pgp.mit.edu/ Search for this email address and select the key from 2011-08-19 (key id: 54BA8735) -- Peter Prettenhofer
Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding
You already use one-hot encoding in your example (preprocessing.OneHotEncoder). 2013/6/21 Maheshakya Wijewardena pmaheshak...@gmail.com Can anyone give me a sample algorithm for one-hot encoding used in scikit-learn? On Thu, Jun 20, 2013 at 8:37 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: You can try an ordinal encoding instead - just map each categorical value to an integer so that you end up with 8 numerical features - if you use enough trees and grow them deep, it may work. 2013/6/20 Maheshakya Wijewardena pmaheshak...@gmail.com And yes Gilles, it is the Amazon challenge :D On Thu, Jun 20, 2013 at 8:21 PM, Maheshakya Wijewardena pmaheshak...@gmail.com wrote: The shape of X after encoding is (32769, 16600). Seems as if that is too big to be converted into a dense matrix. Can Random Forest handle this number of features? On Thu, Jun 20, 2013 at 7:31 PM, Olivier Grisel olivier.gri...@ensta.org wrote: 2013/6/20 Lars Buitinck l.j.buiti...@uva.nl: 2013/6/20 Olivier Grisel olivier.gri...@ensta.org: Actually twice as much, even on a 32-bit platform (float size is always 64 bits). The decision tree code always uses 32-bit floats: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38 but you have to cast your data to `dtype=np.float32` in Fortran layout ahead of time to avoid the memory copy. OneHot produces np.float, though, which is float64. Alright, but you could convert it to np.float32 before calling toarray. But anyway, this kind of sparsity level is unsuitable for random forests anyways, I think. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding
Hi, it seems like your sparse matrix is too large to be converted to a dense matrix. What shape does X have? How many categorical variables do you have (before applying the OneHotEncoder)?
Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding
You can try an ordinal encoding instead - just map each categorical value to an integer so that you end up with 8 numerical features - if you use enough trees and grow them deep, it may work. 2013/6/20 Maheshakya Wijewardena pmaheshak...@gmail.com [...] -- Peter Prettenhofer
Re: [Scikit-learn-general] test failed after installaing scikit
Could it be that the folder you're in (~/scikit-learn) contains the scikit-learn sources? 2013/6/6 linxpwww linxp...@163.com All, in my Ubuntu (uname -a): Linux ubuntu 3.2.0-29-generic-pae #46-Ubuntu SMP Fri Jul 27 17:25:43 UTC 2012 i686 i686 i386 GNU/Linux, after installing scikit-learn from the source package following https://pypi.python.org/pypi/scikit-learn/ , running 'nosetests --exe sklearn' gives the following error:

root@ubuntu:~/scikit-learn# nosetests --exe sklearn
E
==
ERROR: Failure: ImportError (No module named _check_build
___
Contents of /root/scikit-learn/sklearn/__check_build:
_check_build.pyx  setup.pyc  __init__.py  _check_build.c  __init__.pyc  setup.py
___
It seems that scikit-learn has not been built correctly. If you have installed scikit-learn from source, please do not forget to build the package before using it: run `python setup.py install` or `make` in the source directory. If you have used an installer, please check that it is suited for your Python version, your operating system and your platform.)
--
Traceback (most recent call last):
  File /usr/lib/python2.7/dist-packages/nose/loader.py, line 390, in loadTestsFromName
    addr.filename, addr.module)
  File /usr/lib/python2.7/dist-packages/nose/importer.py, line 39, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File /usr/lib/python2.7/dist-packages/nose/importer.py, line 86, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File /root/scikit-learn/sklearn/__init__.py, line 31, in module
    from . import __check_build
  File /root/scikit-learn/sklearn/__check_build/__init__.py, line 46, in module
    raise_build_error(e)
  File /root/scikit-learn/sklearn/__check_build/__init__.py, line 41, in raise_build_error
    %s % (e, local_dir, ''.join(dir_content).strip(), msg))
ImportError: No module named _check_build
[...]
--
Ran 1 test in 0.001s
FAILED (errors=1)

There were no errors during building and installing - could you help me? Thanks, Aaron -- How ServiceNow helps IT people transform IT departments: 1. A cloud service to automate IT design, transition and operations 2. Dashboards that offer high-level views of enterprise services 3. A single system of record for all IT processes http://p.sf.net/sfu/servicenow-d2d-j ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -- Peter Prettenhofer
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Christian, I believe more in my results than in my expertise - and so should you :-) ** I think you misunderstood me: I did not claim that one-hot encoded categorical features give better results than ordinal encoded ones - I just claimed that ordinal encoding works as well as one-hot encoded features given that you have deep enough trees. But I have to warn you: I cannot support my claim with (sufficient) data. So at the end of the day, it's always best to run an experiment and test it on your problem at hand. Anyway, I cannot really see your problem (or what you did wrong): according to your description it seems that the specific encoding (one-hot vs. ordinal) has no influence on the effectiveness of the model (no significant difference)? This is in line with observations by others. Andy raised a very important point though: if you optimized your hyperparameters (tree depth, min split size, ...) on the ordinal encoding and then tested those hyperparameters on a one-hot encoding, you are giving an advantage to the ordinal encoding. HTH, Peter ** that being said, I'm still quite skeptical when it comes to my results 2013/6/4 Christian Jauvin cjau...@gmail.com Many thanks to all for your help and detailed answers, I really appreciate it. So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form. So from the same dataset I mentioned earlier, I picked another subset of 5 features, this time all with small cardinality (5, 5, 6, 11 and 12), and all purely categorical (i.e. clearly not ordered). The one-hot encoding should clearly help with such a configuration. But again, what I observe when I pit the fully one-hot encoded RF (21000 x 39) against the ordinal-encoded one (21000 x 5) is that they're behaving almost the same, in terms of accuracy and AUC, with 10-fold cross-validation.
In fact, the ordinal version even seems to perform very slightly better, although I don't think it's significant. I really believe in your expertise more than in my results, so what could I be doing wrong? On 3 June 2013 04:56, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/03/2013 09:15 AM, Peter Prettenhofer wrote: Our decision tree implementation only supports numerical splits; i.e. it tests ``val <= threshold``. Categorical features need to be encoded properly. I recommend one-hot encoding for features with small cardinality (e.g. < 50) and ordinal encoding (simply assign each category an integer value) for features with large cardinality. This seems to be the opposite of what the kaggle tutorial suggests, right? They suggest ordinal encoding for small cardinality, but don't suggest any other way. Your and Gilles' feedback makes me think we should tell the kaggle people to change their tutorial. -- Get 100% visibility into Java/.NET code with AppDynamics Lite It's a free troubleshooting tool designed for production Get down to code-level detail for bottlenecks, with 2% overhead. Download for free and get started troubleshooting in minutes. http://p.sf.net/sfu/appdyn_d2d_ap2 -- Peter Prettenhofer
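The thread's takeaway - just try both encodings on your own data - can be sketched as a toy experiment. Everything below (cardinalities, sample size, the synthetic label) is invented for illustration and is not Christian's actual dataset:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 500
# three purely categorical features with small cardinality, ordinal-encoded
X_ord = rng.randint(0, 5, size=(n, 3))
# synthetic label that depends on a single category of the first feature
y = (X_ord[:, 0] == 2).astype(int)

# same data, one-hot encoded (sparse indicator matrix)
X_hot = OneHotEncoder().fit_transform(X_ord)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
acc_ord = cross_val_score(rf, X_ord, y, cv=5).mean()
acc_hot = cross_val_score(rf, X_hot, y, cv=5).mean()
print(acc_ord, acc_hot)
```

With deep enough trees the ordinal version can isolate category 2 with two threshold splits, which is why the two scores tend to come out very close, matching Christian's observation.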
Re: [Scikit-learn-general] ROC for OneClassSVM
Hi Carlos, take a look at the species distribution example [1]. Summary: use ``OneClassSVM.decision_function`` - you don't necessarily need probabilities for ROC/AUC - confidence values are fine. best, Peter [1] http://scikit-learn.org/stable/auto_examples/applications/plot_species_distribution_modeling.html#example-applications-plot-species-distribution-modeling-py 2013/5/7 ctme...@unizar.es OK, thank you. I will do it that way. Carlos Quoting scikit-learn-general-requ...@lists.sourceforge.net: Today's Topics: 1. Re: ROC for OneClassSVM (Andreas Mueller) -- Message: 1 Date: Mon, 06 May 2013 12:33:03 +0200 From: Andreas Mueller amuel...@ais.uni-bonn.de Subject: Re: [Scikit-learn-general] ROC for OneClassSVM To: scikit-learn-general@lists.sourceforge.net On 05/06/2013 12:27 PM, ctme...@unizar.es wrote: Hello, I would like to use OneClassSVM for novelty detection. I have some 'normal' data for fitting the classifier. Then I have 'normal' and 'abnormal' data for testing the performance. I would like to use the area under the ROC curve as the figure of merit of the detector. The function roc_curve needs the predicted probability. I have read that the probability can be obtained if the classifier is fitted with the parameter probability=True. However, I get an error when I try to pass this parameter. I am using version 0.10 of sklearn. For instance:

import sklearn
import sklearn.metrics
import scipy
import sklearn.svm
X = scipy.random.randn(100, 2)
X_train = scipy.r_[X + 2, X - 2]
clf = sklearn.svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1, probability=True)

Then I get an error. I have also tried

clf = sklearn.svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train, probability=True)

but that is again an error. Is that option available for OneClassSVM? If not, how could I draw the ROC?
Could I sweep a threshold on the distance to the hyperplane given by clf.decision_function? Yes, I think this is what you should do. Hth, Andy -- Learn Graph Databases - Download FREE O'Reilly Book Graph Databases is the definitive new guide to graph databases and their applications. This 200-page book is written by three acclaimed leaders in the field. The early access version is available now. Download your free book today! http://p.sf.net/sfu/neotech_d2d_may -- Peter Prettenhofer
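Sweeping a threshold on ``decision_function`` is exactly what the ROC machinery does for you. A minimal sketch with a current scikit-learn (the 'normal'/'abnormal' data and the hyperparameters below are invented for illustration, not Carlos's setup):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                      # 'normal' data only, for fitting
X_normal = rng.randn(100, 2)                     # held-out 'normal' test data
X_abnormal = rng.uniform(-6, 6, size=(100, 2))   # 'abnormal' test data

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)

X_test = np.vstack([X_normal, X_abnormal])
y_true = np.r_[np.ones(100), np.zeros(100)]      # 1 = normal, 0 = abnormal

# signed distance to the decision boundary -- no probabilities needed
scores = clf.decision_function(X_test).ravel()
auc = roc_auc_score(y_true, scores)
print(auc)
```

``roc_curve(y_true, scores)`` would likewise accept the confidence values directly to draw the curve itself.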
Re: [Scikit-learn-general] Better prediction probabilities with SVM
2013/5/7 Lars Buitinck l.j.buiti...@uva.nl 2013/5/7 Peter Prettenhofer peter.prettenho...@gmail.com: Do you need probabilities? You could just use the signed distance to each OVA hyperplane (via ``clf.decision_function()``) to rank the classes. Maybe the Platt scaling screws up here... The more I find out about Platt scaling in LibSVM, the more I'm inclined to stay away from it. You could also look at Mathieu's lightning project https://github.com/mblondel/lightning - it features multinomial logistic regression which might give better calibrated probabilities than Platt scaling... Or our own LogisticRegression. It cuts some corners, but sometimes it's good enough. Right, it should give you the same ordering as ``decision_function`` (just normalized). -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam -- Peter Prettenhofer
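Lars's closing remark - that LogisticRegression's probabilities induce the same class ranking as its ``decision_function`` - is easy to verify, since ``predict_proba`` is a row-wise monotone transform (softmax/sigmoid plus normalization) of the raw scores. A quick sanity check on iris with a current scikit-learn (not from the original thread):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

df = clf.decision_function(X)   # raw per-class scores
proba = clf.predict_proba(X)    # normalized probabilities

# the class ranking induced by the two is identical, row by row
same = np.all(np.argsort(df, axis=1) == np.argsort(proba, axis=1))
print(same)
```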
Re: [Scikit-learn-general] GSoC 2013 : Multinomial Logistic Regression
2013/5/2 Mathieu Blondel math...@mblondel.org On Thu, May 2, 2013 at 5:21 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: this looks pretty awesome - especially the dataset abstraction is pretty neat - would be great if we could merge this into scikit-learn. Merging the dataset abstraction would be nice. We could port some of scikit-learn's code to it, including SGD and mini-batch k-means. The neural network PR by Lars could also benefit from it. totally agree - I can raise this issue and work on it at the sprint - shouldn't take too long - we would need to port SGD first anyway. BTW, do you think we should keep the weight vector abstraction which is in scikit-learn? The idea behind the abstraction was to implement averaged SGD/Perceptron easily - I didn't finish the PR though... So I guess the answer is: no. btw: what kind of truncated gradient algorithm does lightning use for L1-penalized SGD? As far as I can see it's not the one that's currently used in SGDClassifier... It's the regular truncated SGD by John Langford, which is identical to the method described in the FOBOS paper. Compared to the one in scikit-learn, it is more theoretically correct. The one in scikit-learn obtains sparser weight vectors in practice but has no theoretical justification (it's a heuristic). My goal was to compare coordinate descent with regular truncated/projected SGD, so I didn't implement this heuristic. ok - probably better to use this one (or the projection-based method by Duchi) - on the other hand, the Tsuruoka et al. method served me quite well in the past. thx, Peter Mathieu -- Introducing AppDynamics Lite, a free troubleshooting tool for Java/.NET Get 100% visibility into your production application - at no cost. Code-level diagnostics for performance bottlenecks with 2% overhead Download for free and get started troubleshooting in minutes. http://p.sf.net/sfu/appdyn_d2d_ap1 -- Peter Prettenhofer
Re: [Scikit-learn-general] Effects of shifting and scaling on Gradient Descent
learning toolkit. Gradient descent is a general class of optimization algorithms. Gaël -- Try New Relic Now We'll Send You this Cool Shirt New Relic is the only SaaS-based application performance monitoring service that delivers powerful full stack analytics. Optimize and monitor your browser, app, servers with just a few lines of code. Try New Relic and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr -- Peter Prettenhofer
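As a concrete illustration of the thread's topic - why shifting and scaling matter for gradient descent - features on wildly different scales make the stochastic gradient updates dominated by the large-scale columns. A sketch with invented scale factors, assuming a current scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X = X * np.logspace(0, 4, 10)   # give the features wildly different scales

raw = SGDClassifier(max_iter=1000, random_state=0)
scaled = make_pipeline(StandardScaler(),
                       SGDClassifier(max_iter=1000, random_state=0))

acc_raw = cross_val_score(raw, X, y, cv=5).mean()
acc_scaled = cross_val_score(scaled, X, y, cv=5).mean()
print(acc_raw, acc_scaled)
```

Standardizing inside a pipeline (rather than once on the full data) also keeps the scaling statistics from leaking across cross-validation folds.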
Re: [Scikit-learn-general] Distributed RandomForests
Hi Youssef, please make sure that you use the latest version of sklearn (>= 0.13) - we did some enhancements to the sub-sampling procedure lately. Looking at the RandomForest code - it seems that n_jobs=-1 should not be the issue for the parallel training of the trees, since ``n_jobs = min(cpu_count(), self.n_estimators)``, which should be just 3 in your case; however, it will use cpu_count() processes to sort the feature values - so the bottleneck might be here. Please try to set the n_jobs parameter to a smaller constant (e.g. 4) and check if it works better. Having said that: 1E8 samples is pretty large - the largest dataset that I've used so far was merely 1E6, but I've heard that people have used it for larger datasets too (probably not 1E8 though). Running the code on a cluster using IPython parallel should not be too hard - RF is a pretty simple algorithm - you could either patch the existing code to use IPython parallel instead of joblib.Parallel (see forest.py) or simply write your own RF code which directly uses ``DecisionTreeClassifier``. Also, you can likely skip bootstrapping - it doesn't help much IMHO and can make the implementation a bit more involved - AFAIK the MSR guys didn't use bootstrapping for their Kinect RF system... When it comes to other implementations you could look at rt-rank [1], which is a parallel implementation of both GBRT and RF; and WiseRF [2], which is compatible with sklearn but you have to obtain a license (free trial and academic version AFAIK). HTH, Peter [1] https://sites.google.com/site/rtranking/ [2] http://about.wise.io/ On 25.04.2013 03:22, Youssef Barhomi youssef.barh...@gmail.com wrote: Hello, I am trying to reproduce the results of this paper: http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with different kinds of data (monkey depth maps instead of humans). So I am generating my depth features and training and classifying data with a random forest with quite similar parameters to the paper.
I would like to use sklearn.ensemble.RandomForestClassifier with 1E8 samples with 500 features. Since it seems to be a large dataset of feature vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6 samples), and the last one seemed to be slower than O(n_samples*n_features*log(n_samples)) according to this: http://scikit-learn.org/stable/modules/tree.html#complexity Since 1E6 samples are taking a long time and I don't know when they will be done, I would like better ways to estimate the ETA or find a way to speed up the training. Also, I am watching my memory usage and I don't seem to be swapping (29GB/48GB being used right now). The other thing is that I requested n_jobs = -1 so it could use all cores of my machine (24 cores), but looking at my CPU usage, it doesn't seem to be using any of them... So, do you guys have any ideas on: - would 1E8 samples be doable with your implementation of random forests (3 trees, 20 levels deep)? - running this code on a cluster using different IPython engines? or would that require a lot of work? - PCA for dimensionality reduction? (on the paper, they haven't used any dim reduction, so I am trying to avoid that) - other implementations that I could use for large datasets? PS: I am very new to this library but I am already impressed!! It's one of the cleanest and probably most intuitive machine learning libraries out there, with pretty impressive documentation and tutorials. Pretty amazing work!!
Thank you very much, Youssef

### Here is a code snippet:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
import time
import numpy as np

n_samples = 1000
n_features = 500
X, y = make_classification(n_samples, n_features, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1)
clf = RandomForestClassifier(max_depth=20, n_estimators=3, criterion='entropy', n_jobs=-1, verbose=10)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
tic = time.time()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print 'Time taken:', time.time() - tic, 'seconds'

-- Youssef Barhomi, MSc, MEng. Research Software Engineer at the CLPS department Brown University T: +1 (617) 797 9929 | GMT -5:00
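Peter's suggestion to write your own RF code directly on top of ``DecisionTreeClassifier`` (skipping bootstrapping, so each tree fit is an independent function call that could be shipped to a separate IPython parallel engine) might look roughly like this. The dataset and the ``fit_tree`` helper are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# a small stand-in for the full 1E8 x 500 dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def fit_tree(seed):
    # no bootstrapping: randomness comes only from feature sub-sampling
    tree = DecisionTreeClassifier(max_depth=20, criterion="entropy",
                                  max_features="sqrt", random_state=seed)
    return tree.fit(X, y)

# locally this is a list comprehension; on a cluster each call could be
# submitted to a different engine (e.g. via IPython parallel's map)
trees = [fit_tree(seed) for seed in range(3)]   # 3 trees, as in the paper

# majority vote over the individual trees
votes = np.mean([t.predict(X) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
train_acc = (y_pred == y).mean()
print(train_acc)
```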
Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data
I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forests in particular are not a good fit for sparse datasets - if the average number of non-zero values for each feature is low, then your partitions will be relatively small - any subsequent splits will make the partitions even smaller, thus you cannot grow your trees deep since you will run out of samples. This means that your tree in fact uses just a tiny fraction of the available features (compared to a deep tree) - unless you have a few pretty strong features or you train lots of trees, this won't work out. This is probably also the reason why most of the decision tree work in natural language processing is done using boosted decision trees of depth one. If your features are boolean, then such a model is in fact pretty similar to a simple logistic regression model. I have the impression that Random Forest in particular is a poor evidence accumulator (pooling evidence from lots of weak features) - linear models and boosted trees are much better here. best, Peter 2013/4/24 Brian Holt bdho...@gmail.com At the moment your three options are 1) get more memory 2) do feature selection - 400k features on 200k samples seems to me to contain a lot of redundant information or irrelevant features 3) submit a PR to support sparse matrices - this is going to be a lot of work and I doubt it's worth it. All the best Brian On Apr 24, 2013 5:14 AM, Calvin Morrison mutanttur...@gmail.com wrote: get more memory? On 23 April 2013 17:06, Alex Kopp ark...@cornell.edu wrote: Hi, I am looking to build a random forest regression model with a pretty large amount of sparse data. I noticed that I cannot fit the random forest model with a sparse matrix. Unfortunately, a dense matrix is too large to fit in memory. What are my options?
For reference, I have just over 400k features and just over 200k training examples. -- Peter Prettenhofer
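Brian's option 2), feature selection, can be sketched as follows. The sparse count data below is invented, standing in for the real 200k x 400k matrix; ``chi2`` assumes non-negative features and a classification target (for Alex's regression problem one would use e.g. ``f_regression`` instead):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
# sparse non-negative count data standing in for the real matrix
X = sparse.csr_matrix(rng.poisson(0.05, size=(1000, 2000)).astype(np.float64))
y = rng.randint(0, 2, size=1000)

# keep only the k features most associated with the target
selector = SelectKBest(chi2, k=100)
X_small = selector.fit_transform(X, y)
print(X.shape, "->", X_small.shape)
```

The selector works on the sparse matrix directly, so the reduced data can stay sparse until it is small enough to densify for the forest.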
Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data
2013/4/24 Olivier Grisel olivier.gri...@ensta.org 2013/4/24 Peter Prettenhofer peter.prettenho...@gmail.com: I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forests in particular are not a good fit for sparse datasets - if the average number of non-zero values for each feature is low, then your partitions will be relatively small - any subsequent splits will make the partitions even smaller, thus you cannot grow your trees deep since you will run out of samples. This means that your tree in fact uses just a tiny fraction of the available features (compared to a deep tree) - unless you have a few pretty strong features or you train lots of trees, this won't work out. This is probably also the reason why most of the decision tree work in natural language processing is done using boosted decision trees of depth one. If your features are boolean, then such a model is in fact pretty similar to a simple logistic regression model. I have the impression that Random Forest in particular is a poor evidence accumulator (pooling evidence from lots of weak features) - linear models and boosted trees are much better here. Very interesting consideration. Any reference paper to study this in more detail (both theory and empirical validation)? actually, no - just gut feeling based on how decision trees / RF work (hard non-intersecting partitions) - I will try to dig something up - would definitely like to hear any critics/remarks on my view though. Also do you have a good paper that demonstrates state-of-the-art results with boosted stumps for NLP?
I haven't seen any use of boosted stumps in NLP for a while - but maybe I didn't pay close attention - what comes to my mind is some work by Xavier Carreras on NER for CoNLL 2002 (see [1] for an overview of the shared task - actually, a number of participants used boosting/trees). Joseph Turian used boosting in his thesis on parsing [2]. [1] http://acl.ldc.upenn.edu/W/W02/W02-2024.pdf [2] http://cs.nyu.edu/web/Research/Theses/turian_joseph.pdf -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data
Have you tried tuning the hyper-parameters of the SGDRegressor? You really need to tune the learning rate for SGDRegressor (SGDClassifier has a pretty decent default). E.g. set up a grid search w/ a constant learning rate and try different values of eta0 ([0.1, 0.01, 0.001, 0.0001]). You can also set verbose=3 to see the loss after each epoch, which you can use to check the convergence. 2013/4/24 Alex Kopp ark...@cornell.edu Thanks, guys. Perhaps I should explain what I am trying to do and then open it up for suggestions. I have 203k training examples, each with 457k features. The features are composed of one-hot encoded categorical values as well as stemmed, TF-IDF weighted unigrams and bigrams (NLP). As you can probably guess, the overwhelming majority of the features are the unigrams and bigrams. In the end, I am looking to build a regression model. I have tried a grid search on SGDRegressor, but have not had any promising results (~0.00 or even negative R^2 values). I would appreciate ideas/suggestions. Thanks ps, if it matters, I have 8 cores and 52gb ram at my disposal. On Wed, Apr 24, 2013 at 5:32 AM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: 2013/4/24 Olivier Grisel olivier.gri...@ensta.org 2013/4/24 Peter Prettenhofer peter.prettenho...@gmail.com: I totally agree with Brian - although I'd suggest you drop option 3) because it will be a lot of work. I'd suggest you rather do a) feature extraction or b) feature selection. Personally, I think decision trees in general and random forests in particular are not a good fit for sparse datasets - if the average number of non-zero values for each feature is low, then your partitions will be relatively small - any subsequent splits will make the partitions even smaller, thus you cannot grow your trees deep since you will run out of samples.
This means that your tree in fact uses just a tiny fraction of the available features (compared to a deep tree) - unless you have a few pretty strong features or you train lots of trees, this won't work out. This is probably also the reason why most of the decision tree work in natural language processing is done using boosted decision trees of depth one. If your features are boolean, then such a model is in fact pretty similar to a simple logistic regression model. I have the impression that Random Forest in particular is a poor evidence accumulator (pooling evidence from lots of weak features) - linear models and boosted trees are much better here. Very interesting consideration. Any reference paper to study this in more detail (both theory and empirical validation)? actually, no - just gut feeling based on how decision trees / RF work (hard non-intersecting partitions) - I will try to dig something up - would definitely like to hear any critics/remarks on my view though. Also do you have a good paper that demonstrates state-of-the-art results with boosted stumps for NLP? I haven't seen any use of boosted stumps in NLP for a while - but maybe I didn't pay close attention - what comes to my mind is some work by Xavier Carreras on NER for CoNLL 2002 (see [1] for an overview of the shared task - actually, a number of participants used boosting/trees). Joseph Turian used boosting in his thesis on parsing [2]. [1] http://acl.ldc.upenn.edu/W/W02/W02-2024.pdf [2] http://cs.nyu.edu/web/Research/Theses/turian_joseph.pdf -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
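Peter's grid-search suggestion in concrete form, on synthetic data (with a current scikit-learn the import lives in ``sklearn.model_selection``, not the 2013-era ``grid_search`` module; the target is standardized first, since SGD is sensitive to the scale of y as well):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
y = (y - y.mean()) / y.std()   # SGD is sensitive to the target scale too

# constant learning rate, sweeping eta0 as suggested in the thread
param_grid = {"eta0": [0.1, 0.01, 0.001, 0.0001]}
sgd = SGDRegressor(learning_rate="constant", max_iter=1000, random_state=0)
search = GridSearchCV(sgd, param_grid, scoring="r2", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Setting ``verbose`` on the estimator, as Peter notes, prints the loss per epoch so divergence at too-large eta0 values is immediately visible.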
Re: [Scikit-learn-general] Our own Olivier Grisel giving a scipy keynote
That's great - congratulations Olivier! Definitely no pressure ;-) 2013/4/17 Ronnie Ghose ronnie.gh...@gmail.com wow :O congrats On Tue, Apr 16, 2013 at 7:17 PM, Mathieu Blondel math...@mblondel.org wrote: Very well-deserved. Congrats! On Wed, Apr 17, 2013 at 4:48 AM, Gael Varoquaux gael.varoqu...@normalesup.org wrote: I have been somewhat living under a rock lately, so I am not sure that it has been around this mailing list: @ogrisel is giving a keynote at scipy this year. http://conference.scipy.org/scipy2013/keynotes.php Hurray! Congratulations Olivier -- Precog is a next-generation analytics platform capable of advanced analytics on semi-structured data. The platform includes APIs for building apps and a phenomenal toolset for data science. Developers can use our toolset for easy data analysis and visualization. Get a free account! http://www2.precog.com/precogplatform/slashdotnewsletter -- Peter Prettenhofer
Re: [Scikit-learn-general] Sparse Matrix Formats
2013/4/15 Vlad Niculae zephy...@gmail.com:

It really depends on each estimator; there is no single format that is better every time. It's the same as with dense arrays and C versus Fortran ordering. I did a quick check on the supervised methods: the coordinate descent methods (ElasticNet, Lasso) use CSC format for sparse data and Fortran ordering for dense data. All others (SGD, LinearSVC, SVC, naive Bayes, Ridge) assume CSR format for sparse and C ordering for dense. Unfortunately I can't give an example off the top of my head, but I think that between SVC, LinearSVC and SGDClassifier, two of them must disagree on this. The best way to know is to thoroughly check the docs of the objects you're working with. If nothing is said there, go to the source code; the first couple of lines will often clue you in. Algorithms that have been optimized for a specific format will usually convert the data to that format before starting, via ``utils.check_arrays``: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L127

Cheers, Vlad

On Mon, Apr 15, 2013 at 4:00 AM, Philipp Singer kill...@gmail.com wrote: AFAIK scikit-learn works with CSR matrices internally, since many mathematical operations are only implemented for CSR matrices.

On 14.04.2013 20:01, Alex Kopp wrote: Is there a sparse matrix format that is most efficient for sklearn? (COO vs CSR vs LIL) Thanks
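Vlad's point - same data, different internal layout - can be illustrated with scipy directly (a minimal sketch):

```python
import numpy as np
from scipy import sparse

X = np.array([[0.0, 1.0, 0.0],
              [2.0, 0.0, 3.0]])

# CSR: fast row slicing / row-wise dot products (SGD, LinearSVC, ...)
X_csr = sparse.csr_matrix(X)
# CSC: fast column access (coordinate descent: Lasso, ElasticNet)
X_csc = sparse.csc_matrix(X)

# Conversion between the two is cheap relative to fitting, which is
# why estimators simply convert to their preferred format up front.
print(X_csr.format, X_csc.format)  # csr csc
print((X_csr.toarray() == X_csc.toarray()).all())  # True
```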
Re: [Scikit-learn-general] [Broken] scikit-learn/scikit-learn#1530 (master - af674ac)
Seems like Travis has trouble fetching from mldata (again) - can I ignore it, or should I trigger the Travis build again and hope it works out?

thx, Peter

2013/4/9 Travis CI notificati...@travis-ci.org: The build was broken. Repository: scikit-learn/scikit-learn. Build #1530: https://travis-ci.org/scikit-learn/scikit-learn/builds/6182009 Changeset: https://github.com/scikit-learn/scikit-learn/compare/382f74c9600f...af674acc878b Commit: af674ac (master). Message: get rid of ``rho`` in sgd documentation - has been replaced by ``l1_ratio``. Author: Peter Prettenhofer. Duration: 4 minutes and 47 seconds.

-- Peter Prettenhofer
Re: [Scikit-learn-general] [Broken] scikit-learn/scikit-learn#1530 (master - af674ac)
Ok - thanks!

2013/4/9 Olivier Grisel olivier.gri...@ensta.org:

2013/4/9 Peter Prettenhofer peter.prettenho...@gmail.com: Seems like Travis has trouble fetching from mldata (again) - can I ignore it or should I trigger the Travis build again and hope it works out?

You can ignore it. The problem is actually not that Travis has trouble fetching from mldata. The problem is that running the doctests on Travis ignores the fixture [1] that should be enabled by the setup.cfg file [2]. This fixture (which installs a mock urllib2.urlopen function to avoid using the network) has always worked on all the workstations I have used, and works on jenkins as well. Something in the Travis environment prevents it from running, though. No idea what.

[1] https://github.com/scikit-learn/scikit-learn/blob/master/doc/datasets/mldata_fixture.py
[2] https://github.com/scikit-learn/scikit-learn/blob/master/setup.cfg#L16

-- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel
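The fixture idea - swapping ``urlopen`` for a canned local response so doctests never hit the network - can be sketched with the stdlib (the payload below is made up; the real fixture serves fake mldata.org data instead):

```python
from unittest import mock
import io
import urllib.request

# Hypothetical replacement: always return a canned in-memory "response".
def fake_urlopen(url, *args, **kwargs):
    return io.BytesIO(b"canned response for %s" % url.encode())

# While the patch is active, no real network access happens.
with mock.patch.object(urllib.request, "urlopen", fake_urlopen):
    body = urllib.request.urlopen("http://mldata.org/some-dataset").read()

print(body)  # b'canned response for http://mldata.org/some-dataset'
```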
Re: [Scikit-learn-general] Scikit-learn-general Digest, Vol 39, Issue 13
Hi, I haven't used libFM (Factorization Machines) myself, but I've heard that others have used them quite successfully. Corey (Lynch) created Cython bindings for libFM: https://github.com/coreylynch/pyLibFM

best, Peter

2013/4/8 Andreas Mueller amuel...@ais.uni-bonn.de: Factorization machines is a 2010 paper with 20 citations. I think that is a clear no.

-- Peter Prettenhofer
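For context, a degree-2 factorization machine (what libFM fits) models pairwise feature interactions through latent vectors; a minimal numpy sketch of its prediction function, using the standard O(k·n) reformulation:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 FM: w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j."""
    linear = w0 + w @ x
    # O(k*n) identity: sum_{i<j} <V_i, V_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ]
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

rng = np.random.RandomState(0)
x = rng.rand(5)
w0, w, V = 0.1, rng.rand(5), rng.rand(5, 3)  # 3 latent factors

# Sanity check against the naive O(n^2) pairwise sum
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(5) for j in range(i + 1, 5))
print(np.isclose(fm_predict(x, w0, w, V), naive))  # True
```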
Re: [Scikit-learn-general] SO question for the tree growers
I posted a brief description of the algorithm. The method that we implement is briefly described in ESLII. Gilles is the expert here; he can give more details on the issue.

2013/4/4 Olivier Grisel olivier.gri...@ensta.org: The variable importance in scikit-learn's implementation of random forests is based on the proportion of samples that are classified by the feature at some point in the evaluation of one of the decision trees. http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation This method seems different from the OOB-based method of Breiman 2001 (section 10): http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf Is there any reference for the method implemented in scikit-learn? Here is the original Stack Overflow question: http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined/15811003?noredirect=1#comment22487062_15811003

-- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel

-- Peter Prettenhofer
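Whatever the exact formula, the importances under discussion are exposed on fitted forests via the ``feature_importances_`` attribute; a quick sketch on synthetic data (the data itself is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

print(importances.argmax())  # 0: the informative feature dominates
print(round(float(importances.sum()), 6))  # 1.0: importances are normalized
```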
Re: [Scikit-learn-general] OOB score in gradient boosting models
Hi Yanir, thanks for raising this issue. I implemented this feature without much thought; furthermore, I haven't used OOB estimates in my own work yet. I need to think more deeply about the issue - I will come back to you. You propose to update ``y_pred`` only for the in-bag samples, correct?

best, Peter

2013/3/22 Andreas Mueller amuel...@ais.uni-bonn.de: Hi Yanir. I was not aware that GradientBoosting had OOB scores. Is that even possible / sensible? It definitely does not do what it promises :-/ Peter, any thoughts? Cheers, Andy

On 03/22/2013 11:39 AM, Yanir Seroussi wrote: Hi, I'm new to the mailing list, so I apologise if this has been asked before. I want to use the oob_score_ in GradientBoostingRegressor to determine the optimal number of iterations without relying on an external validation set, so I set the subsample parameter to 0.5 and trained the model. However, I've noticed that oob_score_ improves in a similar manner to the in-bag scores (train_score_). That is, it goes down very fast and keeps improving regardless of the number of iterations. Digging through the code in ensemble/gradient_boosting.py, it seems like the cause is that oob_score_[i] includes previous trees that were trained on the OOB instances of the i-th sample. Isn't the OOB score supposed to be calculated for each OOB instance using only trees where this instance wasn't used for training (as done for random forests)?

Cheers, Yanir
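The broken ``oob_score_`` attribute discussed here was later reworked; recent scikit-learn releases instead expose per-iteration OOB improvements as ``oob_improvement_`` whenever ``subsample < 1``. A sketch of Yanir's use case - picking the number of iterations without a validation set - under that assumption:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=10.0,
                       random_state=0)

est = GradientBoostingRegressor(n_estimators=200, subsample=0.5,
                                random_state=0).fit(X, y)

# Cumulative OOB improvement; its argmax is a cheap estimate of the best
# number of boosting iterations.
cum_oob = np.cumsum(est.oob_improvement_)
best_n = int(np.argmax(cum_oob)) + 1
print(1 <= best_n <= 200)  # True
```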
Re: [Scikit-learn-general] OOB score in gradient boosting models
I've opened an issue for this: https://github.com/scikit-learn/scikit-learn/issues/1802

2013/3/22 Andreas Mueller amuel...@ais.uni-bonn.de: We should open an issue in the issue tracker.

-- Peter Prettenhofer
Re: [Scikit-learn-general] OOB score in gradient boosting models
2013/3/22 Yanir Seroussi yanir.serou...@gmail.com: Thanks for the quick response. Good to see that I'm not imagining things :-) Before posting this question, I had a look at Friedman's paper, ESLII and the R gbm documentation, but I couldn't find a clear description of how OOB estimates are calculated. I think it makes sense to have a separate y_oob_pred. I'll probably try fixing it locally over the weekend (unless you beat me to it). I'll let you know how it goes.

If you manage to fix it, a PR would be much appreciated! Please keep me posted about your progress.

thanks, peter

Cheers, Yanir

-- Peter Prettenhofer
Re: [Scikit-learn-general] Why Gaussian Naive Bayes is not working as a base classifier?
Issam, currently GaussianNB does not support sample weights, and thus it cannot be used with AdaBoost. In Weka, if a classifier does not support sample weights, they fall back to re-sampling the data set. We could implement this strategy as well, but it would not be very efficient given the data structures that we use internally (i.e. numpy arrays).

best, Peter

2013/3/7 Issam issamo...@gmail.com: Evening Dear Developers! I'm peculiarly getting an error while using AdaBoostClassifier with GaussianNB() as a base estimator. These are my commands:

In [65]: gnb = GaussianNB()
In [66]: bdt = AdaBoostClassifier(gnb, n_estimators=100)
In [67]: bdt.fit(X, y)

I get the following error after executing In [67]: TypeError: fit() got an unexpected keyword argument 'sample_weight' Any reason why I might be getting this? PS: I frequently use AdaBoost with naive Bayes as a base classifier in WEKA, hence the concern :) Thank you very much! Best regards, --Issam Laradji

-- Peter Prettenhofer
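The Weka-style fallback Peter describes - re-sampling the training set according to the boosting weights instead of passing ``sample_weight`` - can be sketched in a few lines (a hypothetical workaround, not scikit-learn API; the data is made up):

```python
import numpy as np

def resample_by_weight(X, y, sample_weight, random_state=0):
    """Draw a bootstrap sample where each row's draw probability is its weight."""
    rng = np.random.RandomState(random_state)
    p = np.asarray(sample_weight, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx]

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 1])
w = np.array([0.1, 0.1, 0.1, 0.1, 5.0])  # last sample heavily weighted

# Rows with larger weight are drawn more often in expectation, so an
# unweighted learner fit on (Xr, yr) approximates a weighted fit.
Xr, yr = resample_by_weight(X, y, w)
print(Xr.shape)  # (5, 2)
```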
Re: [Scikit-learn-general] one class svm probability
libsvm does not support probability outputs for one-class SVM. One-class SVM is an algorithm for support estimation (not proper density estimation) - i.e. you get a decision on whether P(X) > t, where the threshold t is somewhat concealed in the nu parameter.

2013/3/5 Lars Buitinck l.j.buiti...@uva.nl: 2013/3/5 Bill Power bill.power...@gmail.com: Investigating previous versions, I saw that probability was available in version 0.9 with the predict_proba and predict_log_proba functions http://scikit-learn.org/0.9/modules/generated/sklearn.svm.OneClassSVM.html but it's not here in the stable version http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html

The methods never worked, so they were pruned in a refactoring round.

-- Lars Buitinck Scientific programmer, ILPS University of Amsterdam

-- Peter Prettenhofer
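Although ``predict_proba`` is gone, ``decision_function`` still gives an uncalibrated score (signed distance to the learned boundary) that can rank points by outlierness - a quick sketch on made-up data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                  # inliers around the origin
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])  # one inlier, one clear outlier

clf = OneClassSVM(nu=0.1, gamma=0.5).fit(X_train)

# predict: +1 inside the estimated support, -1 outside
print(clf.predict(X_test))  # [ 1 -1]

# decision_function: a confidence score, but not a probability
scores = clf.decision_function(X_test)
print(scores[0] > scores[1])  # True
```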
Re: [Scikit-learn-general] How to load data into scikits
Hi David, I recommend that you load the data using pandas (``pandas.read_csv``). Scikit-learn does not support categorical features out of the box; you need to encode them as dummy variables (aka one-hot encoding). You can do this either using ``sklearn.feature_extraction.DictVectorizer`` or via ``pandas.get_dummies``.

HTH, Peter

2013/2/27 David Montgomery davidmontgom...@gmail.com: Hi, I have a data structure that looks like this:

1 NewYork 1 6 high
0 LA 3 4 low
...

I am trying to predict probability, where Y is column one. All of the attributes in X are categorical, and I will use a dtree regression. How do I load this data into y and X? Thanks

-- Peter Prettenhofer
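The DictVectorizer route can be sketched on David's two example rows (the column names are made up for illustration; string values get one-hot encoded as "name=value" columns, numeric values pass through unchanged):

```python
from sklearn.feature_extraction import DictVectorizer

rows = [
    {"city": "NewYork", "f1": 1.0, "f2": 6.0, "level": "high"},
    {"city": "LA", "f1": 3.0, "f2": 4.0, "level": "low"},
]
y = [1, 0]  # the first column of the original file

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

print(X.shape)  # (2, 6): city=LA, city=NewYork, f1, f2, level=high, level=low
```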
Re: [Scikit-learn-general] How to load data into scikits
2013/2/27 David Montgomery davidmontgom...@gmail.com: Ok... now I am really confused on how to interpret the tree. So... I am trying to build a probability estimation tree. All of the independent variables are categorical, and I created dummies. What is throwing me off are the <=. I should have a rule that says e.g. if city=LA,NY and TIME=Noon then .20. In the chart I see city=Dubai <= .5000. What does that mean?

city=Dubai <= 0.5 means that if the indicator variable city=Dubai is smaller than 0.5 (i.e. if city=Dubai is 0), then examples get routed down the left child, otherwise they get routed down the right child.

What I am trying to see is a chart that I would usually see in SPSS Answer Tree or SAS etc.

Since both SPSS and SAS are proprietary, I've no clue how those look.

So... how do I interpret city=Dubai <= .5000?

The split node basically asks: is the city feature not Dubai? If so, go down left, else right. In order to generate rules from decision trees you have to look at a whole path (from root to leaf). Currently, there is no way of extracting rules from decision trees - you have to write your own code that analyzes the tree structure.

My aim is to get a node id and to create SQL rules to extract data. Unless I am wrong, it appears that the dtree algo is not designed to extract rules or even assign a rule to a node id. Dtrees in scikits are solely for prediction. Is this a fair statement?

Correct - scikit-learn is mostly a machine learning library; in fact, AFAIK you were the first user to request such a feature.

I will be taking the *.dot file, not to graph, but to somehow parse the file so I can create my rules.

Better to operate on the DecisionTreeRegressor/Classifier.tree_ object. It represents the binary decision tree as a number of parallel arrays; you can find the documentation/code here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38

best, Peter

Thanks

On Wed, Feb 27, 2013 at 11:57 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Looks good to me - save the output to a file (e.g. foobar.dot) and run the following command:

$ dot -Tpdf foobar.dot -o foobar.pdf

When I open the pdf, all labels are displayed correctly. Remember that they are now indicator features, so the thresholds usually look like country=AU <= 0.5. You can find more information here: http://scikit-learn.org/dev/modules/tree.html#classification

2013/2/27 David Montgomery davidmontgom...@gmail.com: Thanks. I used DictVectorizer(). I am now trying to add labels to the tree graph. Below are the labels and the digraph Tree. However, I don't see labels on the tree nodes. Did I not use feature names correctly?

measurements = [
    {'country': 'US', 'city': 'Dubai'},
    {'country': 'US', 'city': 'London'},
    {'country': 'US', 'city': 'San Fransisco'},
    {'country': 'US', 'city': 'Dubai'},
    {'country': 'AU', 'city': 'Mel'},
    {'country': 'AU', 'city': 'Sydney'},
    {'country': 'AU', 'city': 'Mel'},
    {'country': 'AU', 'city': 'Sydney'},
    {'country': 'AU', 'city': 'Mel'},
    {'country': 'AU', 'city': 'Sydney'},
]
y = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
vec = DictVectorizer()
X = vec.fit_transform(measurements)
feature_name = vec.get_feature_names()
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X.todense(), y)
with open("au.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f, feature_names=feature_name)

feature_name = ['city=Dubai', 'city=London', 'city=Mel', 'city=San Fransisco', 'city=Sydney', 'country=AU', 'country=US']

digraph Tree {
0 [label="country=AU <= 0.5000\nerror = 2.1\nsamples = 10\nvalue = [ 0.7]", shape=box] ;
1 [label="city=Dubai <= 0.5000\nerror = 0.75\nsamples = 4\nvalue = [ 0.25]", shape=box] ;
0 -> 1 ;
2 [label="error = 0.0000\nsamples = 2\nvalue = [ 0.]", shape=box] ;
1 -> 2 ;
3 [label="error = 0.5000\nsamples = 2\nvalue = [ 0.5]", shape=box] ;
1 -> 3 ;
4 [label="error = 0.0000\nsamples = 6\nvalue = [ 1.]", shape=box] ;
0 -> 4 ;
}
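Peter's suggestion to analyze the ``tree_`` structure directly can be sketched as a small rule printer walking the parallel arrays (feature names and training data are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

tree = clf.tree_
feature_names = ["city=Dubai", "country=AU"]  # hypothetical dummy columns

def rules(node=0, path=()):
    """Yield one (conditions, value) pair per leaf, root-to-leaf."""
    if tree.children_left[node] == -1:  # -1 marks a leaf node
        yield path, tree.value[node]
        return
    name = feature_names[tree.feature[node]]
    thr = tree.threshold[node]
    yield from rules(tree.children_left[node], path + (f"{name} <= {thr:.2f}",))
    yield from rules(tree.children_right[node], path + (f"{name} > {thr:.2f}",))

# Each printed line is one root-to-leaf rule, ready to turn into SQL.
for conditions, value in rules():
    print(" AND ".join(conditions), "->", value.ravel())
```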
Re: [Scikit-learn-general] exporting/printing boost classifiers weaklearners
Hi, you should look into partial dependence plots [1] - they summarize the effect of certain features on the target response. Currently, our PDPs only support GradientBoostingRegressor/Classifier.

[1] http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence

best, Peter

2013/2/26 jo...@biociphers.org: Hello, I have been looking for a way to export boost classifiers. I know that I could print all the trees, but with 100 estimators that is not a good idea. I was thinking of summarizing the model by printing the weak learners and their weights. Is there an easy way to do that? Thanks for all, Jordi

-- Peter Prettenhofer
Re: [Scikit-learn-general] Packaging large objects
@ark: for 500K features and 3K classes your coef_ matrix will be:

500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB

coef_ is stored as a dense matrix - you might get a considerably smaller matrix if you use sparse regularization (which keeps most coefficients at zero) and convert the coef_ array to a scipy sparse matrix prior to saving the object - this should cut your storage costs by a factor of 10-100.

To check the sparsity of ``coef_`` use::

    sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size)

To convert the coef_ array do::

    clf = ...  # your fitted model
    clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

Prediction currently doesn't work (it raises an error) when coef_ is a sparse matrix rather than a numpy array - this is a bug in sklearn that should be fixed - I'll submit a PR for it. In the meanwhile, please convert back to a numpy array or patch the SGDClassifier.decision_function method (adding ``dense_output=True`` when calling ``safe_sparse_dot`` should do the trick).

best, Peter

PS: I strongly recommend using sparse regularization (penalty='l1' or penalty='elasticnet') - this should cut the fraction of non-zero coefficients significantly.

2013/2/22 Ark 4rk@gmail.com: You could cut that in half by converting coef_ and optionally intercept_ to np.float32 (that's not officially supported, but with the current implementation it should work): clf.coef_ = clf.coef_.astype(np.float32) You could also try the HashingVectorizer in sklearn.feature_extraction and see if performance is still acceptable with a small number of features. That also skips storing the vocabulary, which I imagine will be quite large as well.

HashingVectorizer might indeed save some space... will test for an acceptable answer...

(I hope you meant 12000 documents *per class*?)

:( Unfortunately, no, I have 12000 documents in all... at least as a starting point. Initially it is just to collect metrics; as time goes on, more documents per category will be added. Besides, I am also limited on training time, which seems to go over an hour as the number of samples goes up... [My very first attempt was with 200k documents]. Thanks for the suggestions.
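The factor-of-10-100 claim is easy to check on a toy coefficient matrix (a sketch; the 99% zero density is an assumption standing in for an L1-regularized model, and the shape is scaled down from ark's):

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
coef = rng.randn(3000, 5000)              # stand-in for clf.coef_ (classes x features)
coef[rng.rand(*coef.shape) < 0.99] = 0.0  # ~99% zeros, as after L1 regularization

dense_bytes = coef.nbytes
coef_sparse = sparse.csr_matrix(coef)
sparse_bytes = (coef_sparse.data.nbytes
                + coef_sparse.indices.nbytes
                + coef_sparse.indptr.nbytes)

density = coef_sparse.nnz / float(coef.size)
print(density < 0.02)                   # True: only ~1% of entries survive
print(dense_bytes / sparse_bytes > 10)  # True: well over a 10x saving
```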
Re: [Scikit-learn-general] Packaging large objects
I just opened a PR for this issue: https://github.com/scikit-learn/scikit-learn/pull/1702 2013/2/22 Peter Prettenhofer peter.prettenho...@gmail.com: @ark: for 500K features and 3K classes your coef_ matrix will be: 50 * 3000 * 8 / 1024. / 1024. ~= 11GB Coef_ is stored as a dense matrix - you might get a considerable smaller matrix if you use sparse regularization (keeps most coefficients zero) and convert the coef_ array to a scipy sparse matrix prior to saving the object - this should cut your store costs by a factor of 10-100. To check the sparsity of ``coef_`` use:: sparsity = lambda clf: clf.coef_.nonzero()[1].shape[0] / float(clf.coef_.size) To convert the coef_ array do:: clf = ... # your fitted model clf.coef_ = scipy.sparse.csr_matrix(clf.coef_) Prediction doesn't work currently (raises an Error) when coef_ is a sparse matrix rather than an numpy array - this is a bug in sklearn that should be fixed - I'll submit a PR for this. In the meanwhile please convert back to a numpy array or patch the SGDClassifier.decision_function method (adding ``dense_output=True`` when calling ``safe_sparse_dot`` should do the trick). best, Peter PS: I strongly recommend using sparse regularization (using penatly='l1' or penalty='elasticnet') - this should cut your sparsity significantly. 2013/2/22 Ark 4rk@gmail.com: You could cut that in half by converting coef_ and optionally intercept_ to np.float32 (that's not officially supported, but with the current implementation it should work): clf.coef_ = np.astype(clf.coef_, np.float32) You could also try the HashingVectorizer in sklearn.feature_extraction and see if performance is still acceptable with a small number of features. That also skips storing the vocabulary, which I imagine will be quite large as well. HashingVectorizer might indeed save some space...will test for acceptable answer... (I hope you meant 12000 document *per class*?) 
:( Unfortunately, no, I have 12000 documents in all... at least as a starting point. Initially it is just to collect metrics, and as time goes on, more documents per category will be added. Besides, I am also limited on training time, which seems to go over an hour as the number of samples goes up. [My very first attempt was with 200k documents]. Thanks for the suggestions. -- Everyone hates slow websites. So do we. Make your web apps faster with AppDynamics Download AppDynamics Lite for free today: http://p.sf.net/sfu/appdyn_d2d_feb ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general -- Peter Prettenhofer
Re: [Scikit-learn-general] Packaging large objects
http://xkcd.com/394/ 2013/2/22 Olivier Grisel olivier.gri...@ensta.org: 2013/2/22 Peter Prettenhofer peter.prettenho...@gmail.com: @ark: for 500K features and 3K classes your coef_ matrix will be: 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB Nitpicking, that will be: 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GiB or: 500000 * 3000 * 8 / 1e9 ~= 12GB But nearly everybody is making the mistake... http://en.wikipedia.org/wiki/Gibibyte -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Peter Prettenhofer
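The GiB-vs-GB nitpick is easy to reproduce: the same byte count, divided by binary versus decimal unit factors.

```python
# The 500K-features x 3K-classes arithmetic from the thread, in both units.
n_bytes = 500000 * 3000 * 8      # float64 coefficients, 8 bytes each
gib = n_bytes / 1024.0 ** 3      # binary unit: gibibytes
gb = n_bytes / 1e9               # decimal unit: gigabytes
```

The same 12 billion bytes read as ~11.2 GiB or 12.0 GB depending on the unit, which is exactly the discrepancy Olivier points out.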
Re: [Scikit-learn-general] Random forests: Measuring information gain in multi-output
Hi Lukas, the impurity (in your case entropy) is simply averaged over all outputs - see [1] - the code is written in Cython (a Python dialect that compiles to C). best, Peter [1] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1482 2013/2/4 Ribonous ribonucle...@gmail.com: I think I understand how a random forest classifier works in the univariate case. Unfortunately I haven't found much information about how to implement a random forest classifier in the multi-output case. How does the random forest classifier in sklearn measure the information gain for a given split in the multi-output case? Can anyone point me to references on this? Also, is the random forest implementation written in Python or another language? Thanks, Lukas -- Peter Prettenhofer
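A minimal runnable illustration of the multi-output case: when ``Y`` has two columns, the entropy at each candidate split is computed per output and averaged, and predictions come back with one column per output.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Two output columns -> the impurity is averaged across outputs at each split.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
Y = np.column_stack([(X[:, 0] > 0.5).astype(int),
                     (X[:, 1] > 0.5).astype(int)])

clf = RandomForestClassifier(n_estimators=10, criterion='entropy',
                             random_state=0).fit(X, Y)
pred = clf.predict(X)   # one prediction per output column
```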
Re: [Scikit-learn-general] Using sklearn in Hadoop
Cool example - thanks Nick! 2013/2/4 Robert Kern robert.k...@gmail.com: On Mon, Feb 4, 2013 at 2:50 PM, Nick Pentreath nick.pentre...@gmail.com wrote: @Robert sorry for the delay in responding, I was away on vacation. Here's a link to a gist of a very simple implementation of parallelized SGD using Spark (https://gist.github.com/4707012). It basically replicates the existing Spark logistic regression example, but using sklearn's linear_model module. However, the approach used is iterative parameter mixtures (where the local weight vectors are averaged and the resulting weight vector rebroadcast) as opposed to distributed gradient descent (where the local gradients are aggregated, a gradient step taken on the master, and the weight vector rebroadcast) - see http://faculty.utpa.edu/reillycf/courses/CSCI6175-F11/papers/nips2010mannetal.pdf for some details. Very cool. Thanks! -- Robert Kern -- Peter Prettenhofer
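The core of the iterative-parameter-mixture step is just averaging the workers' weight vectors; the driver then rebroadcasts the mixture for the next pass. A sketch with made-up local weight vectors (the actual per-shard training is omitted):

```python
import numpy as np

# Hypothetical weight vectors from three workers, each trained on one shard.
local_weights = [np.array([0.2, 1.0]),
                 np.array([0.4, 0.8]),
                 np.array([0.3, 0.9])]

# Iterative parameter mixture: average the vectors, rebroadcast the result.
mixed = np.mean(local_weights, axis=0)
```

In the Spark version, `local_weights` would come from a map over partitions and the averaging from a reduce on the driver.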
Re: [Scikit-learn-general] adaptive learning rate?
no - SGDClassifier/SGDRegressor does not support per-feature learning rates. 2013/1/28 Ronnie Ghose ronnie.gh...@gmail.com: Is there an adaptive learning rate per feature in sklearn? E.g. --adaptive: use per-feature adaptive learning rates; this is sensible for highly diverse and variable features from https://github.com/JohnLangford/vowpal_wabbit/wiki/Malicious-URL-example -- Peter Prettenhofer
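For readers wondering what vowpal wabbit's `--adaptive` flag does: it is AdaGrad-style scaling, where each feature's step size shrinks with its own accumulated squared gradient. A hedged numpy sketch on squared loss - this is an illustration, not an sklearn feature:

```python
import numpy as np

# AdaGrad-style per-feature learning rates on a noiseless linear problem.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X.dot(w_true)

w = np.zeros(3)
g_sq = np.zeros(3)                 # per-feature sum of squared gradients
eta = 0.5
for _ in range(3):                 # a few passes over the data
    for xi, yi in zip(X, y):
        grad = (w.dot(xi) - yi) * xi          # squared-loss gradient
        g_sq += grad ** 2
        w -= eta * grad / (np.sqrt(g_sq) + 1e-8)  # per-feature step size

mse = np.mean((X.dot(w) - y) ** 2)
```

Rarely-seen features keep a large effective step size while frequent ones decay, which is why this helps with "highly diverse and variable features".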
Re: [Scikit-learn-general] Using sklearn in Hadoop
Hi Jaganadh, I once used Hadoop to implement grid search / multi-task learning with hadoop streaming. The setup was fairly simple: I put the serialized dataset (joblib dump) on HDFS and created an input file - one line for each parameter setting for grid search. The map script deserialized the dataset from HDFS (in the init of the script), and for each map task (= parameter setting) it trained a model, computed the prediction error, and emitted it. You can find some of the code here [1]. I used Hadoop because I had a Hadoop cluster at my disposal - nowadays I'd use IPython.parallel and starcluster instead - much simpler IMHO. best, Peter [1] https://github.com/pprett/nut/blob/master/nut/structlearn/dumbomapper.py (this is the mapper script; the code which creates the input files and puts everything onto HDFS is in the auxstrategy.py file) 2013/1/23 JAGANADH G jagana...@gmail.com: Hi All, Has anybody tried using sklearn with Hadoop/Dumbo or hadoop streaming? Please share your thoughts and experience. Best regards -- JAGANADH G http://jaganadhg.in ILUGCBE http://ilugcbe.org.in -- Peter Prettenhofer
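The "one line per parameter setting" input file Peter describes can be generated like this - a hedged sketch with made-up grid values; each hadoop-streaming mapper would then parse one line, train with those parameters on the HDFS-cached dataset, and emit the validation error:

```python
import itertools
import json

# Hypothetical parameter grid for the streaming job's input file.
grid = {'C': [0.1, 1.0, 10.0], 'penalty': ['l1', 'l2']}

keys = sorted(grid)
# One JSON line per point of the Cartesian product of the grid.
lines = [json.dumps(dict(zip(keys, vals)))
         for vals in itertools.product(*(grid[k] for k in keys))]
```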
Re: [Scikit-learn-general] ANN: scikit-learn 0.13 released!
according to the help, the error msg shows up when the form creator stopped collecting responses by unchecking "Accepting responses" in the Form menu (under the Tools menu). [1] [1] http://support.google.com/drive/bin/answer.py?hl=en&answer=1715669 2013/1/22 Mathieu Blondel math...@mblondel.org: The link to the survey doesn't work. Mathieu -- Peter Prettenhofer
Re: [Scikit-learn-general] Gradient boosting complexity
2013/1/13 Erik Bernhardsson erikb...@spotify.com: Just a quick question about the gradient boosting in scikit-learn. We have tons of data to regress on (like 100M data points), but the running time of the algorithm is linear in the size of X no matter what subsample is set to. Hi Erik, the problem pertains not to gradient boosting but to our (current) decision tree implementation. We use a bit mask (aka sample_mask) to represent partitions of X. As you said, the algorithm is actually linear in len(X) but only considers rows of X for which sample_mask is True [1] - so ``subsample == 0.5`` should run faster than ``subsample == 1.0``, but it's slower than passing X_subsample = X[np.random.rand(len(X)) < 0.5] directly to the fit method. When the ``sample_mask`` gets too sparse (i.e. too many entries are False), the algorithm spends most of its time checking the sample_mask - not very efficient. Hence, we use a heuristic to make sure that when the sample_mask gets too sparse (see ``min_density`` parameter) we copy X and discard all rows where sample_mask is False [2] - this, however, incurs both memory and runtime costs which have to be amortized. Since trees in gradient boosting are usually shallow, I decided to turn off this heuristic (see [3]) - please try setting ``self.min_density = 0.1`` and test whether you get a performance increase. If ``subsample`` is smaller than ``min_density``, each tree will trigger a copy of X. We (Brian, Gilles, Andy, and me) are not totally happy with our current sample_mask-based tree implementation - personally, I think it can be sped up considerably - but I think removing the sample_mask would require a complete re-write of the tree building procedure. The crux is to represent partitions efficiently while keeping auxiliary data structures (i.e. X_argsorted - a DS that holds, for each feature, the list of examples sorted by ascending feature value) in sync. We have discussed various approaches to get rid of our sample_mask approach in this issue [4].
If you want to leverage the whole dataset (100M) you might want to explore a different approach as well: you could take a subsample (100k) and train a GBRT (e.g. 1000 trees) on that; then you can use this GBRT as a non-linear feature detector and augment each of the 100M examples with 1000 new features given by the output of each tree in the GBRT model. Now you can feed the new dataset into a linear model that scales to such a large dataset (e.g. vowpal wabbit). best, Peter [1] see function _smallest_sample_larger_than; https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1826 [2] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L511 [3] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L563 [4] https://github.com/scikit-learn/scikit-learn/issues/964 (closed - discussion continues in https://github.com/scikit-learn/scikit-learn/issues/1435 ) Right now we just sample say 100k data points and run gradient boosting on it, but it would be nice if we could use a much larger data set. See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L587 for the code - basically instead of subsampling, the algorithm just creates a random binary mask. It would be nice if it were linear in len(X) * subsample, because then we could set subsample to a very small number and use a lot more data points. That should reduce overfitting with no real disadvantages (afaik). I'm new to gradient boosting and I don't know it that well. Is there a fundamental reason why you can't make it linear in len(X) * subsample? Otherwise I might try to put together a patch for it. Thanks!
-- Peter Prettenhofer
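The GBRT-as-feature-detector idea can be sketched with today's scikit-learn API (``apply()`` did not exist when this thread was written, so take this as an assumption-laden illustration, not the thread's original code): fit on a subsample, then use each tree's leaf index as a new categorical feature for every row.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X[:, 0] + 0.1 * rng.randn(200)

# Fit the GBRT on a subsample only (stand-in for the "100k of 100M" idea).
gbrt = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                 random_state=0)
gbrt.fit(X[:100], y[:100])

# apply() returns the leaf id per (sample, tree); one-hot encoding these
# yields sparse features for a scalable linear learner (e.g. vowpal wabbit).
leaves = gbrt.apply(X)
```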
Re: [Scikit-learn-general] Multivariate Adaptive Regression Splines (MARS, aka earth)
2013/1/10 Lars Buitinck l.j.buiti...@uva.nl: 2013/1/10 Jason Rudy ja...@clinicast.net: I'm working on an implementation of MARS [1] that I'd like to share, and it seems like sklearn would be a good place for it. The MARS algorithm is currently available as part of the R package earth and is one of the only reasons I still use R. Would sklearn be a good place for such an algorithm? Are there any guidelines or procedures I should be aware of before contributing? I'd love to see MARS in sklearn - is your implementation currently publicly available? I guess that would fit in scikit-learn, but I'm not an expert on fancy regression analysis. The contributor guidelines can be found here: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md In addition, make sure that (1) you own the code or your employer is ok with you publishing it under BSD license terms, and (2) apparently MARS is a trademark, so call the estimator something else, like EarthRegressor or MARegressionSplines. -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam -- Peter Prettenhofer
Re: [Scikit-learn-general] GridSearchCV does not work with SGDRegressor
great - thanks Andy! 2013/1/8 Andreas Mueller amuel...@ais.uni-bonn.de: On 01/08/2013 09:57 AM, Andreas Mueller wrote: On 01/08/2013 09:49 AM, Ronnie Ghose wrote: yay :) Sorry, I was too fast. that was not the problem :( D'oh. yes it was. Double d'oh. I need to get some coffee, sorry -- Peter Prettenhofer
Re: [Scikit-learn-general] Upgraded jenkins environment for matplotlib testing
thanks! 2012/12/4 Andreas Mueller amuel...@ais.uni-bonn.de: On 04.12.2012 12:35, Olivier Grisel wrote: I have updated the virtualenvs of the jenkins vm to use: - ubuntu LTS matplotlib 0.99.1 on python 2.6 - latest stable matplotlib 1.2.0 on python 2.7 Thanks a lot :) -- Peter Prettenhofer
Re: [Scikit-learn-general] Shape of classes_ varies?
I assume this is because they support multiple outputs; let's keep @gilles posted. 2012/11/29 Doug Coleman doug.cole...@gmail.com: I forgot to include the line where I fit clf1. -- Peter Prettenhofer
[Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn
Some more benchmarks from wise.io: http://continuum.io/blog/wiserf-use-cases-and-benchmarks quite impressive indeed - unfortunately I cannot post any comments on the blog - I wonder if they use some sort of binned split evaluation [1] instead of exact split evaluation (wiseRF has slightly lower accuracy scores). Maybe they want to contribute their code :-D best, Peter [1] http://hunch.net/~large_scale_survey/TreeEnsembles.pdf -- Peter Prettenhofer
Re: [Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn
2012/11/28 Andreas Mueller amuel...@ais.uni-bonn.de: On 28.11.2012 16:46, Mathieu Blondel wrote: On Thu, Nov 29, 2012 at 12:33 AM, Andreas Mueller amuel...@ais.uni-bonn.de wrote: Do you see where the sometimes 100x comes from? Not from what he demonstrates, right? scikit-learn is really bad when n_jobs=10. I would be interested in knowing if the performance gains are mostly coming from the fact that wiseRF is written in C++ or if they had to use algorithmic improvements. Why should C++ be any faster than Cython? amongst others: template metaprogramming - see http://lingpipe-blog.com/2011/07/01/why-is-c-so-fast/ if the input data is float64 you need to take conversion to float32 into account; furthermore sklearn will convert to fortran layout - this will give a huge penalty in memory consumption. Templating number of bins in leafs? Maybe they learned a model to pick good default values for the forest for a dataset ;) in terms of algorithms and split point evaluation: different strategies are more appropriate for different feature types (lots vs. few split points); -- Peter Prettenhofer
Re: [Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn
2012/11/28 Mathieu Blondel math...@mblondel.org: scikit-learn's RF is entirely written in Python (forest.py) so there may still be some slow code paths. Moreover, their parallel implementation is probably written with pthreads or OpenMP, so they bypass the problems that we have with Python's multiprocessing module. I think this overhead is marginal - at the end of the day most time is spent on building the trees, and there is certainly room for improvement there. -- Peter Prettenhofer
Re: [Scikit-learn-general] Problem unpickling 0.11 RF model in 0.12/0.13
Hi Nicolas, unfortunately the two versions are not compatible - we made some modifications (speed enhancements) to the tree module in version 0.12 that break serialization with older versions. The only way to tackle this is to do as Leon proposed: extract the state of the old trees (sklearn.tree.tree.Tree), create new ones, and copy the state over (see sklearn.tree._tree.Tree attributes). I haven't done this myself, to be honest, and I would rather retrain a new RandomForest. Sorry for the inconvenience caused. best, Peter 2012/11/20 Leon Palafox leonoe...@gmail.com: I'm not a developer, but a fast, ugly solution (well, I do not know how fast) would be to do a script that unpacks everything using the old sklearn and repacks it using the new one. Best On Tue, Nov 20, 2012 at 6:23 PM, Fechner, Nikolas nikolas.fech...@novartis.com wrote: Hi all, I've built a random forest model using scikit-learn 0.11 and stored it for subsequent application as a pickled file using sklearn.externals.joblib. Now, I have started looking into a migration scheme to later scikit-learn versions and noticed that it is apparently not possible to unpickle the stored model using scikit-learn 0.12 or 0.13.
This is the error I get (reproducible on Mac and Linux systems):

/Library/Python/2.7/site-packages/scikit_learn-0.12.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/numpy_pickle.pyc in load(filename, mmap_mode)
    416
    417     try:
--> 418         obj = unpickler.load()
    419     finally:
    420         if hasattr(unpickler, 'file_handle'):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load(self)
    856         while 1:
    857             key = read(1)
--> 858             dispatch[key](self)
    859     except _Stop, stopinst:
    860         return stopinst.value

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load_global(self)
   1088         module = self.readline()[:-1]
   1089         name = self.readline()[:-1]
-> 1090         klass = self.find_class(module, name)
   1091         self.append(klass)
   1092     dispatch[GLOBAL] = load_global

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in find_class(self, module, name)
   1124         __import__(module)
   1125         mod = sys.modules[module]
-> 1126         klass = getattr(mod, name)
   1127         return klass
   1128

AttributeError: 'module' object has no attribute '_find_best_split'

Is this something that could be fixed somehow, and more importantly, is it to be expected that it will be an ongoing problem that loading models built with previous versions causes problems? Many thanks in advance for any comments. Cheers, Nikolas -- Leon Palafox, M.Sc PhD Candidate Iba Laboratory +81-3-5841-8436 University of Tokyo Tokyo, Japan.
-- Peter Prettenhofer
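Until pickles are portable across versions, a defensive pattern is to persist the library version next to the model and refuse to load on a mismatch, retraining instead. This is a sketch of a convention, not an official sklearn mechanism; the payload layout is made up.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = (X[:, 0] > 0.5).astype(int)
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Store the training-time version alongside the estimator.
path = os.path.join(tempfile.gettempdir(), 'rf_model.joblib')
joblib.dump({'model': clf, 'sklearn_version': sklearn.__version__}, path)

payload = joblib.load(path)
if payload['sklearn_version'] != sklearn.__version__:
    raise RuntimeError('version mismatch - retrain instead of loading')
model = payload['model']
```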
Re: [Scikit-learn-general] func with args float / double to C f_f32 / f_f64
2012/11/19 denis denis-bz...@t-online.de: Folks, from a python function with args that may be float or double I want to call a corresponding C function f_f32 or f_f64. Is there a better way than cython like cdef extern from ...: int f_f32( float* A, float* B ) int f_f64( double* A, double* B ) ... def func_float_or_double( np.ndarray A, np.ndarray B ): assert A.dtype is B.dtype if A.dtype.name == 'float32': return f_f32( A, B ) elif A.dtype.name == 'float64': return f_f64( A, B ) ... Try calling the C functions with the data buffers of ``A`` and ``B``, casting to the matching pointer type:: def func_float_or_double( np.ndarray A, np.ndarray B ): assert A.dtype == B.dtype if A.dtype.name == 'float32': return f_f32( <float*> A.data, <float*> B.data ) elif A.dtype.name == 'float64': return f_f64( <double*> A.data, <double*> B.data ) (This may be more of a cython question, but you sklearn people must do this often?) thanks, cheers -- denis -- Peter Prettenhofer
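The same dtype-based dispatch can be tested in pure Python without compiling anything; here ``f_f32``/``f_f64`` are hypothetical stand-ins for the external C routines in the question.

```python
import numpy as np

# Hypothetical per-dtype implementations standing in for the C functions.
def f_f32(a, b):
    return np.float32(a.sum() + b.sum())

def f_f64(a, b):
    return np.float64(a.sum() + b.sum())

# Map each supported dtype to its implementation.
_DISPATCH = {np.dtype(np.float32): f_f32, np.dtype(np.float64): f_f64}

def func_float_or_double(a, b):
    assert a.dtype == b.dtype
    return _DISPATCH[a.dtype](a, b)
```

A dict keyed on ``np.dtype`` avoids the if/elif chain and raises a clear ``KeyError`` for unsupported dtypes.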
Re: [Scikit-learn-general] RandomForest benchmark
Olivier, I tested it with the joblib PR - results got a bit worse, see below. best, Peter

arcene            r               py
score    0.2700 (0.03)   0.2633 (0.02)
train    3.9454 (0.09)   4.6661 (0.20)
test     0.2199 (0.00)   0.2985 (0.05)

landsat           r               py
score    0.0255 (0.00)   0.0552 (0.00)
train    2.3184 (0.02)   3.8349 (0.06)
test     0.1129 (0.00)   0.3513 (0.01)

spam              r               py
score    0.0549 (0.00)   0.0664 (0.00)
train    1.6380 (0.01)   2.1307 (0.02)
test     0.0379 (0.00)   0.3311 (0.00)

random_gaussian   r               py
score    0.1449 (0.00)   0.1487 (0.01)
train    0.3371 (0.01)   1.3574 (0.04)
test     0.1502 (0.00)   0.3247 (0.05)

madelon           r               py
score    0.4061 (0.01)   0.3867 (0.02)
train    10.0216 (0.08)  10.4346 (0.08)
test     0.0980 (0.00)   0.3221 (0.02)

2012/11/17 Olivier Grisel olivier.gri...@ensta.org: You can retry by replacing the sklearn/externals/joblib folder with the joblib folder of this branch: https://github.com/joblib/joblib/pull/44 -- Peter Prettenhofer
Re: [Scikit-learn-general] set target_names / importance of features in a trained model
2012/11/12 paul.czodrow...@merckgroup.com: Dear SciKitters, given an array of (n_samples, n_features) - how do I assign target_names in a concluding step? The target_names are stored in a list and, of course, have the same order as the n_features vector. In a next step, I would like to dump out the importance of the most relevant features. How can this be done in scikit-learn? In particular, I have trained a random forest and would like to dump out the leaves of this RF. Hi Paul, this example shows how to access the feature importances computed by a RF:: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py In order to access the ``feature_importances_`` attribute you have to pass the argument ``compute_importances=True``. To access the leaves of a decision tree you need to access the ``tree_`` attribute of a DecisionTree - it is basically a collection of parallel arrays that represent the tree (see sklearn.tree._tree). All indices where ``children_left`` and ``children_right`` are -1 are leaves. best, Peter Cheers & Thanks, Paul
--
Peter Prettenhofer
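Peter's pointers above can be sketched in code. A minimal sketch on synthetic data, assuming a recent scikit-learn where ``feature_importances_`` is available without the older ``compute_importances=True`` argument; the feature names are hypothetical stand-ins for Paul's target_names list:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the (n_samples, n_features) array.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
names = ["f%d" % i for i in range(10)]  # stand-in for the target_names list

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Pair each feature name with its importance, most relevant first.
ranked = sorted(zip(names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)

# Leaves of the first tree: indices in the parallel arrays of
# clf.estimators_[0].tree_ where children_left is -1.
tree = clf.estimators_[0].tree_
leaves = np.where(tree.children_left == -1)[0]
```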
Re: [Scikit-learn-general] Panda / Tree and Random Forest
Didier,

what type is ``feature`` (simply print ``type(feature)``)? Considering your
first email I suspect it's a pandas.DataFrame; scikit-learn estimators
require array-like inputs - so please do
``clf.fit(features.values, labels.values.ravel())`` instead of
``clf.fit(features, labels)``. 15 is quite a lot, but if you just want to fit
5 trees it should run in under 15 seconds (I tested using random data and
binary classification).

best,
Peter

2012/10/24 Didier Vila dv...@capquestco.com:
> Thanks, I will have a look.
>
> Didier Vila, PhD | Risk | CapQuest Group Ltd | Fleet 27 | Rye Close |
> Fleet | Hampshire | GU51 2QQ | Fax: 0871 574 2992 | Email:
> dv...@capquestco.com
>
> -----Original Message-----
> From: Andreas Mueller [mailto:amuel...@ais.uni-bonn.de]
> Sent: 24 October 2012 15:44
> To: scikit-learn-general@lists.sourceforge.net
> Subject: Re: [Scikit-learn-general] Panda / Tree and Random Forest
>
> As an addition, maybe it would be good for you to have a look into the
> tutorial: http://scikit-learn.org/dev/tutorial/basic/tutorial.html
--
Peter Prettenhofer
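The fix above can be sketched as follows. The DataFrame shapes and column names are hypothetical; note also that current scikit-learn versions accept DataFrames directly, so the ``.values`` conversion was mainly needed on the releases current at the time of this thread:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
features = pd.DataFrame(rng.rand(100, 4), columns=list("abcd"))
labels = pd.DataFrame(rng.randint(0, 2, size=(100, 1)), columns=["y"])

clf = RandomForestClassifier(n_estimators=5, random_state=0)
# .values gives the underlying numpy array; ravel() flattens the
# (n_samples, 1) label column to the 1-d shape fit() expects.
clf.fit(features.values, labels.values.ravel())
```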
Re: [Scikit-learn-general] Panda / Tree and Random Forest
2012/10/24 Didier Vila dv...@capquestco.com:
> Peter,
>
> Thanks for the email. I just started to use Pandas this morning. Features
> are integer (binary or 0-1-2-3) or real. Note that my target variable is
> continuous between 0 and 1.

Ok - then that's the problem: for regression problems you have to use
RandomForestRegressor instead of RandomForestClassifier.

best,
Peter

> I just ran your code below and I still have the same issue:
>
> clf.fit(feature.values, label.values.ravel())
>
> Regards,
> Didier
>
> Ps: the initial code worked for 100 samples.
--
Peter Prettenhofer
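Peter's fix - swapping RandomForestClassifier for RandomForestRegressor when the target is continuous - can be sketched on synthetic data; the feature encoding (integer 0-3 columns) and target range mirror Didier's description, but the shapes are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
# Integer-coded features (0-1-2-3), as described in the thread.
X = rng.randint(0, 4, size=(200, 6)).astype(float)
# Continuous target between 0 and 1 -> a regression problem.
y = rng.rand(200)

reg = RandomForestRegressor(n_estimators=5, random_state=0)
reg.fit(X, y)
pred = reg.predict(X[:5])
```

Since the regressor averages training targets in each leaf, the predictions stay within the observed [0, 1] range of the target.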