Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Jacob Vanderplas
Hi all, Interesting discussion – thanks for the thoughts and ideas! I like the idea of a Data Munging tutorial being separate from an ML tutorial. With only 3 hours, that seems more doable than trying to squeeze it all into one session. Perhaps Stephan's idea of an image-focused ML tutorial could b

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Kyle Kastner
If people are planning to work on this, it would be good to check what Andy and I presented at SciPy, which is based on what Jake and Olivier did at PyCon (and what Andy, Jake and Gael did at SciPy 2013, etc. etc.). To Sebastian's points - we covered all of these nearly verbatim except perhaps cla

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Chris Waigl
I believe a “Data cleaning and preprocessing for data science” (insert-snappier-title-here) tutorial would be a great addition to PyCon. It’s a prerequisite for machine learning, that’s for sure. A machine learning tutorial should probably not sweep it under the carpet completely, but treat it in

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Sebastian Raschka
If there is interest, I could work on something like “The Ins and Outs of a Machine Learning Pipeline — About the data that you feed to a learning algorithm and how to analyze the results” covering the topics Part 1: - class label encoding - feature encoding - feature selection vs. dimensionali

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread KAB
I agree that data munging is not, strictly speaking, a machine learning question, i.e. from the mathematical or computational point of view. But there is no denying that most of the time spent doing machine learning actually goes into data munging. So surely dealing with data has something to do with ma

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Sebastian Raschka
I totally agree with Jake. However, I also think that a few general tutorials on “preprocessing” of “clean” datasets (clean in the sense that missing values, duplicates, and outliers have been dealt with) could be useful to a broader, interdisciplinary audience. For example: - encoding class labels, enco

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Jacob Vanderplas
Hi, The problem with including data munging in the tutorial is that it's not really a machine learning question. Solutions are generally so domain-specific that you can't present them in a way that would be generally useful to an interdisciplinary audience. This is why most (all?) short machine learn

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread KAB
Hello Jake and Andy, If you would not mind some advice, I would suggest including examples (or at least one) where you use data that is not built-in. I remember the first several tutorials (if not all of them) relied completely on built-in data sets and unapologetically ignored the big elephant in

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Andy
Hi Jake. I think the tutorial Kyle and I did based on the previous tutorials was working quite well. I think it would make sense to work off our SciPy ones and improve them further. I'd be happy to work on it. We have some more exercises in a branch, and I have also improved versions of some of

Re: [Scikit-learn-general] Scalability of Gradient Boosting Classifier

2015-09-30 Thread Jacob Schreiber
Hi Maryam, Currently, no tree-based methods have a partial_fit method. We are currently working on expanding the tree module; you can see our checklist here: https://github.com/scikit-learn/scikit-learn/issues/5212 There are many methods to reduce the dimensionality of data, if you are using high
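
For readers unfamiliar with the partial_fit pattern mentioned above, the following is a minimal sketch of chunked (out-of-core) training. It is not a gradient boosting solution: it swaps in SGDClassifier, a linear model that does expose partial_fit, purely to illustrate the calling convention. The synthetic data, the TRUE_W weights, and the iter_chunks helper are assumptions made for this example; in practice each chunk would be read from disk.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

N_FEATURES = 10
# Hypothetical "true" weights, used only to label the synthetic data.
TRUE_W = np.random.RandomState(42).randn(N_FEATURES)


def iter_chunks(n_chunks, chunk_size, seed):
    """Yield (X, y) chunks; a stand-in for reading batches from disk."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, N_FEATURES)
        y = (X @ TRUE_W > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
# partial_fit needs the full set of class labels, since any single
# chunk may not contain every class.
classes = np.array([0, 1])

# Only one chunk is held in memory at a time.
for X_chunk, y_chunk in iter_chunks(n_chunks=20, chunk_size=500, seed=0):
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

# Evaluate on a separately generated chunk.
X_test, y_test = next(iter_chunks(n_chunks=1, chunk_size=1000, seed=1))
accuracy = clf.score(X_test, y_test)
print("held-out accuracy: %.3f" % accuracy)
```

Whether a linear model is an acceptable substitute depends on the problem; the sketch only shows how an estimator with partial_fit can be trained without loading the whole data set at once.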

[Scikit-learn-general] Scalability of Gradient Boosting Classifier

2015-09-30 Thread Maryam Tavakol
Dear all, I am using the Gradient Boosting Classifier from scikit-learn on a huge data set. Unfortunately, the method loads the whole data set into memory (around 45 GB!). As it is not very easy to modify the code to stream data, is there any other way to make it scalable? Best Regards, Maryam Tavak