Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-10-05 Thread Andreas Mueller
On 10/05/2015 11:30 AM, Kyle Kastner wrote:
> preprocessing was done with straight numpy, and I am 90% sure there is
> a more "sklearn approved" way to do it using FeatureUnion, etc.
Nope, not really currently. Not nicely. ColumnTransformer is not merged yet. Also OneHotEncoder is currently not i
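For readers unfamiliar with the problem discussed here, the following is a minimal plain-Python sketch of per-column preprocessing: each named column gets its own transformation and the results are joined row by row. It is not scikit-learn's API. The helper names (`standardize`, `one_hot`, `transform_columns`) and the toy table are assumptions made up for illustration only.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale a numeric column to zero mean and unit variance (one output column)."""
    mu, sigma = mean(column), pstdev(column)
    # Note: a constant column has sigma == 0 and would need special handling.
    return [[(x - mu) / sigma] for x in column]


def one_hot(column):
    """Expand a categorical column into one 0/1 column per sorted category."""
    categories = sorted(set(column))
    return [[1.0 if x == c else 0.0 for c in categories] for x in column]


def transform_columns(table, transformers):
    """Apply one transformer per named column and concatenate the outputs per row."""
    blocks = [fn(table[name]) for name, fn in transformers]
    return [sum(parts, []) for parts in zip(*blocks)]


# Hypothetical toy data, invented for this sketch.
table = {"age": [20.0, 30.0, 40.0], "sex": ["f", "m", "f"]}
X = transform_columns(table, [("age", standardize), ("sex", one_hot)])
```

Each row of `X` holds the scaled `age` value followed by the two indicator columns for `sex`, which is the kind of mixed numeric and categorical handling the thread describes as awkward at the time.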

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-10-05 Thread Kyle Kastner
I did a piece of that in the Titanic examples from the SciPy tutorial, but it could definitely use a more thorough and clear example. This version could probably be simplified/streamlined - much of my preprocessing was done with straight numpy, and I am 90% sure there is a more "sklearn approved" w

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-10-05 Thread Andreas Mueller
On 09/30/2015 05:53 PM, KAB wrote:
> s. And this is due to the special way scikit-learn requires the data
> to be presented to its objects. Last time I checked (I really don't
> know if there has been any change since then) one had to do some
> wrangling with pandas' data frames, however subtl

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-10-05 Thread Andreas Mueller
On 10/05/2015 05:59 AM, Dale Smith wrote:
> Ah, I should say that the dealing with "too large" data should be referred to
> Andreas' tutorial at PyData 2015 (I think that's where I saw it) and the
> scikit-learn website. I don't see any reason to repeat it at PyData or PyCon.
> If necessary, I

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-10-05 Thread Dale Smith
Re “The Ins and Outs of a Machine Learning Pipeline — About the data that you feed to a learning algorithm and how to analyze the results”. I'm referencing the proposed list of t

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-10-05 Thread Dale Smith
-Original Message-
From: Sebastian Raschka
Sent: Wednesday, September 30, 2015 9:13 PM
If there is interest, I could wo

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Jacob Vanderplas
Hi all, Interesting discussion – thanks for the thoughts and ideas! I like the idea of a Data Munging tutorial being separate from an ML tutorial. With only 3 hours, that seems more doable than trying to squeeze it all into one session. Perhaps Stephan's idea of an image-focused ML tutorial could b

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Kyle Kastner
If people are planning to work on this, it would be good to check what Andy and I presented at SciPy, which is based on what Jake and Olivier did at PyCon (and what Andy, Jake and Gael did at SciPy 2013, etc. etc.). To Sebastian's points - we covered all of these nearly verbatim except perhaps cla

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Chris Waigl
I believe a “Data cleaning and preprocessing for data science” (insert-snappier-title-here) tutorial would be a great addition to a PyCon. It’s a prerequisite for machine learning, that’s sure. A machine learning tutorial should probably not completely sweep it under the carpet, but treat it in

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Sebastian Raschka
If there is interest, I could work on something like “The Ins and Outs of a Machine Learning Pipeline — About the data that you feed to a learning algorithm and how to analyze the results” covering the topics
Part 1:
- class label encoding
- feature encoding
- feature selection vs. dimensionali
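As a rough illustration of the first two topics in this list, here is a minimal plain-Python sketch of class label encoding and one-hot feature encoding. It only shows the idea and is not scikit-learn's implementation. The function names and the toy values are assumptions invented for this example.

```python
def encode_labels(labels):
    """Map each distinct class label to an integer, using sorted label order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def one_hot(values):
    """Turn a categorical feature into one 0/1 column per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


# Hypothetical toy inputs, invented for this sketch.
y, classes = encode_labels(["survived", "died", "survived"])
embarked, categories = one_hot(["S", "C", "S", "Q"])
```

Label encoding suits the target, where the integers are just class identifiers. One-hot encoding suits nominal input features, because it avoids implying an order among the categories.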

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread KAB
I agree that data munging is not strictly speaking a machine learning question, i.e. from the mathematics or computational point of view. But there is no denying the fact that most time doing machine learning is actually spent on data munging. So surely dealing with data has something to do with ma

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Sebastian Raschka
I totally agree with Jake. However, I also think that a few general tutorials on “preprocessing” of “clean” datasets (clean in terms of missing values, duplicates, outliers have been dealt with) could be useful to a broader, interdisciplinary audience. For example: - encoding class labels, enco

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Jacob Vanderplas
Hi, The problem with including data munging in the tutorial is that it's not really a machine learning question. Solutions are generally so domain-specific that you can't present it in a way that would be generally useful to an interdisciplinary audience. This is why most (all?) short machine learn

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread KAB
Hello Jake and Andy, If you would not mind some advice, I would suggest including examples (or at least one) where you use data that is not built-in. I remember the first several tutorials (if not all of them) relied completely on built-in data sets and unapologetically ignored the big elephant in
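To make the point about non-built-in data concrete, here is a minimal sketch, using only the Python standard library, of turning a raw CSV into a numeric feature matrix `X` and a label list `y`. The column names, the inline data, and the `load_xy` helper are all assumptions invented for illustration. A real tutorial would more likely read a file from disk and might use pandas instead.

```python
import csv
import io

# Hypothetical raw data with one missing "age" value, invented for this sketch.
RAW = """age,fare,survived
22,7.25,0
,71.28,1
26,7.92,1
"""


def load_xy(handle, target):
    """Read a CSV into a feature matrix X (list of rows) and a label list y."""
    reader = csv.DictReader(handle)
    features = [name for name in reader.fieldnames if name != target]
    X, y = [], []
    for row in reader:
        # Empty numeric cells become NaN so a later step can impute them.
        X.append([float(row[name]) if row[name] else float("nan") for name in features])
        y.append(int(row[target]))
    return X, y, features


X, y, features = load_xy(io.StringIO(RAW), target="survived")
```

The result is one row per sample and one column per feature, with the target held separately, which is the layout that scikit-learn estimators expect.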

Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

2015-09-30 Thread Andy
Hi Jake. I think the tutorial Kyle and I did based on the previous tutorials was working quite well. I think it would make sense to work off our SciPy ones and improve them further. I'd be happy to work on it. We have some more exercises in a branch, and I have also improved versions of some of