Hi all,
Interesting discussion – thanks for the thoughts and ideas! I like the idea
of a Data Munging tutorial being separate from an ML tutorial. With only 3
hours, that seems more doable than trying to squeeze it all into one
session. Perhaps Stephan's idea of an image-focused ML tutorial could be…
If people are planning to work on this, it would be good to check what
Andy and I presented at SciPy, which is based on what Jake and Olivier
did at PyCon (and what Andy, Jake and Gael did at SciPy 2013, etc.
etc.).
To Sebastian's points - we covered all of these nearly verbatim except
perhaps class label encoding.
I believe a “Data cleaning and preprocessing for data science”
(insert-snappier-title-here) tutorial would be a great addition to a PyCon.
It’s a prerequisite for machine learning, that’s for sure. A machine learning
tutorial should probably not completely sweep it under the carpet, but treat
it in…
If there is interest, I could work on something like
“The Ins and Outs of a Machine Learning Pipeline — About the data that you feed
to a learning algorithm and how to analyze the results”
covering the following topics:
Part 1:
- class label encoding
- feature encoding
- feature selection vs. dimensionality reduction (see the short sketch below)
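To make that last distinction concrete, here is a minimal sketch (purely
illustrative, not taken from any of the existing tutorial material) using
scikit-learn's built-in iris data: SelectKBest keeps a subset of the original
columns, while PCA replaces them with new projected axes.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    iris = load_iris()
    X, y = iris.data, iris.target

    # Feature selection: keep the 2 original columns that score best
    # against the class labels; the surviving features stay interpretable.
    X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

    # Dimensionality reduction: project onto 2 new axes that are linear
    # combinations of all 4 original features; the columns lose their
    # original meaning but capture most of the variance.
    X_reduced = PCA(n_components=2).fit_transform(X)

    print(X_selected.shape, X_reduced.shape)  # (150, 2) (150, 2)

Same output shapes, very different meaning, which is exactly the point a
tutorial like this could make.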
I agree that data munging is not strictly speaking a machine learning
question, i.e. from a mathematical or computational point of view. But
there is no denying the fact that most of the time spent doing machine
learning actually goes into data munging. So surely dealing with data has
something to do with machine learning.
I totally agree with Jake. However, I also think that a few general tutorials
on “preprocessing” of “clean” datasets (clean in the sense that missing values,
duplicates, and outliers have already been dealt with) could be useful to a
broader, interdisciplinary audience. For example:
- encoding class labels, encoding categorical features, … (sketched below)
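As a rough illustration of those encoding items (a sketch only; the toy labels
and colours below are made up for the example), LabelEncoder handles the
targets and OneHotEncoder the categorical features:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # Class labels: map string targets to the integers 0..n_classes-1
    # that most estimators expect.
    y = np.array(['cat', 'dog', 'dog', 'bird'])
    label_enc = LabelEncoder()
    y_int = label_enc.fit_transform(y)   # array([1, 2, 2, 0])
    print(label_enc.classes_)            # ['bird' 'cat' 'dog']

    # Categorical features: one-hot encode the integer codes so that no
    # artificial ordering between categories is implied.
    color_codes = LabelEncoder().fit_transform(
        ['red', 'green', 'red', 'blue']).reshape(-1, 1)
    color_onehot = OneHotEncoder().fit_transform(color_codes).toarray()
    print(color_onehot)                  # one binary column per colour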
Hi,
The problem with including data munging in the tutorial is that it's not
really a machine learning question. Solutions are usually so domain-specific
that you can't present them in a way that would be generally useful to an
interdisciplinary audience. This is why most (all?) short machine learning
tutorials…
Hello Jake and Andy,
If you would not mind some advice, I would suggest including examples
(or at least one) where you use data that is not built-in. I remember
the first several tutorials (if not all of them) relied completely on
built-in datasets and unapologetically ignored the big elephant in the room.
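In that spirit, a small sketch of what a "not built-in" example could look
like; the file name and column names are placeholders (and the features are
assumed to be numeric already), but the pandas-to-scikit-learn handoff is the
part worth showing:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Read a CSV that does not ship with scikit-learn; path and column
    # names are placeholders for whatever real-world file is used.
    df = pd.read_csv("survey_responses.csv")

    X = df.drop(columns=["outcome"])   # feature columns (assumed numeric)
    y = df["outcome"]                  # target column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))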
Hi Jake.
I think the tutorial Kyle and I did based on the previous tutorials was
working quite well.
I think it would make sense to work off our SciPy ones and improve them
further.
I'd be happy to work on it.
We have some more exercises in a branch, and I have also improved
versions of some of the material…
Hi Maryam,
Currently, no tree-based methods have a partial_fit method. We are
working on expanding the tree module; you can see our checklist
here: https://github.com/scikit-learn/scikit-learn/issues/5212
There are many methods to reduce the dimensionality of the data if you are
using high-dimensional…
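For the streaming side of the question, one possible sketch (file name, label
column and chunk size are placeholders, not a recommendation from the
maintainers): since the tree-based ensembles cannot learn incrementally, a
linear model such as SGDClassifier, which does implement partial_fit, can be
trained chunk by chunk without ever loading the 45 GB at once.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    classes = np.array([0, 1])        # every label must be declared up front
    clf = SGDClassifier()             # linear model with partial_fit support

    # Stream the file in chunks so only one chunk is in memory at a time.
    for chunk in pd.read_csv("huge_dataset.csv", chunksize=100000):
        X = chunk.drop(columns=["label"]).values
        y = chunk["label"].values
        clf.partial_fit(X, y, classes=classes)

If it is the feature count rather than the row count that blows up memory,
something stateless like FeatureHasher, or IncrementalPCA (which also offers
partial_fit), can be slotted into the same loop.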
Dear all,
I am using GradientBoostingClassifier from scikit-learn on a huge dataset.
Unfortunately, the method loads the whole dataset into memory (around 45
GB!). As it is not very easy to modify the code to stream the data, is there
any other way to make it scalable?
Best Regards,
Maryam Tavak