Re: [Scikit-learn-general] SVM, appropriate size of training set

2014-02-20 Thread Kyle Kastner
One other thing to consider is that you may not need the full millions of examples to explore the decision space when tuning hyperparameters, choosing kernels, etc. while building the model. You could try randomly subsampling the data (maybe 10k-100k samples is enough? It depends on your dataset) and training on that subset first.
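A minimal sketch of that subsampling idea (the synthetic data, subset size, and parameter values are placeholders, not from the original message):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Stand-in for the real data; the poster's millions of rows would go here.
    X, y = make_classification(n_samples=200000, n_features=5, random_state=0)

    # Randomly pick ~10k rows; cheap enough to compare kernels and parameters.
    rng = np.random.RandomState(0)
    idx = rng.choice(len(X), size=10000, replace=False)
    clf = SVC(kernel='rbf', C=1.0, gamma=0.1).fit(X[idx], y[idx])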

Re: [Scikit-learn-general] SVM, appropriate size of training set

2014-02-20 Thread Mathieu Blondel
With millions of samples, LinearSVC or SGDClassifier are more appropriate. However, they only support the linear kernel. Since you have only 5 features, I think it would be worth trying non-linear features. You can try the kernel approximation module [1] and PolynomialFeatures [2]: http://scikit-le…
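A hedged sketch of both suggestions on synthetic data (component counts and degrees are illustrative, not prescribed in the message):

    from sklearn.datasets import make_classification
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=100000, n_features=5, random_state=0)

    # Option A: explicit degree-2 polynomial features feeding a linear SVM.
    poly_clf = make_pipeline(PolynomialFeatures(degree=2), LinearSVC())
    poly_clf.fit(X, y)

    # Option B: approximate an RBF kernel with Nystroem, then train with SGD.
    rbf_clf = make_pipeline(StandardScaler(), Nystroem(n_components=100), SGDClassifier())
    rbf_clf.fit(X, y)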

[Scikit-learn-general] SVM, appropriate size of training set

2014-02-20 Thread Tommy Carstensen
To scikit-learn-general: I am trying to do a binary classification (true/false) of millions of samples across 5 features with an SVM. How many samples should I use to build my model? I tried using svm.SVC().fit() on hundreds of thousands of samples, but it ran for more than 12 hours. I am quite new to this…
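For context, a minimal sketch of the setup described above on synthetic data; kernel SVC training time grows much faster than linearly with the number of samples, which is why runs of this size become very slow:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Synthetic stand-in for the data described above (5 features, binary target).
    X, y = make_classification(n_samples=50000, n_features=5, random_state=0)

    # Default RBF-kernel SVC, as in svm.SVC().fit(); training cost grows roughly
    # quadratically (or worse) with the number of samples, hence the long runtimes.
    clf = SVC()
    clf.fit(X, y)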

Re: [Scikit-learn-general] Logistic regression coefficients analysis

2014-02-20 Thread Paolo Di Prodi
What about using a distance metric like this one? http://en.wikipedia.org/wiki/Normalized_Google_distance
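For reference, the normalized (Google) distance from that page can be computed directly from document counts; a minimal sketch with made-up counts:

    import math

    def normalized_distance(f_x, f_y, f_xy, n_docs):
        """Normalized (Google) distance from plain document counts.

        f_x, f_y -- number of documents containing term x / term y
        f_xy     -- number of documents containing both terms
        n_docs   -- total number of documents indexed
        """
        log_fx, log_fy = math.log(f_x), math.log(f_y)
        return (max(log_fx, log_fy) - math.log(f_xy)) / (math.log(n_docs) - min(log_fx, log_fy))

    # Made-up counts: terms that co-occur often come out with a small distance.
    print(normalized_distance(f_x=5000, f_y=3000, f_xy=2500, n_docs=1000000))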

Re: [Scikit-learn-general] Logistic regression coefficients analysis

2014-02-20 Thread Lars Buitinck
2014-02-19 20:57 GMT+01:00 Pavel Soriano: > I thought about using the values of the coefficients of the fitted logit > equation to get a glimpse of which words in the vocabulary, or which style > features, affect the classification decision the most. Is it correct to > assume that if the coefficient…
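A hedged sketch of the idea in the quoted question, on a toy corpus: rank vocabulary terms by the fitted coefficients (the documents and labels are invented for illustration):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["good great film", "great acting", "terrible plot", "bad terrible acting"]
    labels = [1, 1, 0, 0]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    clf = LogisticRegression().fit(X, labels)

    # Terms with the largest positive coefficients push towards class 1,
    # the most negative ones towards class 0.
    terms = np.array(vec.get_feature_names_out())  # get_feature_names() in older releases
    order = np.argsort(clf.coef_.ravel())
    print("most class-0 terms:", terms[order[:3]])
    print("most class-1 terms:", terms[order[-3:]])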

Re: [Scikit-learn-general] predict_proba for LinearSVC and platt scaling

2014-02-20 Thread Alexandre Gramfort
Hi Joseph, yes, I would vote for it. More generally, probability calibration has been on the wish list for some time now. See this old PR that needs some love: https://github.com/scikit-learn/scikit-learn/pull/1176 Any help on this one too is more than welcome. Best, Alex
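For illustration only (this is not the code in the PR above), Platt scaling amounts to fitting a one-dimensional logistic regression on held-out decision_function scores:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=5000, n_features=5, random_state=0)
    X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

    svm = LinearSVC().fit(X_fit, y_fit)

    # Map raw SVM decision scores to probabilities with a sigmoid fitted on
    # held-out data (the essence of Platt scaling).
    scores = svm.decision_function(X_cal).reshape(-1, 1)
    platt = LogisticRegression().fit(scores, y_cal)
    probs = platt.predict_proba(scores)[:, 1]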