One other thing to consider is that you may not need the full millions of
examples to explore the decision space when tuning hyperparameters, choosing
kernels, etc. while building the model. You could try randomly subsampling
the data (maybe 10k-100k samples is enough? It depends on your dataset) and
training on the subsample, as in the sketch below.
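A minimal sketch of that idea; the 10k subsample size, the parameter grid,
and the synthetic data are placeholder assumptions, not recommendations:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases
    from sklearn.svm import SVC

    # Synthetic stand-in for the real data.
    X, y = make_classification(n_samples=500000, n_features=5, random_state=0)

    # Tune on a random subsample instead of the full set.
    rng = np.random.RandomState(0)
    idx = rng.permutation(X.shape[0])[:10000]
    search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10],
                                             "gamma": [0.01, 0.1, 1]})
    search.fit(X[idx], y[idx])
    print(search.best_params_)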
With millions of samples, LinearSVC or SGDClassifier are more appropriate.
However, they support only the linear kernel. Since you have only 5
features, I think it would be worth trying non-linear features. You can try
the kernel approximation module [1] and PolynomialFeatures [2]; see the
sketch below.
http://scikit-le
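A sketch of both suggestions, assuming current scikit-learn import paths and
using synthetic stand-in data ([1] refers to sklearn.kernel_approximation,
[2] to sklearn.preprocessing.PolynomialFeatures):

    from sklearn.datasets import make_classification
    from sklearn.kernel_approximation import RBFSampler   # [1]
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures  # [2]
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=100000, n_features=5, random_state=0)

    # Approximate RBF kernel map + linear model: scales to millions of rows.
    rbf_model = make_pipeline(RBFSampler(gamma=1.0, n_components=100, random_state=0),
                              SGDClassifier())
    rbf_model.fit(X, y)

    # Explicit degree-2 polynomial features: cheap with only 5 input features.
    poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearSVC())
    poly_model.fit(X, y)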
To scikit-learn-general,
I am trying to do a binary classification (true/false) of millions of
samples across 5 features with SVM. How many samples should I use for
building my model? I tried using svm.SVC().fit() on hundreds of
thousands of samples, but it ran for more than 12 hours. I am quite new
to this.
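For context, a minimal reconstruction of the setup described, on synthetic
stand-in data; SVC training scales worse than quadratically with the number
of samples, which is why it stalls at this size:

    from sklearn.datasets import make_classification
    from sklearn import svm

    X, y = make_classification(n_samples=200000, n_features=5, random_state=0)
    clf = svm.SVC()
    clf.fit(X, y)  # expect this to be very slow at this scale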
What about using a distance metric like this one?
http://en.wikipedia.org/wiki/Normalized_Google_distance
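The formula on that page is straightforward to implement; a quick sketch
(the page counts below are made up):

    from math import log

    def ngd(f_x, f_y, f_xy, n):
        # Normalized Google Distance from page counts: f_x and f_y are hits
        # for each term alone, f_xy hits for both, n total pages indexed.
        return ((max(log(f_x), log(f_y)) - log(f_xy)) /
                (log(n) - min(log(f_x), log(f_y))))

    # Terms that co-occur often get a small distance.
    print(ngd(f_x=10000, f_y=8000, f_xy=6000, n=10**10))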
From: Joel Nothman [joel.noth...@gmail.com]
Sent: 19 February 2014 22:50
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] Logistic regression
2014-02-19 20:57 GMT+01:00 Pavel Soriano:
> I thought about using the values of the coefficients of the fitted logit
> equation to get a glimpse of which words in the vocabulary, or which style
> features, affect the classification decision the most. Is it correct to
> assume that if the coefficient for a feature is large, that feature has a
> strong influence on the decision?
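A toy sketch of the kind of inspection being described (the corpus here is
made up, and note that raw coefficients are only directly comparable when
the features are on similar scales):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["good fast support", "terrible slow support",
            "good service", "slow terrible service"]
    labels = [1, 0, 1, 0]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    clf = LogisticRegression().fit(X, labels)

    # Most negative coefficients push toward class 0, most positive toward 1.
    names = np.array(vec.get_feature_names_out())  # get_feature_names() in old releases
    order = np.argsort(clf.coef_[0])
    print("most negative:", names[order[:3]])
    print("most positive:", names[order[-3:]])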
hi Joseph,
Yes, I would vote for it. More generally, probability calibration has been
on the wish list for some time now. See this old PR that needs some
love:
https://github.com/scikit-learn/scikit-learn/pull/1176
Any help on this one too is more than welcome.
Best,
Alex
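For the curious: probability calibration did later ship in scikit-learn as
sklearn.calibration.CalibratedClassifierCV. A minimal sketch of the sigmoid
(Platt) calibration being discussed, on synthetic stand-in data:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=5000, n_features=5, random_state=0)

    # Wrap a margin-based classifier so it exposes calibrated predict_proba.
    calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
    calibrated.fit(X, y)
    print(calibrated.predict_proba(X[:3]))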