2013/2/25 Ark <[email protected]>:
> Due to a very large number of features (and to reduce the size), I use SelectKBest,
> which selects 150k features from the 500k features that I get from
> TfidfVectorizer, which worked fine. When I use HashingVectorizer instead of
> TfidfVectorizer I see the following w
>> You could also try the HashingVectorizer in sklearn.feature_extraction
>> and see if performance is still acceptable with a small number of
>> features. That also skips storing the vocabulary, which I imagine will
>> be quite large as well.
>
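For illustration, a minimal sketch of the HashingVectorizer suggestion quoted
above (the corpus and the n_features value are placeholders, not taken from
this thread):

from sklearn.feature_extraction.text import HashingVectorizer

# Placeholder corpus; in the use case above this would be the ~12000 documents.
docs = ["first example document", "second example document"]

# 2**18 = 262144 hashed features; no vocabulary is stored on the vectorizer,
# so the fitted object stays small regardless of corpus size.
vec = HashingVectorizer(n_features=2 ** 18)
X = vec.transform(docs)  # scipy sparse matrix of shape (n_docs, 2**18)
print(X.shape)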
2013/2/22 Peter Prettenhofer :
> http://xkcd.com/394/
Also http://xkcd.com/1000/
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
On 02/22/2013 11:39 AM, Andreas Mueller wrote:
> I was just wondering: does the current l1 penalty implementation actually
> lead to sparse coef_?
> I thought additional tricks were required for that.
> If it is the case, maybe an example would be nice?
>
>
Oh, ok, the implementation indeed yields sparse coef_.
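For illustration (placeholder data, not from the thread), a quick way to check
how sparse coef_ actually gets with penalty="l1":

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=500, n_informative=20,
                           random_state=0)
clf = SGDClassifier(loss="hinge", penalty="l1", alpha=1e-4, random_state=0)
clf.fit(X, y)

# Fraction of coefficients that are exactly zero after training.
print("zero coefficients: %.1f%%" % (100.0 * np.mean(clf.coef_ == 0)))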
2013/2/22 Olivier Grisel :
> 2013/2/22 Peter Prettenhofer :
>> @ark: for 500K features and 3K classes your coef_ matrix will be:
>> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
>
> Nitpicking, that will be:
>
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GiB
>
> or:
>
> 500000 * 3000 * 8 / 1e9 ~= 12GB
http://xkcd.com/394/
2013/2/22 Peter Prettenhofer :
> @ark: for 500K features and 3K classes your coef_ matrix will be:
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
Nitpicking, that will be:
500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GiB
or:
500000 * 3000 * 8 / 1e9 ~= 12GB
But nearly everybody is making the mistake...
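Spelled out in code, the two versions of the same arithmetic:

n_bytes = 500000 * 3000 * 8   # n_features * n_classes * sizeof(float64)
print(n_bytes / 1024. ** 3)   # ~11.2 -> GiB (binary prefix)
print(n_bytes / 1e9)          # 12.0  -> GB  (decimal prefix)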
I was just wondering: does the current l1 penalty implementation actually
lead to sparse coef_?
I thought additional tricks were required for that.
If it is the case, maybe an example would be nice?
On 02/22/2013 11:15 AM, Peter Prettenhofer wrote:
> I just opened a PR for this issue:
> https://github.com/scikit-learn/scikit-learn/pull/1702
I just opened a PR for this issue:
https://github.com/scikit-learn/scikit-learn/pull/1702
2013/2/22 Peter Prettenhofer :
> @ark: for 500K features and 3K classes your coef_ matrix will be:
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
>
> Coef_ is stored as a dense matrix - you might get a considera
@ark: for 500K features and 3K classes your coef_ matrix will be:
500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
coef_ is stored as a dense matrix - you might get a considerably
smaller matrix if you use sparse regularization (keeps most
coefficients zero) and convert the coef_ array to a scipy sparse matrix.
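A rough sketch of that idea, assuming clf is an already-fitted SGDClassifier
trained with a sparsity-inducing penalty (so most entries of coef_ are exactly
zero):

import scipy.sparse as sp

dense_coef = clf.coef_                   # shape (n_classes, n_features), float64
sparse_coef = sp.csr_matrix(dense_coef)  # only non-zero entries are stored

print("dense bytes: ", dense_coef.nbytes)
print("sparse bytes:", sparse_coef.data.nbytes
      + sparse_coef.indices.nbytes + sparse_coef.indptr.nbytes)

# Convert back to a dense array (sparse_coef.toarray()) before further use
# if the estimator expects a dense coef_.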
>
> You could cut that in half by converting coef_ and optionally
> intercept_ to np.float32 (that's not officially supported, but with
> the current implementation it should work):
>
> clf.coef_ = clf.coef_.astype(np.float32)
>
> You could also try the HashingVectorizer in sklearn.feature_extraction
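A sketch of the np.float32 suggestion a few lines up (assumes clf is an
already-fitted classifier; as noted, this is not an officially supported
configuration):

import numpy as np

print("before:", clf.coef_.nbytes)          # 8 bytes per coefficient (float64)
clf.coef_ = clf.coef_.astype(np.float32)    # 4 bytes per coefficient
clf.intercept_ = clf.intercept_.astype(np.float32)
print("after: ", clf.coef_.nbytes)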
>
> btw you could also use a different multiclass strategy like error correcting
> output codes (exists in sklearn) or a binary tree of classifiers (would have to
> implement yourself)
>
Will explore the error-correcting output codes and binary tree of classifiers.
Thanks..
---
2013/2/21 Ark <[email protected]>:
> Document classification of ~3000 categories with ~12000 documents.
> The number of features comes out to be 500,000 [in which case the joblib
> classifier dumped is 10g]. If I use SelectKBest to select 20 best
> features
> the size comes down to 4.8g maint
btw you could also use a different multiclass strategy like error correcting
output codes (exists in sklearn) or a binary tree of classifiers (would have to
implement yourself)
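A sketch of the error-correcting output codes option via
sklearn.multiclass.OutputCodeClassifier (the data, the base estimator settings
and code_size are placeholders):

from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OutputCodeClassifier

# With code_size < 1 the number of underlying binary classifiers (and hence
# the number of stored coefficient rows) is only a fraction of the ~3000
# one-vs-rest classifiers SGDClassifier would otherwise train.
ecoc = OutputCodeClassifier(SGDClassifier(loss="hinge"), code_size=0.2,
                            random_state=0)
ecoc.fit(X_train, y_train)        # X_train / y_train assumed to exist
predictions = ecoc.predict(X_test)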
Ark <[email protected]> wrote:
>>
>> The size is dominated by the n_features * n_classes coef_ matrix,
>> which you can't get rid of just like that.
you could try some backward feature selection like recursive feature
elimination or just dropping features with negligible coefficients. group l1
penalty on the weights would probably be the way to go but we don't have that
...
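A sketch of the simpler "drop features with negligible coefficients" option
(clf is assumed to be a fitted linear classifier, X the training matrix, and
the threshold is an arbitrary assumption):

import numpy as np

threshold = 1e-4                                   # assumed cut-off
# coef_ has shape (n_classes, n_features); keep a feature if any class uses it.
keep = np.flatnonzero(np.abs(clf.coef_).max(axis=0) > threshold)

X_reduced = X[:, keep]            # column selection works for dense and CSR X
print("kept %d of %d features" % (len(keep), clf.coef_.shape[1]))
# Retrain on X_reduced and apply the same column selection at prediction time.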
Ark <[email protected]> wrote:
>>
>> The size is dominated by
you only need coef_ and intercept_ to make predictions; not much else should
be stored. If storing coef_ yourself gives a real size gain, that is probably a bug.
What is the number of features and classes?
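A sketch of the "only coef_ and intercept_ are needed for prediction" point
(file name arbitrary; clf assumed to be a fitted multiclass SGDClassifier):

import numpy as np
from sklearn.linear_model import SGDClassifier

np.savez_compressed("sgd_params.npz", coef=clf.coef_,
                    intercept=clf.intercept_, classes=clf.classes_)

data = np.load("sgd_params.npz")
new_clf = SGDClassifier()
new_clf.coef_ = data["coef"]
new_clf.intercept_ = data["intercept"]
new_clf.classes_ = data["classes"]
# new_clf.predict(...) should now work from just these arrays; this is a
# known trick rather than an official API.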
Ark <[email protected]> wrote:
> I have been wondering about what makes the size of an SGD classifier ~10G.
>
> The size is dominated by the n_features * n_classes coef_ matrix,
> which you can't get rid of just like that. What does your problem look
> like?
>
Document classification of ~3000 categories with ~12000 documents.
The number of features comes out to be 500,000 [in which case the joblib
classifier dumped is 10g].
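For reference, a sketch of the SelectKBest step mentioned earlier in the
thread (chi2 as score function and the variable names are assumptions;
k=150000 is the figure quoted above):

from sklearn.feature_selection import SelectKBest, chi2

# X_train / X_test are assumed to be the (sparse, non-negative) term matrices.
selector = SelectKBest(chi2, k=150000)
X_train_small = selector.fit_transform(X_train, y_train)
X_test_small = selector.transform(X_test)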
2013/2/21 Ark <[email protected]>:
> I have been wondering about what makes the size of an SGD classifier ~10G. If
> the only purpose of the estimator is to predict, is there a way to cut down on
> the attributes that are saved [I was looking to serialize only the necessary
> parts if possible]. Is there a better approach to package
I have been wondering about what makes the size of an SGD classifier ~10G. If
the only purpose of the estimator is to predict, is there a way to cut down on
the attributes that are saved [I was looking to serialize only the necessary
parts if possible]. Is there a better approach to package
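On the packaging question, one more sketch (file name arbitrary): joblib can
compress the pickle on disk, which helps when coef_ contains many zeros or
repeated values.

import joblib  # was sklearn.externals.joblib in 2013-era scikit-learn

joblib.dump(clf, "sgd_clf.joblib", compress=3)   # compression level 0-9
clf = joblib.load("sgd_clf.joblib")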