Re: [Scikit-learn-general] Packaging large objects

2013-02-25 Thread Lars Buitinck
2013/2/25 Ark <[email protected]>:
> Due to a very large number of features (and to reduce the size), I use SelectKBest,
> which selects 150k features from the 500k features that I get from
> TfidfVectorizer, which worked fine. When I use HashingVectorizer instead of
> TfidfVectorizer I see the following w
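A minimal sketch of the setup being described, on a toy corpus (the chi2 scorer, the n_features value, and the pipeline wiring are illustrative assumptions; the thread does not specify them). One caveat worth knowing: chi2 requires non-negative feature values, and hashed features are signed unless alternate_sign=False is passed (in the 0.13-era API the parameter was spelled non_negative=True):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline

    # Toy corpus standing in for the ~12000 documents in the thread.
    docs = ["the quick brown fox", "lorem ipsum dolor",
            "the lazy dog", "ipsum fox"]
    labels = [0, 1, 0, 1]

    # HashingVectorizer stores no vocabulary, so the fitted vectorizer
    # itself stays small; 2**19 buckets roughly mirror the ~500k features.
    vec = HashingVectorizer(n_features=2 ** 19, alternate_sign=False)

    # k=2 only fits the toy corpus; the thread uses k=150000.
    pipe = Pipeline([("vec", vec), ("kbest", SelectKBest(chi2, k=2))])
    pipe.fit(docs, labels)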

Re: [Scikit-learn-general] Packaging large objects

2013-02-25 Thread Ark
> You could also try the HashingVectorizer in sklearn.feature_extraction
> and see if performance is still acceptable with a small number of
> features. That also skips storing the vocabulary, which I imagine will
> be quite large as well.

Due to a very large number of features (and to reduce the si

Re: [Scikit-learn-general] Packaging large objects

2013-02-22 Thread Lars Buitinck
2013/2/22 Peter Prettenhofer :
> http://xkcd.com/394/

Also http://xkcd.com/1000/

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

Re: [Scikit-learn-general] Packaging large objects

2013-02-22 Thread Andreas Mueller
On 02/22/2013 11:39 AM, Andreas Mueller wrote:
> I was just wondering: does the current l1 penalty implementation actually
> lead to sparse coef_?
> I thought additional tricks were required for that.
> If it is the case, maybe an example would be nice?

Oh, ok, the implementation indeed yields s
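A small self-contained check of the point being settled here (the synthetic data and the alpha value are arbitrary choices, not from the thread):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=500, n_features=100, random_state=0)
    clf = SGDClassifier(penalty="l1", alpha=0.001, random_state=0).fit(X, y)

    # With an L1 penalty, many weights end up exactly zero -- which is what
    # makes a later dense-to-sparse conversion of coef_ worthwhile.
    print("zero weights: %.0f%%" % (100 * np.mean(clf.coef_ == 0)))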

Re: [Scikit-learn-general] Packaging large objects

2013-02-22 Thread Peter Prettenhofer
http://xkcd.com/394/

2013/2/22 Olivier Grisel :
> 2013/2/22 Peter Prettenhofer :
>> @ark: for 500K features and 3K classes your coef_ matrix will be:
>> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
>
> Nitpicking, that will be:
>
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GiB
>
> or:
>
> 500000 * 3000

Re: [Scikit-learn-general] Packaging large objects

2013-02-22 Thread Olivier Grisel
2013/2/22 Peter Prettenhofer :
> @ark: for 500K features and 3K classes your coef_ matrix will be:
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB

Nitpicking, that will be:

500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GiB

or:

500000 * 3000 * 8 / 1e9 ~= 12GB

But nearly everybody is making the mistake...
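Spelled out, the two unit conventions behind the nitpick:

    n_features, n_classes = 500000, 3000
    n_bytes = n_features * n_classes * 8  # float64 -> 12,000,000,000 bytes

    print(n_bytes / 1024.0 ** 3)  # ~11.18 GiB (binary prefix, 1024**3)
    print(n_bytes / 1e9)          # 12.0  GB  (decimal prefix, 1e9)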

Re: [Scikit-learn-general] Packaging large objects

2013-02-22 Thread Andreas Mueller
I was just wondering: does the current l1 penalty implementation actually lead to sparse coef_? I thought additional tricks were required for that. If it is the case, maybe an example would be nice?

On 02/22/2013 11:15 AM, Peter Prettenhofer wrote:
> I just opened a PR for this issue:
> https://github.com/scikit-learn/scikit-learn/pull/1702

Re: [Scikit-learn-general] Packaging large objects

2013-02-22 Thread Peter Prettenhofer
I just opened a PR for this issue:
https://github.com/scikit-learn/scikit-learn/pull/1702

2013/2/22 Peter Prettenhofer :
> @ark: for 500K features and 3K classes your coef_ matrix will be:
> 500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB
>
> coef_ is stored as a dense matrix - you might get a considera

Re: [Scikit-learn-general] Packaging large objects

2013-02-22 Thread Peter Prettenhofer
@ark: for 500K features and 3K classes your coef_ matrix will be:

500000 * 3000 * 8 / 1024. / 1024. / 1024. ~= 11GB

coef_ is stored as a dense matrix - you might get a considerably smaller matrix if you use sparse regularization (keeps most coefficients zero) and convert the coef_ array to a scipy sparse matrix.
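A sketch of the dense-to-sparse conversion being suggested, on toy data (whether the converted estimator can still predict depends on the sklearn version, which is what the PR linked above addresses):

    import scipy.sparse as sp
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=500, n_features=200,
                               n_informative=20, n_classes=3,
                               n_clusters_per_class=1, random_state=0)
    clf = SGDClassifier(penalty="l1", alpha=0.001, random_state=0).fit(X, y)

    # CSR storage keeps only the nonzero entries (plus index arrays), so
    # the saving is proportional to how sparse L1 made the weights.
    dense_size = clf.coef_.nbytes
    sparse_coef = sp.csr_matrix(clf.coef_)
    sparse_size = (sparse_coef.data.nbytes + sparse_coef.indices.nbytes
                   + sparse_coef.indptr.nbytes)
    print(dense_size, sparse_size)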

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
> You could cut that in half by converting coef_ and optionally
> intercept_ to np.float32 (that's not officially supported, but with
> the current implementation it should work):
>
> clf.coef_ = clf.coef_.astype(np.float32)
>
> You could also try the HashingVectorizer in sklearn.feature_extraction

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
> btw you could also use a different multiclass strategy like error
> correcting output codes (exists in sklearn) or a binary tree of
> classifiers (would have to implement yourself)

Will explore the error-correcting output codes and binary tree of classifiers. Thanks..

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Lars Buitinck
2013/2/21 Ark <[email protected]>:
> Document classification of ~3000 categories with ~12000 documents.
> The number of features comes out to be 500,000 [in which case the joblib
> classifier dumped is 10g]. If I use SelectKBest to select the 20 best
> features the size comes down to 4.8g maint

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread amueller
btw you could also use a different multiclass strategy like error correcting output codes (exists in sklearn) or a binary tree of classifiers (would have to implement yourself)

Ark <[email protected]> wrote:
>> The size is dominated by the n_features * n_classes coef_ matrix,
>> which y
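The error-correcting output codes option does exist in sklearn as OutputCodeClassifier; a sketch (the dataset and code_size value are arbitrary illustrations):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OutputCodeClassifier

    digits = load_digits()

    # code_size < 1 trains fewer binary classifiers than there are classes
    # (here 10 classes -> 5 binary problems), so fewer weight vectors get
    # stored -- at a possible cost in accuracy.
    ecoc = OutputCodeClassifier(SGDClassifier(random_state=0),
                                code_size=0.5, random_state=0)
    ecoc.fit(digits.data, digits.target)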

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread amueller
You could try some backward feature selection like recursive feature elimination, or just drop features with negligible coefficients. A group-l1 penalty on the weights would probably be the way to go, but we don't have that ...

Ark <[email protected]> wrote:
>> The size is dominated by
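A sketch of the simpler of the two suggestions, dropping features with negligible coefficients (the threshold is a hypothetical value, and the data is synthetic):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=500, n_features=100, random_state=0)
    clf = SGDClassifier(random_state=0).fit(X, y)

    # Keep only features whose largest absolute weight over all classes
    # clears the threshold; everything else is treated as negligible.
    keep = np.abs(clf.coef_).max(axis=0) > 1e-2
    clf_small = SGDClassifier(random_state=0).fit(X[:, keep], y)
    print("kept %d of %d features" % (keep.sum(), X.shape[1]))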

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread amueller
You only need coef_ and intercept_ to make predictions; not much else should be stored. If there is a gain from storing coef_ yourself, it is probably a bug. What is the number of features and classes?

Ark <[email protected]> wrote:
> I have been wondering about what makes the size of an S
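A sketch of serializing only those arrays (plus classes_) and rebuilding a predictor from them. This mirrors the advice above rather than a supported API; whether a hand-restored estimator predicts correctly can vary across sklearn versions:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import SGDClassifier

    iris = load_iris()
    clf = SGDClassifier(random_state=0).fit(iris.data, iris.target)

    # Persist only what prediction needs.
    np.savez("model.npz", coef=clf.coef_, intercept=clf.intercept_,
             classes=clf.classes_)

    # Rebuild: construct an estimator with the same settings, then attach
    # the learned attributes by hand.
    data = np.load("model.npz")
    clf2 = SGDClassifier(random_state=0)
    clf2.coef_, clf2.intercept_, clf2.classes_ = (
        data["coef"], data["intercept"], data["classes"])
    print(clf2.predict(iris.data[:5]))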

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
> The size is dominated by the n_features * n_classes coef_ matrix,
> which you can't get rid of just like that. What does your problem look
> like?

Document classification of ~3000 categories with ~12000 documents. The number of features comes out to be 500,000 [in which case the joblib c

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Lars Buitinck
2013/2/21 Ark <[email protected]>:
> I have been wondering about what makes the size of an SGD classifier ~10G.
> If the only purpose of the estimator is to predict, is there a way to cut
> down on the attributes that are saved [I was looking to serialize only the
> necessary parts if pos

[Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
I have been wondering about what makes the size of an SGD classifier ~10G. If the only purpose of the estimator is to predict, is there a way to cut down on the attributes that are saved [I was looking to serialize only the necessary parts, if possible]. Is there a better approach to package