Thank you for the detailed explanation. I think the approach with
the feedback mechanism seems appropriate at this point.
> If you plan to seriously increase the number of documents in your
> corpus you could also try a Rocchio classifier [1] or a k-NN
> classifier. For large text documents colle
In order to get a rough estimate I was thinking of a clustering approach
like k-means; however, since the number of categories might be less than
3000, does this seem to be the correct approach? Or if there is a better
solution, I would certainly appreciate pointers.
Regards,
Ark
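
For what such a rough clustering pass could look like, here is a sketch using
MiniBatchKMeans (chosen only because it scales better than plain KMeans for
thousands of clusters; it is not something suggested in the thread). The docs
name and the cluster count are placeholders:

# Sketch: cluster TF-IDF vectors to get a coarse picture of how many
# category-like groups the corpus contains.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

X = TfidfVectorizer(stop_words='english', max_df=0.5).fit_transform(docs)

km = MiniBatchKMeans(n_clusters=3000, batch_size=1000, random_state=42)
km.fit(X)
print(len(set(km.labels_)))              # clusters that actually got members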
> -or is there a best way to switch to something like knn (which initially
Correction: or would the best way be to switch to something like k-NN?
I am using SGDClassifier for document classification,
where (n_samples, n_features) = (12000, 50).
In my project, in some cases the chosen category leads to post-processing
the document and predicting again, in which case it should not predict the
same category but return th
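
One way to return the next-best category after post-processing, sketched under
the assumption that the classifier exposes per-class scores via
decision_function; clf, X and prev_label are placeholders:

import numpy as np

# Sketch: rank the classes by score and skip the category already tried.
scores = clf.decision_function(X)        # X: the re-vectorized document, shape (1, n_features)
order = np.argsort(scores[0])[::-1]      # best-scoring class first
for idx in order:
    label = clf.classes_[idx]
    if label != prev_label:              # prev_label: the first prediction
        break
print(label)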
in case anyone finds this helpful; [and if it
has any cascading effect I failed to notice].
Regards,
Ark
writes:
>
> did you see my earlier reply?
>
Ah, you are right, sorry about that... Any particular reason we reset the
value?
>
> This is unexpected. Can you inspect the vocabulary_ on both
> vectorizers? Try computing their set.intersection, set.difference,
> set.symmetric_difference (all Python builtins).
>
In [17]: len(set.symmetric_difference(set(vect13.vocabulary_.keys()),
         set(vect14.vocabulary_.keys())))
Out[17]:
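
A fuller version of that inspection, as a sketch assuming vect13 and vect14
are the vectorizers fitted under 0.13.1 and 0.14-git respectively:

# Sketch: compare the two fitted vocabularies term by term.
vocab13 = set(vect13.vocabulary_.keys())
vocab14 = set(vect14.vocabulary_.keys())

print(len(vocab13 & vocab14))            # terms present in both
print(sorted(vocab13 - vocab14)[:20])    # sample of terms only in 0.13.1
print(sorted(vocab14 - vocab13)[:20])    # sample of terms only in 0.14-git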
The vectorized input with the same training data set differs between versions
0.13.1 and 0.14-git.
For:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
                             smooth_idf=True, sublinear_tf=True, max_df=0.5,
                             token_pattern=ur'\b(?!\d)\w\w+\b')
On fit_transf
> You could also try the HashingVectorizer in sklearn.feature_extraction
> and see if performance is still acceptable with a small number of
> features. That also skips storing the vocabulary, which I imagine will
> be quite large as well.
>
Due to a very large number of features (and reduce the si
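
A minimal sketch of that HashingVectorizer suggestion; the n_features value
and the train_docs/y_train names are placeholders, and loss='log' matches the
classifier used elsewhere in this thread ('log_loss' in recent sklearn
releases):

# Sketch: hash features instead of storing a vocabulary_, with the
# feature count fixed up front.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(stop_words='english', ngram_range=(1, 2),
                               n_features=2 ** 18)
X_train = vectorizer.transform(train_docs)   # train_docs: raw text documents
clf = SGDClassifier(loss='log').fit(X_train, y_train)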
>
> You could cut that in half by converting coef_ and optionally
> intercept_ to np.float32 (that's not officially supported, but with
> the current implementation it should work):
>
> clf.coef_ = clf.coef_.astype(np.float32)
>
> You could also try the HashingVectorizer in sklearn.featu
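
A sketch of the float32 suggestion above with a quick size check; clf is
assumed to be the fitted SGDClassifier, and the cast is, as noted, not
officially supported:

import numpy as np

# Downcast the fitted weights to roughly halve the model size
# (unsupported, but reported to work with the implementation above).
print(clf.coef_.nbytes)                   # size in bytes before the cast
clf.coef_ = clf.coef_.astype(np.float32)
clf.intercept_ = clf.intercept_.astype(np.float32)
print(clf.coef_.nbytes)                   # roughly half afterwards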
>
> btw you could also use a different multiclass strategy like error correcting
> output codes (exists in sklearn) or a binary tree of classifiers (would have
> to implement yourself)
>
Will explore the error-correcting output codes and the binary tree of
classifiers.
Thanks..
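
For reference, the error-correcting output codes strategy mentioned above
lives in sklearn.multiclass; a minimal sketch, with code_size picked
arbitrarily and X_train/y_train as placeholders:

# Sketch: ECOC trains about code_size * n_classes binary problems
# instead of one per class, which shrinks the stored weights.
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import SGDClassifier

ecoc = OutputCodeClassifier(SGDClassifier(loss='log'),
                            code_size=0.2, random_state=0)
ecoc.fit(X_train, y_train)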
---
>
> The size is dominated by the n_features * n_classes coef_ matrix,
> which you can't get rid of just like that. What does your problem look
> like?
>
Document classification of ~3000 categories with ~12000 documents.
The number of features comes out to be 500,000 [in which case the joblib
c
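
Those dimensions are consistent with the ~10 GB model size discussed in this
thread; a rough back-of-the-envelope check:

# Size of a dense float64 coef_ matrix of shape (n_classes, n_features).
n_classes, n_features = 3000, 500000
print(n_classes * n_features * 8 / float(2 ** 30))   # ~11.2 GiB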
I have been wondering about what makes the size of an SGD classifier ~10 GB. If
the only purpose of the estimator is to predict, is there a way to cut down on
the attributes that are saved? [I was looking to serialize only the necessary
parts if possible.] Is there a better approach to package
>
> Before I forget again, thanks all for explanations and responses :)
>
Oddly enough, if I wrap SGDClassifier within OneVsRestClassifier and
set n_jobs=-1 for the OvR classifier, it goes fine, except that of course the
compression still returns a corrupted file.
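
Roughly the setup being described, as a sketch; X_train/y_train, the file name
and the compress level are placeholders, and joblib is imported from
sklearn.externals as in the rest of this thread (it is a standalone package in
newer releases):

# Sketch: one-vs-rest SGD trained in parallel, then dumped compressed.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.externals import joblib

clf = OneVsRestClassifier(SGDClassifier(loss='log'), n_jobs=-1)
clf.fit(X_train, y_train)
joblib.dump(clf, 'classifier.joblib', compress=3)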
-
Before I forget again, thanks all for explanations and responses :)
Matthieu Brucher writes:
>
>
>
> Hi,
> I think Samuel is asking if you are using a 32-bit version or a 64-bit one.
> Cheers,
> Matthieu
>
>
Ugh I just realized that after I posted, sorry my mistake..
It seems we are on 64-bit:
$ python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
Samuel Garcia writes:
>
> 2147483647 = 2**31-1 = 2 GB
> Are you running python32?
No, I am using Python 2.6. The issue with upgrading Python is that all the
other systems depend on this version and would need many code changes; maybe
we will do that eventually, but for now we are staying on Python 2.6.
Ronnie Ghose writes:
>
>
> do you think it isn't saving correctly or it isn't loading correctly?
>
I am thinking the issue is with writing the length of the compressed zfile in
write_zfile of numpy_pickle.py (although I might be wrong :)).
[Er, btw bit unrelated, but since I moved t
Gael Varoquaux writes:
>
> On Wed, Jan 23, 2013 at 12:16:32AM +, Afik Cohen wrote:
> Hi, I'm working with Ark on this project. Yes, that's what it looks like
> > - some investigation into this appears to show that either this is a bug
> > in zlib (the l
Ronnie Ghose writes:
>
>
> Any point in adding data redundancy to joblib dumped objects?
>
Ah, sorry for not being clear before; the steps in ipython were just to
demonstrate the compression issue, not the flow of the code. I used an already
dumped object instead of retraining. [unless I
Hello,
I had been trying to dump a compressed joblib file (which was working fine
about a month ago). Previously I had an issue with the amount of memory that
joblib compression took, and it seemed that zlib was the issue, but I added
more memory to work around that. However, when I tried it
category.
> > I also tried downloading the 0.13 version from source and installing it.
> > This time I see a different error. The steps to reproduce for version 0.13
> > in ipython are as follows:
> You can not necessarily load a classifier that was trained with one
> version in another version.
> Could you please provide a minimum code sample to reproduce and open an
> issue on github.
Following is the minimal code to reproduce the issue (assuming the
classifier is already trained and saved). I will open the issue on GitHub for
the same.
Hello,
I see an issue with predict when classifying a text document. [I load
an already trained classifier (OneVsRestClassifier(SGDClassifier(loss='log')))
using joblib.load.]
Thanks.
In [1]: import sklearn
In [2]: from sklearn.externals import joblib
In [4]: clf = joblib.load("classifier.joblib
>Can you try to turn off IDF normalization using `use_idf=False ` in
>the constructor params of your vectorizer and retry (fit + predict) to
>see if it's related to IDF normalization?
>How many dimensions do you have in your fitted model?
https://gist.github.com/3933727
data_vectors.shape = (10361
e if it's related to IDF normalization?
>
> How many dimensions do you have in your fitted model?
>
> >>> print len(vectorizer.vocabulary_)
>
> How many documents do you have in your training corpus?
>
> How many non-zeros do you have in your transformed document?
>
> >>> print vectorizer.tran
> I don't see the number of non-zeros: could you please do:
>
> >>> print vectorizer.transform([my_text_document])
>
> as I asked previously? The run time should be linear with the number
> of non zeros.
ipdb> print self.vectorizer.transform([doc])
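
The kind of check being asked for, sketched with the standard time module;
vectorizer and doc are assumed to be the objects from the snippets above:

import time

# Count the non-zeros in the transformed document and time the transform.
t0 = time.time()
X = vectorizer.transform([doc])
print(X.nnz)                                        # number of non-zero features
print("transform took %.2fs" % (time.time() - t0))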
Olivier Grisel writes:
> > https://gist.github.com/3815467
>
> The offending line seems to be:
>
>        1    1.193    1.193    7.473    7.473 base.py:529(setdiag)
>
> which I don't understand how it could happen at predict time. At fit
> time it could have been:
>
> https://github.com/sci
> >> 7s is very long. How long is your text document in bytes ?
> > The text documents are around 50kB.
>
> That should not take 7s to extract a TF-IDF for a single 50kb
> document. There must be a bug, can you please put a minimalistic code
> snippet + example document that reproduce the issue o
Olivier Grisel writes:
> You can use the Pipeline class to build a compound classifier that
> binds a text feature extractor with a classifier to get a text
> document classifier in the end.
>
Done!
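
A minimal sketch of such a Pipeline, reusing the vectorizer settings quoted
earlier in this thread; the classifier parameters and the train_docs/y_train
names are placeholders:

# Sketch: bind the text feature extractor and the classifier into one
# compound estimator, as suggested above.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
                             sublinear_tf=True, max_df=0.5)),
    ('clf', SGDClassifier(loss='log')),
])
text_clf.fit(train_docs, y_train)        # raw documents in, labels out
print(text_clf.predict([some_document]))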
>
> 7s is very long. How long is your text document in bytes ?
The text documents are around
stored vocabulary?]
Ark.
Olivier Grisel writes:
>
> 2012/9/6 Ark :
> >
> >> And how large in bytes? It seems that it should be small enough to be
> >> able to use sklearn.linear_model.LogisticRegression despite the data
> >> copy in memory.
> >>
> >
> > R
> And how large in bytes? It seems that it should be small enough to be
> able to use sklearn.linear_model.LogisticRegression despite the data
> copy in memory.
>
Right now it's not even 100 MB, but it will extend to at least 1 GB.
Ark writes:
>
>
> > How large (in bytes and in which format)? What are n_samples,
> > n_features and n_classes?
> >
>
> Input data is in the form of paragraphs from English literature
So the flow is:
raw data -> CountVectorizer -> train/test split -> sgd.fit ->
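
That flow, spelled out as a sketch; raw_docs and labels are placeholders, and
train_test_split is imported from sklearn.cross_validation to match the
sklearn versions in this thread (sklearn.model_selection in newer releases):

# Sketch of the flow above: raw text -> CountVectorizer ->
# train/test split -> SGDClassifier.fit.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import SGDClassifier

X = CountVectorizer().fit_transform(raw_docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

clf = SGDClassifier(loss='log').fit(X_train, y_train)
print(clf.score(X_test, y_test))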
> How large (in bytes and in which format)? What are n_samples,
> n_features and n_classes?
>
Input data is in the form of paragraphs from English literature.
n_samples=1, n_features=100,000, n_classes=max 100 [still collecting data]
What would be the best approach to classify a large dataset with sparse
features into multiple categories? I referred to the multiclass page in the
sklearn documentation, but was not sure which one to use for multiclass
probabilities [top-n probabilities would be nice].
I tried usin
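
One way to get top-n probabilities, sketched with a probabilistic linear model
(loss='log' so predict_proba is available); clf, X and n are placeholders:

import numpy as np

# Sketch: rank the categories for one sample by predicted probability
# and keep the n most likely ones.
proba = clf.predict_proba(X)[0]          # clf: e.g. a fitted SGDClassifier(loss='log')
top_n = np.argsort(proba)[::-1][:n]
for idx in top_n:
    print("%s %.4f" % (clf.classes_[idx], proba[idx]))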