Thank you for the detailed explanation. I think the approach with
the feedback mechanism seems appropriate at this point.
If you plan to seriously increase the number of documents in your
corpus you could also try a Rocchio classifier [1] or a k-NN
classifier. For large text documents
In order to get a rough estimate I was thinking of a clustering approach
like k-means; however, since the number of categories might be less than
3000, does this seem to be the correct approach? Or if there is a better
solution, I would certainly appreciate pointers.
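To make the k-means idea concrete, here is a minimal sketch; the vectors and the cluster count below are made up for illustration, and MiniBatchKMeans is usually the practical variant at corpus scale:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
# Stand-in for (n_samples, n_features) document vectors.
X = rng.rand(200, 50)

# MiniBatchKMeans handles large corpora far better than vanilla k-means;
# 30 clusters is an arbitrary illustrative guess at the category count.
km = MiniBatchKMeans(n_clusters=30, random_state=0, n_init=3)
labels = km.fit_predict(X)

print(labels.shape)               # one cluster id per document
print(km.cluster_centers_.shape)  # one centroid per cluster
```

Inspecting a few documents per cluster would then give a rough feel for how many real categories there are.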
Regards,
Ark
I am using SGDClassifier for document classification,
where (n_samples, n_features) = (12000, 50).
In my project, in some cases the category chosen leads to
post-processing the document and trying to predict again, in which case it
should not predict the same category, but return
-- or is the best way to switch to something like knn?
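One way to get a runner-up label without switching estimators, sketched on synthetic data (the dataset and shapes here are invented): SGDClassifier's decision_function returns one score per class, so sorting it yields a full ranking and hence a second-best class to fall back on.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(60, 10)
y = rng.randint(0, 3, size=60)

clf = SGDClassifier(random_state=0).fit(X, y)

# One score per class for the first document; argsort gives a ranking.
scores = clf.decision_function(X[:1])[0]
ranking = np.argsort(scores)[::-1]          # best class first
best, second_best = clf.classes_[ranking[:2]]

print(best, second_best)
```

If the top category has been ruled out by post-processing, `second_best` can be returned instead of re-predicting.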
this helpful; [and if it
has any cascading effect I failed to notice].
Regards,
Ark
amueller@... writes:
did you see my earlier reply?
Ah, you are right, sorry about that... Any particular reason we reset the
value?
This is unexpected. Can you inspect the vocabulary_ on both
vectorizers? Try computing their set.intersection, set.difference,
set.symmetric_difference (all Python builtins).
In [17]: len(set.symmetric_difference(set(vect13.vocabulary_.keys()),
                                      set(vect14.vocabulary_.keys())))
Out[17]:
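For reference, those set operations can be demonstrated on two toy vectorizers (hypothetical stand-ins for the vect13/vect14 fitted under different scikit-learn versions):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs_a = ["the cat sat", "the dog barked"]
docs_b = ["the cat sat", "the bird sang"]

va = CountVectorizer().fit(docs_a)
vb = CountVectorizer().fit(docs_b)

keys_a = set(va.vocabulary_)
keys_b = set(vb.vocabulary_)

print(keys_a & keys_b)   # intersection: terms both vectorizers learned
print(keys_a - keys_b)   # difference: terms only in the first
print(keys_a ^ keys_b)   # symmetric difference: terms in exactly one
```

A non-empty symmetric difference on the same training data would point at a tokenization change between versions.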
The vectorized input with the same training data set differs between
versions 0.13.1 and 0.14-git.
For:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
                             smooth_idf=True, sublinear_tf=True, max_df=0.5,
                             token_pattern=ur'\b(?!\d)\w\w+\b')
You could also try the HashingVectorizer in sklearn.feature_extraction
and see if performance is still acceptable with a small number of
features. That also skips storing the vocabulary, which I imagine will
be quite large as well.
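A minimal sketch of that suggestion (the feature count here is illustrative): HashingVectorizer maps terms to a fixed number of hashed columns and keeps no vocabulary_ at all, so there is nothing vocabulary-sized to serialize.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# 2**18 hashed features, an arbitrary illustrative choice;
# smaller values trade accuracy for memory.
hv = HashingVectorizer(n_features=2**18, stop_words='english')
X = hv.transform(["A small example document about graph databases."])

print(X.shape)                       # (1, 262144) regardless of corpus size
print(hasattr(hv, 'vocabulary_'))    # False: stateless, nothing stored
```

Because it is stateless, the same object can be reused at predict time without pickling any fitted vocabulary.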
Due to a very large number of features (and to reduce the size),
I have been wondering what makes the size of an SGD classifier ~10G. If
the only purpose of the estimator is to predict, is there a way to cut down
on the attributes that are saved? [I was looking to serialize only the
necessary parts if possible.] Is there a better approach to
The size is dominated by the n_features * n_classes coef_ matrix,
which you can't get rid of just like that. What does your problem look
like?
Document classification of ~3000 categories with ~12000 documents.
The number of features comes out to be 500,000 [in which case the joblib
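The reported size is consistent with a quick back-of-envelope computation (plain arithmetic, assuming a dense float64 coef_ as described above):

```python
# coef_ is an (n_classes, n_features) matrix of float64 values.
n_classes = 3000
n_features = 500000
bytes_per_float64 = 8

size_gib = n_classes * n_features * bytes_per_float64 / 2**30
print(round(size_gib, 1))  # roughly the ~10G observed on disk
```

So the ~10G figure is essentially all coef_, which is why halving the dtype is the main lever.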
You could cut that in half by converting coef_ and optionally
intercept_ to np.float32 (that's not officially supported, but with
the current implementation it should work):
clf.coef_ = clf.coef_.astype(np.float32)
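The halving can be demonstrated on a toy model (synthetic data, invented shapes; as noted, the float32 trick is unofficial but works with the current implementation):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 20)
y = rng.randint(0, 5, size=100)

clf = SGDClassifier(random_state=0).fit(X, y)

before = clf.coef_.nbytes
# Convert both coef_ and intercept_ to float32 to halve their footprint.
clf.coef_ = clf.coef_.astype(np.float32)
clf.intercept_ = clf.intercept_.astype(np.float32)
after = clf.coef_.nbytes

print(before, after)  # after is exactly half of before
```

At 500,000 features and 3000 classes that is several gigabytes saved for free, at the cost of some precision in the scores.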
Matthieu Brucher matthieu.brucher@... writes:
Hi,
I think Samuel is asking if you are using a 32-bit version or a
64-bit one.
Cheers,
Matthieu
Ugh I just realized that after I posted, sorry my mistake..
It seems we are on 64-bit:
$ python -c 'import sys; print("%x" % sys.maxsize)'
Before I forget again, thanks all for the explanations and responses :)
Oddly enough, if I wrap SGDClassifier within OneVsRestClassifier and
set n_jobs=-1 for the OvR classifier, it goes fine, except of course the
compression still returns a corrupted file.
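For context, the OvR wrapping described above reads roughly like this sketch (synthetic data and shapes, purely illustrative): OneVsRestClassifier fits one binary SGD problem per class and n_jobs=-1 runs them in parallel.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.rand(120, 15)
y = rng.randint(0, 4, size=120)

# One binary SGDClassifier per class, trained in parallel.
ovr = OneVsRestClassifier(SGDClassifier(random_state=0), n_jobs=-1)
ovr.fit(X, y)

print(len(ovr.estimators_))  # one fitted binary classifier per class
```

That it pickles fine in this form but not as a plain multiclass SGDClassifier is consistent with the size, not the estimator type, triggering the compression bug.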
Gael Varoquaux gael.varoquaux@... writes:
On Wed, Jan 23, 2013 at 12:16:32AM +, Afik Cohen wrote:
Hi, I'm working with Ark on this project. Yes, that's what it looks like
- some investigation into this appears to show that either this is a bug
in zlib (the length returned
Ronnie Ghose ronnie.ghose@... writes:
do you think it isn't saving correctly or it isn't loading correctly?
I am thinking the issue is with writing the length of the compressed zfile
in write_zfile of numpy_pickle.py (although I might be wrong :) ).
[Er, btw bit unrelated, but
Hello,
I had been trying to dump a compressed joblib file (which was working fine
about a month ago). Previously I had an issue with the amount of memory that
joblib compression took, and it seemed that zlib was the issue, but I got
more memory to work around the problem.
However when I tried it
Ronnie Ghose ronnie.ghose@... writes:
Any point in adding data redundancy to joblib dumped objects?
Ah, sorry for not being clear before; the steps in ipython were just to
demonstrate the compression issue, not the flow of the code. I used an
already dumped object instead of
category.
I also tried downloading the 0.13 version from source and installing it.
This time I see a different error. The steps to reproduce for version 0.13
in ipython are as follows:
You can not necessarily load a classifier that was trained with one
version in another version.
Could you please provide a minimal code sample to reproduce and open an
issue on github?
Following is the minimal code to reproduce the issue (assuming the
classifier is already trained and saved). I will open the issue on github
for the same.
Can you try to turn off IDF normalization using `use_idf=False ` in
the constructor params of your vectorizer and retry (fit + predict) to
see if it's related to IDF normalization?
How many dimensions do you have in your fitted model?
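A hedged sketch of that experiment (the corpus and test document below are made up): refit with use_idf=False, keeping the other parameters, and time the transform again.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["some training text", "more training text here"]

# Same vectorizer minus IDF weighting; sublinear_tf kept as in the thread.
vec = TfidfVectorizer(use_idf=False, sublinear_tf=True)
vec.fit(corpus)

t0 = time.time()
X = vec.transform(["a new text to classify"])
elapsed = time.time() - t0

print(X.shape, len(vec.vocabulary_), elapsed)
```

If the 7s disappears with use_idf=False, the IDF normalization path is the place to look; the fitted dimensionality is just len(vec.vocabulary_).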
https://gist.github.com/3933727
data_vectors.shape = (10361,
I don't see the number of non-zeros: could you please do:
print vectorizer.transform([my_text_document])
as I asked previously? The run time should be linear with the number
of non-zeros.
ipdb> print self.vectorizer.transform([doc])
  (0, 687)
see if it's related to IDF normalization?
How many dimensions do you have in your fitted model?
print len(vectorizer.vocabulary_)
How many documents do you have in your training corpus?
How many non-zeros do you have in your transformed document?
print
Olivier Grisel olivier.grisel@... writes:
https://gist.github.com/3815467
The offending line seems to be:
       1    1.193    1.193    7.473    7.473 base.py:529(setdiag)
which I don't understand how it could happen at predict time. At fit
time it could have been:
7s is very long. How long is your text document in bytes?
The text documents are around 50kB.
It should not take 7s to extract a TF-IDF vector for a single 50kB
document. There must be a bug; can you please put a minimalistic code
snippet + example document that reproduces the issue on a
this stored vocabulary?]
Ark.
Olivier Grisel olivier.grisel@... writes:
2012/9/6 Ark ark_antos@...:
And how large in bytes? It seems that it should be small enough to be
able to use sklearn.linear_model.LogisticRegression despite the data
copy in memory.
Right now it's not even 100M, but it will extend to at least 1G.
Ark ark_antos@... writes:
How large (in bytes and in which format)? What are n_samples,
n_features and n_classes?
Input data is in the form of paragraphs from English literature.
So,
raw data -> CountVectorizer -> test/train split -> sgd.fit -> predict
is the flow.
n_samples=1