Re: [Scikit-learn-general] Unlabelled and mislabelled data

2013-09-17 Thread Ark
Thank you for the detailed explanation. I think the approach with the feedback mechanism seems appropriate at this point. If you plan to seriously increase the number of documents in your corpus, you could also try a Rocchio classifier [1] or a k-NN classifier. For large text documents
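
As a hedged illustration of the Rocchio suggestion: scikit-learn's NearestCentroid behaves like a Rocchio classifier when fed tf-idf vectors. The toy corpus and labels below are placeholders, not the thread's data.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestCentroid

    docs = ["first labelled document", "second labelled document",
            "third labelled document", "fourth labelled document"]  # placeholders
    labels = [0, 1, 0, 1]

    vect = TfidfVectorizer()
    X = vect.fit_transform(docs)
    clf = NearestCentroid().fit(X, labels)  # one centroid per class, Rocchio-style
    print(clf.predict(vect.transform(["an unseen document"])))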

[Scikit-learn-general] Unlabelled and mislabelled data

2013-09-11 Thread Ark
. In order to get a rough estimate I was thinking of a clustering approach like k-means; however, since the number of categories might be less than 3000, does this seem to be the correct approach? Or if there is a better solution, I would certainly appreciate pointers. Regards, Ark
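
A minimal sketch of that k-means idea, using MiniBatchKMeans since plain KMeans gets expensive with thousands of clusters. The corpus is a placeholder and n_clusters is kept tiny so the toy runs; in the thread's setting it would be near the suspected number of categories (up to ~3000).

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans

    docs = ["first unlabelled text", "second unlabelled text",
            "third unlabelled text", "fourth unlabelled text"]  # placeholder corpus
    X = TfidfVectorizer().fit_transform(docs)

    km = MiniBatchKMeans(n_clusters=2, random_state=0)  # ~3000 in the real setting
    print(km.fit_predict(X))  # one rough cluster id per document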

[Scikit-learn-general] Next best match

2013-05-08 Thread Ark
I am using SGDClassifier for document classification, where (n_samples, n_features) = (12000, 50). In my project, in some cases the category chosen leads to post-processing the document and predicting again, in which case it should not predict the same category, but return
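
One way to return a "next best" category, sketched under the assumption that an SGDClassifier is in play: rank the classes by decision_function and take the runner-up. The toy data is illustrative.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    X = np.array([[0., 0.], [1., 1.], [2., 2.],
                  [0., 1.], [1., 2.], [2., 0.]])  # toy feature vectors
    y = np.array([0, 1, 2, 0, 1, 2])
    clf = SGDClassifier(random_state=0).fit(X, y)

    scores = clf.decision_function(X[:1])[0]  # per-class scores for one document
    ranked = np.argsort(scores)[::-1]         # best class first
    print(clf.classes_[ranked[0]], clf.classes_[ranked[1]])  # top and next best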

Re: [Scikit-learn-general] Next best match

2013-05-08 Thread Ark
-or is there a best way to switch to something like knn (which initially Correction: -or is the best way to switch to something like knn?

[Scikit-learn-general] Installing scikit via pip.

2013-05-02 Thread Ark
this helpful; [and if it has any cascading effect I failed to notice]. Regards, Ark

Re: [Scikit-learn-general] Vectorizing input

2013-03-15 Thread Ark
amueller@... writes: did you see my earlier reply? Ah, you are right, sorry about that... any particular reason we reset the value?

Re: [Scikit-learn-general] Vectorizing input

2013-03-14 Thread Ark
This is unexpected. Can you inspect the vocabulary_ on both vectorizers? Try computing their set.intersection, set.difference, set.symmetric_difference (all Python builtins). In [17]: len(set.symmetric_difference(set(vect13.vocabulary_.keys()), set(vect14.vocabulary_.keys()))) Out[17]:
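
A self-contained toy version of that comparison, with two deliberately different vectorizers standing in for the 0.13.1 and 0.14-git ones:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the quick brown fox", "jumps over the lazy dog"]  # toy corpus
    vect_a = TfidfVectorizer().fit(docs)
    vect_b = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)     # different config

    vocab_a, vocab_b = set(vect_a.vocabulary_), set(vect_b.vocabulary_)
    print(len(vocab_a & vocab_b))          # shared terms
    print(sorted(vocab_a ^ vocab_b)[:10])  # sample of the symmetric difference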

[Scikit-learn-general] Vectorizing input

2013-03-13 Thread Ark
The vectorized input with the same training data set differs between versions 0.13.1 and 0.14-git. For: vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2), smooth_idf=True, sublinear_tf=True, max_df=0.5, token_pattern=ur'\b(?!\d)\w\w+\b') On

Re: [Scikit-learn-general] Packaging large objects

2013-02-25 Thread Ark
You could also try the HashingVectorizer in sklearn.feature_extraction and see if performance is still acceptable with a small number of features. That also skips storing the vocabulary, which I imagine will be quite large as well. Due to a very large number of features (and to reduce the size),
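
A minimal sketch of that HashingVectorizer suggestion: it is stateless, so there is no vocabulary_ to pickle, and n_features caps the dimensionality up front. The corpus is a placeholder.

    from sklearn.feature_extraction.text import HashingVectorizer

    docs = ["a small example document", "another example document"]  # placeholders
    vect = HashingVectorizer(n_features=2 ** 18, stop_words='english')
    X = vect.transform(docs)  # no fit needed, nothing to serialize
    print(X.shape)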

[Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
I have been wondering about what makes the size of an SGD classifier ~10G. If the only purpose of the estimator is to predict, is there a way to cut down on the attributes that are saved [I was looking to serialize only the necessary parts if possible]. Is there a better approach to

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
The size is dominated by the n_features * n_classes coef_ matrix, which you can't get rid of just like that. What does your problem look like? Document classification of ~3000 categories with ~12000 documents. The number of features comes out to be 500,000 [in which case the joblib
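
Those numbers are consistent with the ~10G observation: a dense float64 coef_ of 500,000 features by 3,000 classes alone is about 12 GB.

    n_features, n_classes = 500000, 3000
    print(n_features * n_classes * 8 / 1e9)  # float64 bytes -> roughly 12 GB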

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
You could cut that in half by converting coef_ and optionally intercept_ to np.float32 (that's not officially supported, but with the current implementation it should work): clf.coef_ = clf.coef_.astype(np.float32) You could also try the HashingVectorizer in
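
A self-contained sketch of that trick (unsupported, as noted above); the toy model and filename stand in for the real classifier, and sklearn.externals.joblib reflects the era of the thread.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.externals import joblib  # plain 'import joblib' in modern sklearn

    X = np.random.RandomState(0).rand(20, 5)  # toy stand-in for document vectors
    y = np.arange(20) % 4
    clf = SGDClassifier(random_state=0).fit(X, y)

    clf.coef_ = clf.coef_.astype(np.float32)         # halves coef_ memory
    clf.intercept_ = clf.intercept_.astype(np.float32)
    joblib.dump(clf, 'clf_float32.pkl', compress=3)  # illustrative filename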

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-25 Thread Ark
Matthieu Brucher matthieu.brucher@... writes: Hi, I think Samuel is asking if you are using a 32-bit version or a 64-bit one. Cheers, Matthieu Ugh, I just realized that after I posted, sorry, my mistake. It seems we are on 64-bit: $ python -c 'import sys; print("%x" % sys.maxsize,

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-25 Thread Ark
Before I forget again, thanks all for the explanations and responses :)

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-25 Thread Ark
Before I forget again, thanks all for the explanations and responses :) Oddly enough, if I wrap SGDClassifier within OneVsRestClassifier and set n_jobs=-1 for the OvR classifier, it goes fine, except of course the compression still returns a corrupted file.
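
For reference, a sketch of that wrapping, on placeholder data:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import SGDClassifier

    X = np.random.RandomState(0).rand(30, 10)  # toy stand-in for the real vectors
    y = np.arange(30) % 3
    ovr = OneVsRestClassifier(SGDClassifier(random_state=0), n_jobs=-1)
    print(ovr.fit(X, y).predict(X[:5]))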

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-24 Thread Ark
Gael Varoquaux gael.varoquaux@... writes: On Wed, Jan 23, 2013 at 12:16:32AM +, Afik Cohen wrote: Hi, I'm working with Ark on this project. Yes, that's what it looks like - some investigation into this appears to show that either this is a bug in zlib (the length returned

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-24 Thread Ark
Ronnie Ghose ronnie.ghose@... writes: do you think it isn't saving correctly or it isn't loading correctly? I am thinking the issue is with writing the length of the compressed zfile, in write_zfile of numpy_pickle.py (although I might be wrong :) ). [Er, btw, bit unrelated, but

[Scikit-learn-general] Joblib compressed file error?

2013-01-22 Thread Ark
Hello, I had been trying to dump a compressed joblib file (which was working fine about a month ago). Previously I had an issue with the amount of memory that joblib compression took, and it seemed that zlib was the issue, but I added more memory to work around the problem. However, when I tried it
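
For context, the failing operation has the following shape (the object and filename are placeholders):

    import numpy as np
    from sklearn.externals import joblib  # plain 'import joblib' in modern sklearn

    big = {'weights': np.zeros((1000, 1000))}  # placeholder large object
    joblib.dump(big, 'model.pkl', compress=3)  # zlib-compressed pickle
    restored = joblib.load('model.pkl')
    print(restored['weights'].shape)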

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-22 Thread Ark
Ronnie Ghose ronnie.ghose@... writes: Any point in adding data redundancy to joblib dumped objects? Ah, sorry for not being clear before; the steps in ipython were just to demonstrate the compression issue, not the flow of the code. I used an already dumped object instead of

Re: [Scikit-learn-general] Ovr Classifier predict error

2013-01-16 Thread Ark
category. I also tried downloading the 0.13 version from source and installing it. This time I see a different error. The steps to reproduce for version 0.13 in ipython are as follows: You cannot necessarily load a classifier that was trained with one version in another version. Could

Re: [Scikit-learn-general] Ovr Classifier predict error

2013-01-14 Thread Ark
Could you please provide a minimal code sample to reproduce and open an issue on github. Following is the minimal code to reproduce the issue (assuming the classifier is already trained and saved). I will open the issue on github for the same.
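
The actual repro code was elided from the archive; as a hedged stand-in, the round-trip at issue has this shape (data, model, and filename are placeholders, and the load must happen under the same scikit-learn version as the dump, per the reply above):

    import numpy as np
    from sklearn.externals import joblib  # plain 'import joblib' in modern sklearn
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import SGDClassifier

    X = np.random.RandomState(0).rand(30, 10)  # placeholder training vectors
    y = np.arange(30) % 3
    joblib.dump(OneVsRestClassifier(SGDClassifier(random_state=0)).fit(X, y),
                'ovr_clf.pkl')

    clf = joblib.load('ovr_clf.pkl')  # reload with the same sklearn version
    print(clf.predict(X[:1]))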

Re: [Scikit-learn-general] TF-Idf

2012-10-25 Thread Ark
Can you try to turn off IDF normalization using `use_idf=False` in the constructor params of your vectorizer and retry (fit + predict) to see if it's related to IDF normalization? How many dimensions do you have in your fitted model? https://gist.github.com/3933727 data_vectors.shape = (10361,
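
The suggested check, as a toy sketch with a placeholder corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["some document text", "another document text"]  # placeholder corpus
    vect = TfidfVectorizer(use_idf=False)
    X = vect.fit_transform(docs)
    print(X.shape, X.nnz)  # dimensions and non-zeros of the fitted model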

Re: [Scikit-learn-general] TF-Idf

2012-10-22 Thread Ark
I don't see the number of non-zeros: could you please do: print vectorizer.transform([my_text_document]) as I asked previously? The run time should be linear in the number of non-zeros. ipdb> print self.vectorizer.transform([doc]) (0, 687)
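
The non-zero count being asked for can also be read directly off the scipy sparse matrix; a toy version:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vect = TfidfVectorizer().fit(["a toy corpus", "with two documents"])
    X = vect.transform(["one toy document"])
    print(X.nnz)  # number of non-zero features in the transformed document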

Re: [Scikit-learn-general] TF-Idf

2012-10-22 Thread Ark
see if it's related to IDF normalization? How many dimensions do you have in your fitted model? print len(vectorizer.vocabulary_) How many documents do you have in your training corpus? How many non-zeros do you have in your transformed document? print

Re: [Scikit-learn-general] TF-Idf

2012-10-12 Thread Ark
Olivier Grisel olivier.grisel@... writes: https://gist.github.com/3815467 The offending line seems to be: 1  1.193  1.193  7.473  7.473  base.py:529(setdiag) and I don't understand how it could happen at predict time. At fit time it could have been:
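
The quoted line follows cProfile's ncalls/tottime/percall/cumtime/percall column layout; a profile like the one in the gist can be produced along these lines (an assumption about how it was generated, on toy data):

    import cProfile
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    X = np.random.RandomState(0).rand(100, 20)  # toy stand-in for tf-idf vectors
    y = np.arange(100) % 3
    clf = SGDClassifier(random_state=0).fit(X, y)

    cProfile.run('clf.predict(X)', sort='cumulative')  # per-call/cumulative times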

Re: [Scikit-learn-general] TF-Idf

2012-10-01 Thread Ark
7s is very long. How long is your text document in bytes? The text documents are around 50 kB. That should not take 7s to extract a TF-IDF vector for a single 50 kB document. There must be a bug; can you please put a minimalistic code snippet + example document that reproduces the issue on a

[Scikit-learn-general] TF-Idf

2012-09-21 Thread Ark
this stored vocabulary?] Ark.

Re: [Scikit-learn-general] Multi-class sparse data

2012-09-10 Thread Ark
Olivier Grisel olivier.grisel@... writes: 2012/9/6 Ark ark_antos@...: And how large in bytes? It seems that it should be small enough to be able to use sklearn.linear_model.LogisticRegression despite the data copy in memory. Right now it's not even 100M, but it will extend

Re: [Scikit-learn-general] Multi-class sparse data

2012-09-06 Thread Ark
And how large in bytes? It seems that it should be small enough to be able to use sklearn.linear_model.LogisticRegression despite the data copy in memory. Right now it's not even 100M, but it will extend to at least 1G.

Re: [Scikit-learn-general] Multi-class sparse data

2012-09-05 Thread Ark
Ark ark_antos@... writes: How large (in bytes and in which format)? What are n_samples, n_features and n_classes? Input data is in the form of paragraphs from English literature. So the flow is: raw data -> CountVectorizer -> train/test split -> sgd.fit -> predict. n_samples=1
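
A sketch of that flow end to end, on placeholder paragraphs (sklearn.cross_validation matches the era of the thread; modern scikit-learn has train_test_split in sklearn.model_selection):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.cross_validation import train_test_split  # model_selection today

    docs = ["paragraph one of some novel", "paragraph two of some novel",
            "paragraph three of some novel", "paragraph four of some novel"]
    labels = [0, 1, 0, 1]  # placeholder category ids

    X = CountVectorizer().fit_transform(docs)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
    clf = SGDClassifier(random_state=0).fit(X_train, y_train)
    print(clf.predict(X_test))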