Re: [Scikit-learn-general] Unlabelled and mislabelled data

2013-09-17 Thread Ark
Thank you for the detailed explanation. I think the approach with the feedback mechanism seems appropriate at this point.

> If you plan to seriously increase the number of documents in your
> corpus you could also try a Rocchio classifier [1] or a k-NN
> classifier. For large text document collections …
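The k-NN suggestion above can be sketched as follows. This is a minimal illustration only; the corpus, labels, and query document below are hypothetical stand-ins for the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real document collection (hypothetical data).
docs = ["graph databases store edges", "linear models fit weights",
        "graph traversal uses edges", "weights are fit by gradient descent"]
labels = ["db", "ml", "db", "ml"]

# TF-IDF features feeding a 1-nearest-neighbor classifier.
knn = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", KNeighborsClassifier(n_neighbors=1)),
])
knn.fit(docs, labels)
print(knn.predict(["edges in a graph"])[0])
```

Unlike a linear model, k-NN stores the training vectors, so prediction cost grows with corpus size; that trade-off is the reason it is suggested here only alongside Rocchio for larger corpora.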

[Scikit-learn-general] Unlabelled and mislabelled data

2013-09-11 Thread Ark
In order to get a rough estimate, I was thinking of a clustering approach like k-means; however, since the number of categories might be less than 3000, does this seem to be the correct approach? Or if there is a better solution, I would certainly appreciate pointers. Regards, Ark
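For a corpus of that size, MiniBatchKMeans is the usual scalable variant of k-means on sparse TF-IDF vectors. A minimal sketch, with a hypothetical four-document corpus in place of the real one:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; the real one has thousands of documents.
docs = ["cats purr and sleep", "python code raises exceptions",
        "sleepy cats purr loudly", "exceptions in python code"]

X = TfidfVectorizer().fit_transform(docs)

# MiniBatchKMeans scales to large sparse matrices much better than plain KMeans.
km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
cluster_ids = km.fit_predict(X)
```

With the real data, `n_clusters` would be set near the suspected number of categories, and the resulting cluster sizes give the rough estimate asked about above.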

Re: [Scikit-learn-general] Next best match

2013-05-08 Thread Ark
> -or is there a best way to switch to something like knn (which initially …

Correction: -or is the best way to switch to something like knn?

[Scikit-learn-general] Next best match

2013-05-08 Thread Ark
I am using SGDClassifier for document classification, where (n_samples, n_features) = (12000, 50). In my project, in some cases the category chosen leads to post-processing the document and trying to predict again, in which case it should not predict the same category but return the …
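One common way to get a "next best match" from a linear model like SGDClassifier is to rank all classes by their `decision_function` scores instead of calling `predict`. A sketch on synthetic data (the shapes and dataset below are hypothetical, not the poster's real corpus):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the real document vectors.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
clf = SGDClassifier(random_state=0).fit(X, y)

# Rank every class by its decision score; ranked[0] is what predict() returns,
# ranked[1] is the next best category to fall back on after post-processing.
scores = clf.decision_function(X[:1])[0]
ranked = clf.classes_[np.argsort(scores)[::-1]]
```

This avoids retraining or switching to k-NN just to get a second choice: the full ranking is already implicit in the per-class scores.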

[Scikit-learn-general] Installing scikit via pip.

2013-05-02 Thread Ark
… in case anyone finds this helpful [and if it has any cascading effect I failed to notice]. Regards, Ark

Re: [Scikit-learn-general] Vectorizing input

2013-03-15 Thread Ark
> did you see my earlier reply?

Ah, you are right, sorry about that... any particular reason we reset the value?

Re: [Scikit-learn-general] Vectorizing input

2013-03-14 Thread Ark
> This is unexpected. Can you inspect the vocabulary_ on both
> vectorizers? Try computing their set.intersection, set.difference,
> set.symmetric_difference (all Python builtins).

In [17]: len(set.symmetric_difference(set(vect13.vocabulary_.keys()), set(vect14.vocabulary_.keys())))
Out[17]: …
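The set comparison suggested above needs nothing beyond Python builtins. A small self-contained sketch, with toy dictionaries standing in for the two fitted `vocabulary_` attributes:

```python
# Toy vocabularies standing in for vect13.vocabulary_ and vect14.vocabulary_
# (hypothetical terms; the real ones map token -> column index).
vocab_a = {"graph": 0, "database": 1, "edge": 2}
vocab_b = {"graph": 0, "edge": 1, "vertex": 2}

a, b = set(vocab_a), set(vocab_b)
print(a & b)                       # terms present in both vocabularies
print(a - b)                       # terms only the first vectorizer learned
print(a.symmetric_difference(b))   # terms learned by exactly one of the two
```

A non-empty symmetric difference pinpoints which tokens the two scikit-learn versions tokenized differently, which is the likely cause of the differing vectors.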

[Scikit-learn-general] Vectorizing input

2013-03-13 Thread Ark
The vectorized input with the same training data set differs between versions 0.13.1 and 0.14-git. For:

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), smooth_idf=True, sublinear_tf=True, max_df=0.5, token_pattern=ur'\b(?!\d)\w\w+\b')

On fit_transform …

Re: [Scikit-learn-general] Packaging large objects

2013-02-25 Thread Ark
> You could also try the HashingVectorizer in sklearn.feature_extraction
> and see if performance is still acceptable with a small number of
> features. That also skips storing the vocabulary, which I imagine will
> be quite large as well.

Due to a very large number of features (and to reduce the si…
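A minimal sketch of the HashingVectorizer suggestion above (toy documents are hypothetical; `n_features` is a tunable assumption, not a recommended value):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# 2**18 hashed features caps the model width; no vocabulary_ is stored at all,
# which is what makes the fitted object small and the transform stateless.
vec = HashingVectorizer(n_features=2**18, stop_words="english")
X = vec.transform(["a small example document", "another example"])
```

Because hashing is stateless, there is no `fit` step to serialize, which directly addresses the model-size concern in this thread; the trade-offs are possible hash collisions and no inverse mapping from columns back to tokens.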

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
> You could cut that in half by converting coef_ and optionally
> intercept_ to np.float32 (that's not officially supported, but with
> the current implementation it should work):
>
> clf.coef_ = clf.coef_.astype(np.float32)
>
> You could also try the HashingVectorizer in sklearn.featu…
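The float32 trick above can be verified end to end. A sketch on synthetic data (shapes are hypothetical); as the quoted mail warns, overwriting `coef_` like this is not officially supported:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in problem; the real one is ~3000 classes x 500k features.
X, y = make_classification(n_samples=100, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
clf = SGDClassifier(random_state=0).fit(X, y)

bytes_before = clf.coef_.nbytes
# Not officially supported, but halves the memory of the weight matrix:
clf.coef_ = clf.coef_.astype(np.float32)
clf.intercept_ = clf.intercept_.astype(np.float32)
pred = clf.predict(X[:5])  # prediction still works with float32 weights
```

Since `coef_` dominates the serialized size (see the next message in this thread), halving its dtype roughly halves the dumped model.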

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
> btw you could also use a different multiclass strategy like error-correcting output codes (exists in sklearn) or a binary tree of classifiers (would have to implement yourself)

Will explore the error-correcting output codes and the binary tree of classifiers. Thanks.
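The error-correcting output codes strategy mentioned above lives in `sklearn.multiclass.OutputCodeClassifier`. A minimal sketch on synthetic data (dataset and `code_size` are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OutputCodeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# code_size < 1 trains fewer binary classifiers than one-vs-rest would,
# which is exactly the model-size saving this thread is after.
ecoc = OutputCodeClassifier(SGDClassifier(random_state=0),
                            code_size=0.5, random_state=0)
ecoc.fit(X, y)
pred = ecoc.predict(X[:5])
```

With `code_size=0.5` and 4 classes, only 2 binary estimators are fitted instead of 4, shrinking the stored coefficients proportionally, at some cost in accuracy.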

Re: [Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
> The size is dominated by the n_features * n_classes coef_ matrix,
> which you can't get rid of just like that. What does your problem look
> like?

Document classification of ~3000 categories with ~12000 documents. The number of features comes out to be 500,000 [in which case the joblib c…
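These numbers are consistent with the ~10 GB model size reported in this thread: a dense float64 coef_ matrix alone accounts for roughly 12 GB.

```python
# Back-of-the-envelope size of the dense coef_ matrix for this problem.
n_classes = 3000
n_features = 500_000
bytes_per_float64 = 8

coef_bytes = n_classes * n_features * bytes_per_float64
print(coef_bytes / 1e9)  # GB; in line with the ~10 GB model observed
```

This is why the replies focus on shrinking coef_ itself (float32 conversion, fewer features via hashing, or fewer binary classifiers via output codes) rather than on the serialization format.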

[Scikit-learn-general] Packaging large objects

2013-02-21 Thread Ark
I have been wondering about what makes the size of an SGD classifier ~10 GB. If the only purpose of the estimator is to predict, is there a way to cut down on the attributes that are saved? [I was looking to serialize only the necessary parts if possible.] Is there a better approach to packaging …

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-25 Thread Ark
> Before I forget again, thanks all for explanations and responses :)

Oddly enough, if I wrap SGDClassifier within OneVsRestClassifier and set n_jobs=-1 for the OvR classifier, it goes fine, except of course the compression still returns a corrupted file.

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-25 Thread Ark
Before I forget again, thanks all for explanations and responses :)

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-25 Thread Ark
Matthieu Brucher writes:
> Hi, I think Samuel is asking if you are using a 32-bit version or a 64-bit. Cheers, Matthieu

Ugh, I just realized that after I posted, sorry, my mistake. It seems we are on 64-bit:

$ python -c 'import sys; print("%x" % sys.maxsize, sys.maxsize > 2**32)'

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-25 Thread Ark
Samuel Garcia writes:
> 2147483647 = 2**31-1 = 2 GB. Are you running a 32-bit Python?

No, I am using Python 2.6. The issue with upgrading Python is that all other systems depend on the version and would need many code upgrades; maybe we will do that eventually, but for now we are staying on Python 2.6.

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-24 Thread Ark
Ronnie Ghose writes:
> do you think it isn't saving correctly or it isn't loading correctly?

I am thinking the issue is with writing the length of the compressed zfile, in the write_zfile of numpy_pickle.py (although I might be wrong :) ). [Er, btw, a bit unrelated, but since I moved t…

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-24 Thread Ark
Gael Varoquaux writes:
> On Wed, Jan 23, 2013 at 12:16:32AM +, Afik Cohen wrote:
> > Hi, I'm working with Ark on this project. Yes, that's what it looks like
> > - some investigation into this appears to show that either this is a bug
> > in zlib (the l…

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-22 Thread Ark
Ronnie Ghose writes:
> Any point in adding data redundancy to joblib dumped objects?

Ah, sorry for not being clear before; the steps in ipython were just to demonstrate the compression issue, not the flow of the code. I used an already dumped object instead of retraining. [Unless I…

[Scikit-learn-general] Joblib compressed file error?

2013-01-22 Thread Ark
Hello, I had been trying to dump a compressed joblib file (which was working fine about a month ago). Previously I had an issue with the amount of memory that joblib compression took, and it seemed that zlib was the issue; I added more memory to work around that. However, when I tried it …

Re: [Scikit-learn-general] Ovr Classifier predict error

2013-01-16 Thread Ark
> > … category. I also tried downloading the 0.13 version from source and
> > installing it. This time I see a different error. The steps to reproduce
> > for version 0.13 in ipython are as follows:

> You cannot necessarily load a classifier that was trained with one
> version in another version.

Re: [Scikit-learn-general] Ovr Classifier predict error

2013-01-14 Thread Ark
> Could you please provide a minimum code sample to reproduce and open an
> issue on github.

Following is the minimal code to reproduce the issue (assuming the classifier is already trained and saved). I will open the issue on github for the same.

[Scikit-learn-general] Ovr Classifier predict error

2013-01-10 Thread Ark
Hello, I see an issue with predict in the case of predicting a text document. [I load an already trained classifier (OneVsRest(SGDClassifier(loss=log))) using joblib.load.] Thanks.

In [1]: import sklearn
In [2]: from sklearn.externals import joblib
In [4]: clf = joblib.load("classifier.joblib…
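For reference, a dump/load round trip along the lines of this thread can be sketched as follows. The corpus and labels are hypothetical, and note two deliberate modernizations: current installs ship `joblib` as a standalone package (the thread's `sklearn.externals.joblib` was later removed), and the default hinge loss is used instead of the thread's `loss=log`:

```python
import os
import tempfile

import joblib  # modern replacement for sklearn.externals.joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the real training documents.
docs = ["spam spam spam", "important meeting today",
        "buy cheap spam", "project deadline today"]
labels = ["spam", "ham", "spam", "ham"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(SGDClassifier(random_state=0))),
])
pipe.fit(docs, labels)

# Round-trip through joblib, as in the thread's workflow.
path = os.path.join(tempfile.mkdtemp(), "classifier.joblib")
joblib.dump(pipe, path)
restored = joblib.load(path)
```

Persisting the whole Pipeline (vectorizer plus classifier) avoids the class of predict-time errors discussed here, where the loaded classifier receives raw text it cannot consume, and sidesteps the cross-version loading caveat only as long as dump and load use the same scikit-learn version.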

Re: [Scikit-learn-general] TF-Idf

2012-10-25 Thread Ark
> Can you try to turn off IDF normalization using use_idf=False in
> the constructor params of your vectorizer and retry (fit + predict) to
> see if it's related to IDF normalization?
> How many dimensions do you have in your fitted model?

https://gist.github.com/3933727
data_vectors.shape = (10361…

Re: [Scikit-learn-general] TF-Idf

2012-10-22 Thread Ark
> … see if it's related to IDF normalization?
> How many dimensions do you have in your fitted model?
> >>> print len(vectorizer.vocabulary_)
> How many documents do you have in your training corpus?
> How many non-zeros do you have in your transformed document?
> >>> print vectorizer.tran…

Re: [Scikit-learn-general] TF-Idf

2012-10-22 Thread Ark
> I don't see the number of non-zeros: could you please do:
> >>> print vectorizer.transform([my_text_document])
> as I asked previously? The run time should be linear with the number
> of non-zeros.

ipdb> print self.vectorizer.transform([doc])
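The quantities being asked for above (non-zeros per transformed document, vocabulary size) can be read off directly; a self-contained sketch with a hypothetical two-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["one small document", "another small document here"])

row = vectorizer.transform(["one small document"])
print(row.nnz)                       # non-zeros in the transformed document
print(len(vectorizer.vocabulary_))   # dimensions of the fitted model
```

`row.nnz` is the number the run time should scale with; if it is small while transform still takes seconds, the slowdown is elsewhere in the pipeline, which is the point of the request above.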

Re: [Scikit-learn-general] TF-Idf

2012-10-12 Thread Ark
Olivier Grisel writes:
> https://gist.github.com/3815467
> The offending line seems to be:
>     1    1.193    1.193    7.473    7.473  base.py:529(setdiag)
> which I don't understand how it could happen at predict time. At fit
> time it could have been: https://github.com/sci…

Re: [Scikit-learn-general] TF-Idf

2012-10-01 Thread Ark
> > 7s is very long. How long is your text document in bytes?
> The text documents are around 50 kB.

> That should not take 7s to extract a TF-IDF for a single 50 kB
> document. There must be a bug; can you please put a minimalistic code
> snippet + example document that reproduces the issue o…

Re: [Scikit-learn-general] TF-Idf

2012-09-24 Thread Ark
Olivier Grisel writes:
> You can use the Pipeline class to build a compound classifier that
> binds a text feature extractor with a classifier to get a text
> document classifier in the end.

Done!

> 7s is very long. How long is your text document in bytes?

The text documents are around …

[Scikit-learn-general] TF-Idf

2012-09-21 Thread Ark
… stored vocabulary?] Ark.

Re: [Scikit-learn-general] Multi-class sparse data

2012-09-10 Thread Ark
Olivier Grisel writes:
> 2012/9/6 Ark:
> > > And how large in bytes? It seems that it should be small enough to be
> > > able to use sklearn.linear_model.LogisticRegression despite the data
> > > copy in memory.
> > R…

Re: [Scikit-learn-general] Multi-class sparse data

2012-09-06 Thread Ark
> And how large in bytes? It seems that it should be small enough to be
> able to use sklearn.linear_model.LogisticRegression despite the data
> copy in memory.

Right now it's not even 100 MB, but it will extend to at least 1 GB.

Re: [Scikit-learn-general] Multi-class sparse data

2012-09-05 Thread Ark
Ark writes:
> > How large (in bytes and in which format)? What are n_samples,
> > n_features and n_classes?

Input data is in the form of paragraphs from English literature. So: raw data -> CountVectorizer -> test/train set -> sgd.fit -> …

Re: [Scikit-learn-general] Multi-class sparse data

2012-09-05 Thread Ark
> How large (in bytes and in which format)? What are n_samples,
> n_features and n_classes?

Input data is in the form of paragraphs from English literature. n_samples=1, n_features=100,000, n_classes=max 100 [still collecting data]

[Scikit-learn-general] Multi-class sparse data

2012-09-05 Thread Ark
What would be the best approach to classify a large dataset with sparse features into multiple categories? I referred to the multiclass page in the sklearn documentation, but was not sure which one to use for multiclass probabilities [top-n probabilities would be nice]. I tried usin…
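Top-n class probabilities can be read straight out of any estimator with `predict_proba`; LogisticRegression (suggested elsewhere in this thread) is one such option. A sketch on synthetic data standing in for the real sparse features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical dense stand-in; the real input would be a sparse matrix.
X, y = make_classification(n_samples=300, n_features=30, n_informative=12,
                           n_classes=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Per-class probabilities for one sample, then the 3 most likely classes.
proba = clf.predict_proba(X[:1])[0]
top3 = clf.classes_[np.argsort(proba)[::-1][:3]]
```

The same `argsort` pattern works with any probabilistic multiclass estimator, so the choice among the strategies on the multiclass page does not constrain the top-n extraction itself.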