[scikit-learn] Which sparse matrix should be used for fit?

2020-01-28 Thread Peng Yu
https://scikit-learn.org/stable/modules/svm.html
Of the SVM classes mentioned above, which sparse matrices are appropriate to use with them?
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
It is not very clear what matrix operations
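For reference, the SVM page linked above recommends `scipy.sparse.csr_matrix` with `dtype=float64` for sparse input (and a C-ordered `numpy.ndarray` for dense input). The sketch below fits an `SVC` on CSR data under that recommendation; the toy data and variable names are made up for illustration and do not come from the thread.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative only, not from the thread).
X_dense = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([0, 0, 1, 1])

# CSR with float64 is the sparse layout the SVM documentation recommends.
X_csr = sparse.csr_matrix(X_dense, dtype=np.float64)

clf = SVC(kernel="linear")
clf.fit(X_csr, y)

# A model fit on sparse data is queried with sparse data as well.
X_new = sparse.csr_matrix(np.array([[0.0, 0.5], [2.0, 2.5]]))
pred = clf.predict(X_new)
print(pred)
```

Other sparse formats can be converted first with `.tocsr()`; whether that conversion cost matters depends on the data, so treat this as a starting point rather than a rule.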

Re: [scikit-learn] How to make sure stop words are matched when lowercase=False?

2020-01-28 Thread Joel Nothman
There is no such code. You need to make sure that the normalisation you use matches the normalisation applied when constructing a stop word list. Unfortunately we do not provide for this directly, and it is not easy to do so in the general case.
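A small sketch of the mismatch described above, and of one possible workaround: with `lowercase=False`, a lowercase stop list does not match capitalised tokens, so the list has to be normalised the same way as the text. The two-word stop list and the cased-variant expansion are hypothetical choices for illustration, not something scikit-learn provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat"]
stop = ["the", "on"]  # hypothetical lowercase stop list

# With lowercase=False, "The" is not equal to "the", so it is kept.
cv_mismatch = CountVectorizer(lowercase=False, stop_words=stop)
cv_mismatch.fit(docs)
vocab_mismatch = sorted(cv_mismatch.vocabulary_)

# One workaround (an assumption, not a library feature): expand the stop
# list with the casings expected in the text so both sides agree.
stop_cased = sorted({v for w in stop for v in (w, w.capitalize(), w.upper())})
cv_matched = CountVectorizer(lowercase=False, stop_words=stop_cased)
cv_matched.fit(docs)
vocab_matched = sorted(cv_matched.vocabulary_)

print(vocab_mismatch)
print(vocab_matched)
```

Expanding casings only covers the variants you list; mixed-case tokens such as "tHe" would still pass through, which is part of why this is hard to solve in the general case.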

[scikit-learn] How to make sure stop words are matched when lowercase=False?

2020-01-28 Thread Peng Yu
Hi, https://github.com/scikit-learn/scikit-learn/blob/002f891a33b612be389d9c488699db5689753ef4/sklearn/feature_extraction/text.py#L587 The default of lowercase is True. But stopwords are lower case. Where is the code to make sure the stop words are removed when they are not in lower case?

Re: [scikit-learn] Recommended way of distributing persisted models so they work on different architectures

2020-01-28 Thread Joel Nothman
Yes, ONNX is an appropriate solution when exporting models for prediction. See http://scikit-learn.org/stable/modules/model_persistence.html
On Tue, 28 Jan 2020 at 23:03, Christopher.samiullah via scikit-learn <scikit-learn@python.org> wrote:
> Dear admins,
>
> I recently encountered an issue

[scikit-learn] Recommended way of distributing persisted models so they work on different architectures

2020-01-28 Thread Christopher.samiullah via scikit-learn
Dear admins, I recently encountered an issue attempting to load a model persisted via joblib dump on different Python architectures. I wrote up the issue here on Stack Overflow:

Re: [scikit-learn] Memory efficient TfidfVectorizer

2020-01-28 Thread Peng Yu
> Are you concerned about storing the whole corpus text in memory, or the
> whole corpus' statistics? If the text, use input='file' or input='filename'
> (or a generator of texts).
I am not really sure which stage takes the most memory, as my program gets killed when it hits the memory limit. But I

Re: [scikit-learn] Memory efficient TfidfVectorizer

2020-01-28 Thread Joel Nothman
Are you concerned about storing the whole corpus text in memory, or the whole corpus' statistics? If the text, use input='file' or input='filename' (or a generator of texts).
On Tue, 28 Jan 2020 at 18:01, Peng Yu wrote:
> Hi,
>
> To use TfidfVectorizer, the whole corpus must be used into
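The two options suggested in this reply can be sketched as follows: `input='filename'` lets the vectorizer read each document from disk as it goes, and a generator yields texts one at a time. The three sample texts and the temporary files are made up for illustration. Note that this only avoids holding the raw text in memory; the vocabulary and document statistics are still built in memory.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample corpus (illustrative only).
texts = [
    "sparse matrices save memory",
    "tf-idf weights terms by rarity",
    "memory is limited",
]

# Option 1: pass file paths and let the vectorizer open each file itself.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(tmp, f"doc{i}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        paths.append(path)

    X_files = TfidfVectorizer(input="filename").fit_transform(paths)


# Option 2: stream documents from a generator instead of a list.
def iter_texts():
    for text in texts:
        yield text


X_gen = TfidfVectorizer().fit_transform(iter_texts())

print(X_files.shape, X_gen.shape)
```

In a real pipeline the generator would read from disk or a database rather than from an in-memory list; the list here only keeps the sketch self-contained.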