[scikit-learn] The exact formula used to compute the tf-idf

2020-02-01 Thread Peng Yu
Hi, I am trying to understand the exact formula for tf-idf.

    vectorizer = TfidfVectorizer(ngram_range=(1, 1), norm=None)
    wordtfidf = vectorizer.fit_transform(texts)

Given the following 3 documents (id1, id2, id3 are the IDs of the three documents).

    id1 AA BB BB CC CC CC
    id2 AA AA AA
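For reference, the scikit-learn documentation describes the default weighting (smooth_idf=True, sublinear_tf=False) as tf-idf(t, d) = tf(t, d) * (ln((1 + n) / (1 + df(t))) + 1), where n is the number of documents and df(t) the number of documents containing term t. With norm=None, no row normalization is applied afterwards. The sketch below recomputes that formula in plain Python. It uses only the two documents visible in the message (the third is cut off in this listing), so the numbers will differ from the poster's full corpus. Lowercasing plus a whitespace split is an assumption standing in for the vectorizer's tokenizer, and tfidf_unnormalized is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_unnormalized(docs):
    """Smoothed tf-idf without normalization: tf * (ln((1 + n) / (1 + df)) + 1)."""
    # Raw term counts per document (lowercased, split on whitespace).
    counts = [Counter(doc.lower().split()) for doc in docs]
    n = len(docs)
    # Document frequency: iterating a Counter yields each term once per document.
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


# Only the two documents that are visible in the message.
docs = ["AA BB BB CC CC CC", "AA AA AA"]
weights = tfidf_unnormalized(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that occurs in every document gets idf = ln(1) + 1 = 1, so its weight equals its raw count. That is why the result is never zero with this variant of the formula.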

[scikit-learn] Incremental generation of tf-idf matrix

2020-01-29 Thread Peng Yu
Hi, It seems that even if there is a slight change in the corpus, I have to run TfidfVectorizer on the whole corpus again. This can be time-consuming, especially for large corpora. Is there a way to generate the tf-idf matrix incrementally so that if there is a slight change in the corpus, it will
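TfidfVectorizer itself does not offer a partial_fit method, so one possible workaround is to keep the raw term counts and document frequencies around and derive the weights on demand. The sketch below is plain Python, not a scikit-learn feature. IncrementalTfidf and its methods are hypothetical names, the smoothed idf formula is assumed to match the scikit-learn default, and the third document is made up for illustration.

```python
import math
from collections import Counter


class IncrementalTfidf:
    """Keeps per-document term counts and global document frequencies."""

    def __init__(self):
        self.term_counts = {}      # doc_id -> Counter of raw term counts
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add(self, doc_id, text):
        # Re-adding an existing id replaces the old version of the document.
        if doc_id in self.term_counts:
            self.remove(doc_id)
        counts = Counter(text.lower().split())
        self.term_counts[doc_id] = counts
        self.doc_freq.update(counts.keys())

    def remove(self, doc_id):
        counts = self.term_counts.pop(doc_id)
        for term in counts:
            self.doc_freq[term] -= 1
            if self.doc_freq[term] == 0:
                del self.doc_freq[term]

    def weights(self, doc_id):
        # idf depends on the current corpus, so it is computed at query time.
        n = len(self.term_counts)
        return {
            term: tf * (math.log((1 + n) / (1 + self.doc_freq[term])) + 1.0)
            for term, tf in self.term_counts[doc_id].items()
        }


index = IncrementalTfidf()
index.add("id1", "AA BB BB CC CC CC")
index.add("id2", "AA AA AA")
before = index.weights("id1")

index.add("id3", "BB DD")  # hypothetical new document
after = index.weights("id1")
print(before)
print(after)
```

Adding or removing a document only touches the counts for that document. The cost moves to query time, because every idf value can change whenever the corpus changes.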

[scikit-learn] Which sparse matrix should be used for fit?

2020-01-28 Thread Peng Yu
https://scikit-learn.org/stable/modules/svm.html Of the SVM classes mentioned above, which sparse matrices are appropriate to use with them? https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix It is not very clear what matrix operations
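The SVM user guide linked above recommends scipy.sparse.csr_matrix with dtype=float64 for sparse input. A minimal sketch of that setup follows. The toy data is made up for illustration and is linearly separable on purpose.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

# Toy data (an assumption for this example): two well-separated classes.
X_dense = np.array(
    [
        [0.0, 1.0, 0.0],
        [0.0, 2.0, 0.0],
        [3.0, 0.0, 0.0],
        [4.0, 0.0, 1.0],
    ]
)
y = np.array([0, 0, 1, 1])

# CSR with float64 is the sparse layout the SVM guide recommends.
X = sparse.csr_matrix(X_dense, dtype=np.float64)

clf = SVC(kernel="linear").fit(X, y)
pred = clf.predict(X)
print(X.format, X.dtype, pred)
```

Other sparse formats can be converted with .tocsr() before fitting.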

[scikit-learn] How to make sure stop words are matched when lowercase=False?

2020-01-28 Thread Peng Yu
Hi, https://github.com/scikit-learn/scikit-learn/blob/002f891a33b612be389d9c488699db5689753ef4/sklearn/feature_extraction/text.py#L587 The default of lowercase is True. But stopwords are lower case. Where is the code to make sure the stop words are removed when they are not in lower case?
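A small experiment can show what happens in practice. In the sketch below, the built-in English stop list is compared against tokens exactly as they come out of preprocessing, so with lowercase=False a capitalized "The" survives. The last vectorizer shows one possible workaround, a stop list extended with capitalized variants. That workaround is an assumption of this example, not something the library does for you, and the sample sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

text = ["The cat and the dog"]

# Stop words are matched case-sensitively against the preprocessed tokens.
cased = CountVectorizer(lowercase=False, stop_words="english").fit(text)
lowered = CountVectorizer(lowercase=True, stop_words="english").fit(text)

# Possible workaround: add capitalized variants to a custom stop list.
extended = sorted(ENGLISH_STOP_WORDS | {w.capitalize() for w in ENGLISH_STOP_WORDS})
custom = CountVectorizer(lowercase=False, stop_words=extended).fit(text)

cased_vocab = sorted(cased.vocabulary_)
lowered_vocab = sorted(lowered.vocabulary_)
custom_vocab = sorted(custom.vocabulary_)
print(cased_vocab, lowered_vocab, custom_vocab)
```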

Re: [scikit-learn] Memory efficient TfidfVectorizer

2020-01-28 Thread Peng Yu
> Are you concerned about storing the whole corpus text in memory, or the
> whole corpus' statistics? If the text, use input='file' or input='filename'
> (or a generator of texts).

I am not really sure which stage takes the most memory, as my program gets killed when it hits the memory limit. But I
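The input='filename' option mentioned in the quoted reply can be sketched as follows. The vectorizer receives a list of paths and reads each file as it iterates, so the raw text of the whole corpus does not need to sit in a Python list. The vocabulary and the resulting sparse matrix still live in memory. The file names and contents below are made up for illustration.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: one small text file per document.
samples = [("id1.txt", "AA BB BB CC CC CC"), ("id2.txt", "AA AA AA")]

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in samples:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        paths.append(path)

    # With input="filename", fit_transform expects paths instead of strings.
    vectorizer = TfidfVectorizer(input="filename", norm=None)
    matrix = vectorizer.fit_transform(paths)

vocab = sorted(vectorizer.vocabulary_)
print(matrix.shape, vocab)
```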

[scikit-learn] Memory efficient TfidfVectorizer

2020-01-27 Thread Peng Yu
Hi, To use TfidfVectorizer, the whole corpus must be loaded into memory. This can be a problem for machines without a lot of memory. Is there a way to use only a small amount of memory by saving most intermediate results on disk? Thanks. -- Regards, Peng

[scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Peng Yu
Hi, I don't see what stopwords are used by CountVectorizer with stop_words='english'. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html Is there a way to figure it out? Thanks. -- Regards, Peng
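One way to inspect the list is the vectorizer's get_stop_words() method, which returns the effective stop word set. For stop_words='english' this is the ENGLISH_STOP_WORDS frozenset exported by the same module. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

vectorizer = CountVectorizer(stop_words="english")

# The effective stop word set; no fitting is needed to read it.
stop_words = vectorizer.get_stop_words()

print(len(stop_words))
print(sorted(stop_words)[:10])
```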

[scikit-learn] a dataset suitable for logistic regression

2017-12-03 Thread Peng Yu
Hi, iris is a three-class dataset. Is there a dataset in sklearn that is suitable for binary classification? Thanks. -- Regards, Peng
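One candidate is the breast cancer dataset bundled with scikit-learn (load_breast_cancer), which has two classes. The sketch below fits a logistic regression on it. The train/test split, the scaling step and the random_state are choices made for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 569 samples, 30 numeric features, labels in {0, 1}.
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling first helps the default solver converge.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(X.shape, sorted(set(y)), accuracy)
```

For synthetic data, make_classification from sklearn.datasets generates a two-class problem by default.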