Ani, I really don't understand your second point.
Here is how I view things; if you can phrase your question in these terms, it might help me understand it.

The TF part of TF-IDF refers to the term frequencies in a document. Typically, each possible word is assigned a positive integer that represents a position in a vector. A term frequency vector is a sparse vector with counts (or functions of counts) at the positions corresponding to the words in a document. If the document contains words that do not have assigned positions in the vector, they are either ignored or their counts are added to a special "UNKNOWN-WORD" position. By definition, there is no way for the term frequency vector to be too long or too short. Likewise, a document's length only matters if the counts grow too large to store (completely implausible, since we use a double).

The IDF part of TF-IDF refers to weights that are applied to these TF vectors. These weights are conventionally computed as the log of the total number of documents divided by the number of documents that contain the corresponding word. The IDF weighting has one weight for each position in the term frequency vector, so length is again not a problem.

This is why I don't understand your second point. Do you mean that many of the words in the document do not have assigned positions in the term frequency vector? If so, that means you didn't analyze the corpus ahead of time to build a good dictionary of word positions. Or are you worried that the counts would be large?

On Tue, Dec 3, 2013 at 7:03 AM, Ani Tumanyan <[email protected]> wrote:

> Hello everyone,
>
> I'm working on a project, where I'm trying to extract topics from news
> articles. I have around 500,000 articles as a dataset. Here are the steps
> that I'm following:
>
> 1. First of all I'm doing some sort of preprocessing. For this I'm using
> Behemoth to annotate the document and get rid of non-English documents,
> 2. Then I'm running Mahout's sparse vector command to generate TF-IDF
> vectors.
> The problem with TF-IDF vector is that the number of words for a
> document is far more than the number of words in TF vectors. Moreover there
> are some words/terms in TF-IDF vector that didn't appear in that specific
> document anyway. Is this a correct behaviour or there is something wrong
> with my approach?
>
> Thanks in advance!
>
> Ani
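For concreteness, here is a minimal Python sketch of the scheme I described above: a fixed dictionary of word positions, sparse TF vectors keyed by position, and one IDF weight per position. The dictionary, corpus, and function names here are my own invention for illustration, not Mahout's API.

```python
import math
from collections import Counter

def build_dictionary(corpus):
    """Analyze the corpus ahead of time: assign each word a fixed position."""
    words = sorted({w for doc in corpus for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def term_frequencies(doc, dictionary, unknown_pos=None):
    """Sparse TF vector: position -> count. Words with no assigned position
    are ignored, or folded into a special UNKNOWN-WORD position if given."""
    tf = Counter()
    for w in doc.split():
        pos = dictionary.get(w, unknown_pos)
        if pos is not None:
            tf[pos] += 1
    return dict(tf)

def idf_weights(corpus, dictionary):
    """One weight per position: log(N / number of documents with that word)."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        for w in set(doc.split()):
            if w in dictionary:
                df[dictionary[w]] += 1
    return {pos: math.log(n / count) for pos, count in df.items()}

def tfidf(doc, dictionary, idf):
    """Weight each TF entry by its IDF; the vector can never be the wrong length."""
    tf = term_frequencies(doc, dictionary)
    return {pos: count * idf.get(pos, 0.0) for pos, count in tf.items()}

corpus = ["the cat sat", "the dog sat", "the cat ran"]
dictionary = build_dictionary(corpus)
idf = idf_weights(corpus, dictionary)
vec = tfidf("the cat sat on the mat", dictionary, idf)
# "on" and "mat" have no assigned positions, so they are simply ignored;
# "the" occurs in every document, so its IDF weight is log(3/3) = 0.
```

Note that the TF-IDF vector only ever has entries at dictionary positions for words that actually occur in the document, which is why I don't see how it could contain extra terms.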
