It should be smaller, but this seems a bit too small. Can you dump the intermediate data containing the untokenized words? What about the words after tokenization?
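
If it helps, here is a rough sketch (untested, the path is just a placeholder) of a small Hadoop program that prints whatever key/value pairs a sequence file contains, so you can eyeball both the raw documents and the tokenized output:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SeqDump {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Point this at one of your part-* files, e.g. the tokenized-documents
        // output of the vectorization job (path name is an assumption).
        Path path = new Path(args[0]);
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          // Instantiate key/value objects of whatever types the file declares.
          Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
          while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
          }
        } finally {
          reader.close();
        }
      }
    }

Running that over the pre-tokenization sequence file and over the tokenized output should make it obvious where the content is getting lost, if it is.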
On Wed, Dec 21, 2011 at 11:45 AM, Isabel Drost <[email protected]> wrote:
> On 21.12.2011 Periya.Data wrote:
> > Though the sequence-file is large, the vector file is relatively small (163
> > bytes). Is this expected?
>
> Upon vectorizing, your text is split into tokens; for each token the vector will
> contain its id and an entry indicating how often that token was found in your
> document (weighted by the overall number of documents in your input that contain
> that token). So yes, the resulting vector should be smaller than the original
> document.
>
> Vectors are stored as Sequence Files - you should be able to just print their
> content and see whether that makes sense.
>
> Isabel
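
For the vector file specifically, something along these lines should print the term-id/weight pairs Isabel describes. Again just a sketch: it assumes the keys are Text document ids and the values are VectorWritable, as the standard text-to-vector pipeline produces; adjust if your key/value classes differ.

    import java.util.Iterator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class PrintVectors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // e.g. a part-* file under your vector output directory (path is an assumption).
        Path path = new Path(args[0]);
        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        Text docId = new Text();
        VectorWritable vw = new VectorWritable();
        while (reader.next(docId, vw)) {
          Vector v = vw.get();
          System.out.println(docId + " (" + v.getNumNondefaultElements() + " non-zero entries)");
          // Each non-zero element is one token: its dictionary id and its weight.
          Iterator<Vector.Element> it = v.iterateNonZero();
          while (it.hasNext()) {
            Vector.Element e = it.next();
            System.out.println("  term id " + e.index() + " -> weight " + e.get());
          }
        }
        reader.close();
      }
    }

If the number of non-zero entries per document looks far smaller than the number of distinct tokens you expect, that would point at the tokenization or the dictionary step rather than the vector encoding itself.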
