On 21.12.2011 Periya.Data wrote:
> Though the sequence-file is large, the vector file is relatively small (163
> bytes). Is this expected?

During vectorization your text is split into tokens. For each token, the vector contains its ID and an entry indicating how often that token was found in your document (weighted by the overall number of documents in your input that contain that token). So yes, the resulting vector should be smaller than the original document.

Vectors are stored as Sequence Files, so you should be able to just print their content and see whether that makes sense.

Isabel
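
To illustrate the vectorization described above, here is a minimal Python sketch. It is not Mahout's implementation: the whitespace tokenizer, the function names, and the plain tf * log(N / df) weighting are all assumptions chosen for illustration, and Mahout's actual tokenization and weighting may differ. It only shows why each document ends up as a compact map from token ID to weight.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def build_dictionary(docs):
    # Assign each distinct token an integer ID in order of first appearance.
    dictionary = {}
    for doc in docs:
        for token in tokenize(doc):
            dictionary.setdefault(token, len(dictionary))
    return dictionary


def document_frequencies(docs):
    # Count, for each token, the number of documents that contain it.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def vectorize(doc, dictionary, df, num_docs):
    # Sparse vector: token ID -> term count weighted by log(N / df).
    # Assumption: this plain TF-IDF formula is illustrative only.
    counts = Counter(tokenize(doc))
    return {
        dictionary[token]: tf * math.log(num_docs / df[token])
        for token, tf in counts.items()
    }
```

Each distinct token contributes one (ID, weight) pair no matter how often it occurs, so a document with many repeated words produces a vector with far fewer entries than the text has words.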
