On 21.12.2011 Periya.Data wrote:
> Though the sequence-file is large, the vector file is relatively small (163
> bytes). Is this expected?

When you vectorize your text, it is split into tokens. For each token, the
vector contains its id and an entry indicating how often that token was found
in your document (weighted by the overall number of documents in your input
that contain that token). So yes, the resulting vector should be smaller than
the original document.
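
To make the idea above concrete, here is a minimal sketch in plain Python. It is not Mahout code: the helper names (build_dictionary, tfidf_vector), the whitespace tokenizer, and the tf * log(N / df) weighting are all assumptions chosen for illustration, and the weighting Mahout actually applies may differ.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}

def tfidf_vector(doc, docs, dictionary):
    """Return a sparse vector {token_id: weight} for one document.

    Assumed weighting: term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    counts = Counter(doc.lower().split())
    vector = {}
    for tok, tf in counts.items():
        # Number of documents in the input that contain this token.
        df = sum(1 for d in docs if tok in d.lower().split())
        vector[dictionary[tok]] = tf * math.log(n_docs / df)
    return vector

docs = ["the cat sat on the mat", "the dog sat", "the bird flew"]
dictionary = build_dictionary(docs)
vec = tfidf_vector(docs[0], docs, dictionary)
print(vec)
```

Only one (id, weight) pair is kept per distinct token, so a document with many repeated words shrinks considerably once vectorized.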

Vectors are stored as Sequence Files - you should be able to just print their 
content and see whether that makes sense.

Isabel
