Re: vectorized output from a sequence file

Isabel Drost Wed, 21 Dec 2011 11:45:54 -0800

On 21.12.2011 Periya.Data wrote:
> Though the sequence-file is large, the vector file is relatively small (163
> bytes). Is this expected?


When your text is vectorized it is split into tokens. For each token, the
vector contains its id and an entry indicating how often that token was found
in your document (weighted by the overall number of documents in your input
that contain that token). So yes, the resulting vector should be smaller than
the original document.
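
To make the idea concrete, here is a rough sketch in plain Python of what
"token id plus weighted count" looks like. It is only an illustration, not
Mahout's actual code: the whitespace tokenizer, the function names and the
exact weighting formula (count * (log(N / df) + 1)) are all assumptions
chosen for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def vectorize(docs):
    """Return (dictionary, vectors) for a list of document strings.

    dictionary maps token -> integer id; each vector is a sparse
    dict mapping token id -> weight.
    """
    tokenized = [tokenize(d) for d in docs]

    # Assign each distinct token an id in order of first appearance.
    dictionary = {}
    for tokens in tokenized:
        for tok in tokens:
            dictionary.setdefault(tok, len(dictionary))

    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: term count scaled by (log(N / df) + 1).
        vectors.append({
            dictionary[tok]: count * (math.log(n_docs / df[tok]) + 1.0)
            for tok, count in tf.items()
        })
    return dictionary, vectors


docs = ["the cat sat on the mat", "the dog sat"]
dictionary, vectors = vectorize(docs)
for vec in vectors:
    print(sorted(vec.items()))
```

Each document ends up with one entry per distinct token rather than one per
word, which is why the vector is more compact than the text it came from.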

Vectors are stored as Sequence Files - you should be able to just print their 
content and see whether that makes sense.

Isabel
