It should be smaller, but this seems a bit too small.

Can you dump the intermediate data containing the untokenized words? What
about the words after tokenization?
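
Something along these lines should work for eyeballing any of the
intermediate sequence files (a rough sketch against the Hadoop 0.20-style
SequenceFile.Reader API; the path argument is just a placeholder for
whichever part file you want to look at):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    /** Dumps the key/value pairs of a sequence file so the raw and
        tokenized text can be inspected by hand. */
    public class SeqFileDump {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);   // placeholder: a part file to inspect
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          // Instantiate whatever key/value types the file declares.
          Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
          while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
          }
        } finally {
          reader.close();
        }
      }
    }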

On Wed, Dec 21, 2011 at 11:45 AM, Isabel Drost <[email protected]> wrote:

> On 21.12.2011 Periya.Data wrote:
> > Though the sequence-file is large, the vector file is relatively small
> > (163 bytes). Is this expected?
>
> Upon vectorizing, your text is split into tokens; for each token the
> vector will contain its id and an entry indicating how often that token
> was found in your document (weighted by the overall number of documents
> in your input that contain that token). So yes, the resulting vector
> should be smaller than the original document.
>
> Vectors are stored as Sequence Files - you should be able to just print
> their content and see whether that makes sense.
>
> Isabel
>
>
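
To print the vectors themselves, a sketch like the following should do,
assuming the usual seq2sparse output layout where the vector files hold
Text keys (document names) and VectorWritable values - adjust the types if
your run produces something different. Each non-zero element you see is a
token id paired with its (weighted) count, which is why the file is so much
smaller than the original text:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    /** Prints each document vector: every non-zero element is a term id
        and its weight. */
    public class VectorDump {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);   // placeholder: a vector part file
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          Text key = new Text();                     // document name
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {
            Vector v = value.get();
            System.out.println(key + " (" + v.getNumNondefaultElements()
                + " terms): " + v);
          }
        } finally {
          reader.close();
        }
      }
    }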
