Thanks. When I look into tfidf-vectors/part-r-00000, it is around 163 bytes. That is what concerns me, and I do not know if that is reasonable. The dictionary.file-0 and frequency.file-0 seem to be OK.
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/
Found 7 items
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:27 /input/vectorized/df-count
-rw-r--r--   1 pd supergroup 226062 2011-12-20 17:26 /input/vectorized/dictionary.file-0
-rw-r--r--   1 pd supergroup 187713 2011-12-20 17:27 /input/vectorized/frequency.file-0
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:26 /input/vectorized/tf-vectors
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:27 /input/vectorized/tfidf-vectors
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:25 /input/vectorized/tokenized-documents
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:26 /input/vectorized/wordcount
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/tfidf-vectors
Found 3 items
-rw-r--r--   1 pd supergroup      0 2011-12-20 17:27 /input/vectorized/tfidf-vectors/_SUCCESS
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:27 /input/vectorized/tfidf-vectors/_logs
-rw-r--r--   1 pd supergroup    163 2011-12-20 17:27 /input/vectorized/tfidf-vectors/part-r-00000
pd@PeriyaData:~$

Thanks,
/PD.

On Wed, Dec 21, 2011 at 11:45 AM, Isabel Drost <[email protected]> wrote:
> On 21.12.2011 Periya.Data wrote:
> > Though the sequence-file is large, the vector file is relatively small (163
> > bytes). Is this expected?
>
> Upon vectorizing, your text is split into tokens; for each token the vector
> will contain its id and an entry indicating how often that token was found in
> your document (weighted by the overall number of documents in your input that
> contain that token). So yes, the resulting vector should be smaller than the
> original document.
>
> Vectors are stored as Sequence Files - you should be able to just print their
> content and see whether that makes sense.
>
> Isabel
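To make the weighting Isabel describes concrete, here is a minimal sketch of the tf-idf idea in plain Python, using a made-up three-document corpus. This is only an illustration of the general scheme (term frequency scaled down by document frequency); Mahout's actual implementation can differ in details such as sublinear TF or normalization, so don't expect byte-for-byte identical weights.

```python
import math

# Hypothetical mini-corpus standing in for the tokenized documents;
# the words and counts here are invented for illustration only.
docs = [
    ["mahout", "vector", "tfidf"],
    ["mahout", "hadoop"],
    ["hadoop", "cluster", "vector"],
]

num_docs = len(docs)

# df-count: how many documents contain each token (Mahout stores
# this in the df-count directory seen in the listing above).
df = {}
for doc in docs:
    for token in set(doc):
        df[token] = df.get(token, 0) + 1

def tfidf(doc):
    """Weight each token of one document by tf * log(N / df)."""
    weights = {}
    for token in doc:
        tf = doc.count(token)                      # raw term frequency
        idf = math.log(num_docs / df[token])       # rarer token => larger idf
        weights[token] = tf * idf
    return weights

print(tfidf(docs[0]))
```

A token unique to one document (like "tfidf" here) gets the largest weight, while common tokens are damped toward zero; either way, each document collapses to a sparse id-to-weight map, which is why the vector file is far smaller than the raw text. For inspecting the real output, Mahout also ships a sequence-file dumper utility (`mahout seqdumper`; the input flag varies by version) that prints the vectors directly.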
