Thanks. When I look into tfidf-vectors/part-r-00000, it is around 163 bytes. That is what concerns me, and I do not know if that is reasonable. The dictionary.file-0 and frequency.file-0 seem to be OK.
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/
Found 7 items
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:27 /input/vectorized/df-count
-rw-r--r--   1 pd supergroup 226062 2011-12-20 17:26 /input/vectorized/dictionary.file-0
-rw-r--r--   1 pd supergroup 187713 2011-12-20 17:27 /input/vectorized/frequency.file-0
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:26 /input/vectorized/tf-vectors
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:27 /input/vectorized/tfidf-vectors
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:25 /input/vectorized/tokenized-documents
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:26 /input/vectorized/wordcount
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/tfidf-vectors
Found 3 items
-rw-r--r--   1 pd supergroup      0 2011-12-20 17:27 /input/vectorized/tfidf-vectors/_SUCCESS
drwxr-xr-x   - pd supergroup      0 2011-12-20 17:27 /input/vectorized/tfidf-vectors/_logs
-rw-r--r--   1 pd supergroup    163 2011-12-20 17:27 /input/vectorized/tfidf-vectors/part-r-00000
pd@PeriyaData:~$

Thanks,
/PD.

On Wed, Dec 21, 2011 at 11:45 AM, Isabel Drost <[email protected]> wrote:
> On 21.12.2011 Periya.Data wrote:
> > Though the sequence-file is large, the vector file is relatively small (163
> > bytes). Is this expected?
>
> Upon vectorizing, your text is split into tokens; for each token the vector
> will contain its id and an entry indicating how often that token was found in
> your document (weighted by the overall number of documents in your input that
> contain that token). So yes, the resulting vector should be smaller than the
> original document.
>
> Vectors are stored as Sequence Files - you should be able to just print their
> content and see whether that makes sense.
>
> Isabel
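To make the weighting Isabel describes concrete, here is a minimal sketch of the tf-idf idea in plain Python, using a made-up three-document corpus. This is only an illustration of the general scheme (term frequency scaled down by document frequency); Mahout's actual implementation can differ in details such as sublinear TF or normalization, so don't expect byte-for-byte identical weights.

```python
import math

# Hypothetical mini-corpus standing in for the tokenized documents;
# the words and counts here are invented for illustration only.
docs = [
    ["mahout", "vector", "tfidf"],
    ["mahout", "hadoop"],
    ["hadoop", "cluster", "vector"],
]

num_docs = len(docs)

# df-count: how many documents contain each token (Mahout stores
# this in the df-count directory seen in the listing above).
df = {}
for doc in docs:
    for token in set(doc):
        df[token] = df.get(token, 0) + 1

def tfidf(doc):
    """Weight each token of one document by tf * log(N / df)."""
    weights = {}
    for token in doc:
        tf = doc.count(token)                      # raw term frequency
        idf = math.log(num_docs / df[token])       # rarer token => larger idf
        weights[token] = tf * idf
    return weights

print(tfidf(docs[0]))
```

A token unique to one document (like "tfidf" here) gets the largest weight, while common tokens are damped toward zero; either way, each document collapses to a sparse id-to-weight map, which is why the vector file is far smaller than the raw text. For inspecting the real output, Mahout also ships a sequence-file dumper utility (`mahout seqdumper`; the input flag varies by version) that prints the vectors directly.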
