I am doing a trial run starting with a sequence file that contains the
following (this is seqdumper output; I just made each key the same as its value):

Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.Text
Key: first book nature specialword boxes: Value: first book nature specialword 
boxes
Key: fourth fake example with fake: Value: fourth fake example with fake
Key: second book fun: Value: second book fun
Key: third unique document item: Value: third unique document item
Key: fifth bag of words: Value: fifth bag of words
Count: 5


When I run
mahout seq2sparse -i /user/trial_01252012/processed_doc_trial/ -o 
/khuang/trial_01252012/keyword_Vectors_461_named -ow -md 1 -a 
org.apache.lucene.analysis.WhitespaceAnalyzer -wt tf -seq -nv

And then dump the resulting tf-vectors:
mahout seqdumper -s /user/trial_01252012/vec_named/tf-vectors/part-r-00000

I only get three of my five original documents:

Input Path: /user/khuang/trial_01252012/vec_named/tf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.mahout.math.VectorWritable
Key: first book nature specialword boxes: Value: 
org.apache.mahout.math.VectorWritable@e5d391d
Key: fourth fake example with fake: Value: 
org.apache.mahout.math.VectorWritable@e5d391d
Key: second book fun: Value: org.apache.mahout.math.VectorWritable@e5d391d
Count: 3


In addition, the dictionary is missing words. Is there a reason for this?
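
To narrow this down, here is a minimal sketch in plain Python (not Mahout code) that counts whitespace-separated terms across the five documents above and drops any term whose total count is below a threshold. The function name `surviving_docs` and the threshold of 2 are assumptions for illustration only. If seq2sparse applies a minimum-support cutoff of this kind (its -s / --minSupport option), that would be consistent with what I see, but I have not confirmed that this is the cause.

```python
from collections import Counter

# The five documents from the seqdumper output above.
docs = [
    "first book nature specialword boxes",
    "fourth fake example with fake",
    "second book fun",
    "third unique document item",
    "fifth bag of words",
]


def surviving_docs(docs, min_support):
    """Drop terms whose total corpus count is below min_support.

    Returns the set of kept terms and the list of documents that still
    contain at least one kept term. Hypothetical helper, not a Mahout API.
    """
    # Whitespace tokenization, mirroring the WhitespaceAnalyzer choice.
    counts = Counter(tok for d in docs for tok in d.split())
    kept_terms = {t for t, c in counts.items() if c >= min_support}
    kept_docs = [d for d in docs if any(t in kept_terms for t in d.split())]
    return kept_terms, kept_docs


terms, kept = surviving_docs(docs, min_support=2)
print(sorted(terms))  # terms that would stay in the dictionary
print(len(kept))      # documents left with a non-empty vector
```

With a threshold of 2, only "book" and "fake" survive and exactly three documents (first, fourth, second) keep a non-empty vector, which matches the three keys in my dump. With a threshold of 1, all five documents remain.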


