I am doing a trial run starting with a sequence file that contains the following (this is seqdumper output; I just made each key the same as its value):

    Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
    Key: first book nature specialword boxes: Value: first book nature specialword boxes
    Key: fourth fake example with fake: Value: fourth fake example with fake
    Key: second book fun: Value: second book fun
    Key: third unique document item: Value: third unique document item
    Key: fifth bag of words: Value: fifth bag of words
    Count: 5

I then run:

    mahout seq2sparse -i /user/trial_01252012/processed_doc_trial/ -o /khuang/trial_01252012/keyword_Vectors_461_named -ow -md 1 -a org.apache.lucene.analysis.WhitespaceAnalyzer -wt tf -seq -nv

and dump the resulting tf-vectors:

    mahout seqdumper -s /user/trial_01252012/vec_named/tf-vectors/part-r-00000

Only three of my original documents show up:

    Input Path: /user/khuang/trial_01252012/vec_named/tf-vectors/part-r-00000
    Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
    Key: first book nature specialword boxes: Value: org.apache.mahout.math.VectorWritable@e5d391d
    Key: fourth fake example with fake: Value: org.apache.mahout.math.VectorWritable@e5d391d
    Key: second book fun: Value: org.apache.mahout.math.VectorWritable@e5d391d
    Count: 3

In addition, the dictionary is missing words. Is there a reason for this?
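
To pin down exactly which documents were dropped, the two dumps can be compared by key. The following is only a minimal sketch, assuming each seqdumper output has been saved as text with one `Key: ...: Value: ...` record per line; the helper names `dumped_keys` and `missing_keys` are hypothetical and not part of Mahout.

```python
import re

# Matches one record line of seqdumper's text output, e.g.
#   "Key: second book fun: Value: second book fun"
# The header line starts with "Key class:" and therefore does not match.
KEY_LINE = re.compile(r"^\s*Key: (.*?): Value: ")


def dumped_keys(dump_text):
    """Return the set of keys found in seqdumper text output."""
    keys = set()
    for line in dump_text.splitlines():
        match = KEY_LINE.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_keys(input_dump, output_dump):
    """Return, sorted, the keys present in the input dump but absent from the output dump."""
    return sorted(dumped_keys(input_dump) - dumped_keys(output_dump))
```

Calling `missing_keys` on the text of the input dump and the tf-vectors dump gives the keys that did not make it through `seq2sparse`. It only identifies the dropped documents; it does not explain why they were dropped.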
