________________________________
From: Katherine Huang <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wednesday, January 25, 2012 9:52 PM
Subject: seq2sparse generated dictionary is missing words
I am doing a trial run starting with a sequence file that contains the
following (this is seqdumper output; I just made each key the same as its value):
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.Text
Key: first book nature specialword boxes: Value: first book nature specialword
boxes
Key: fourth fake example with fake: Value: fourth fake example with fake
Key: second book fun: Value: second book fun
Key: third unique document item: Value: third unique document item
Key: fifth bag of words: Value: fifth bag of words
Count: 5
When I run
mahout seq2sparse -i /user/trial_01252012/processed_doc_trial/ -o
/khuang/trial_01252012/keyword_Vectors_461_named -ow -md 1 -a
org.apache.lucene.analysis.WhitespaceAnalyzer -wt tf -seq -nv
And then dump the tokenized vectors:
mahout seqdumper -s /user/trial_01252012/vec_named/tf-vectors/part-r-00000
Did you mean to call vectordump to dump your vectors?
I only get three of my original documents:
Input Path: /user/khuang/trial_01252012/vec_named/tf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: first book nature specialword boxes: Value:
org.apache.mahout.math.VectorWritable@e5d391d
Key: fourth fake example with fake: Value:
org.apache.mahout.math.VectorWritable@e5d391d
Key: second book fun: Value: org.apache.mahout.math.VectorWritable@e5d391d
Count: 3
In addition, the dictionary is missing words. Is there a reason for this?
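For comparison, here is a minimal sketch of the vocabulary I would expect from these five documents. It is plain Python, not Mahout code, and it assumes the WhitespaceAnalyzer does nothing more than split on whitespace. The names `docs`, `term_counts`, and `vocabulary` are only illustrative.

```python
from collections import Counter

# The five document values from the seqdumper output above.
docs = [
    "first book nature specialword boxes",
    "fourth fake example with fake",
    "second book fun",
    "third unique document item",
    "fifth bag of words",
]


def term_counts(documents):
    """Count how often each whitespace-separated term occurs across all documents."""
    counts = Counter()
    for text in documents:
        # Assumption: whitespace splitting approximates WhitespaceAnalyzer.
        counts.update(text.split())
    return counts


counts = term_counts(docs)
vocabulary = sorted(counts)  # the terms I would expect to find in the dictionary

if __name__ == "__main__":
    print(len(vocabulary), "distinct terms")
    for term in vocabulary:
        print(term, counts[term])
```

Comparing this list against the generated dictionary file should show exactly which terms were dropped, and the per-term counts show how often each one occurs in the collection.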