OK, your earlier post was about synthetic control, and this clearly isn't
that. When you run seq2sparse with the TFIDF option (-wt TFIDF), the output
vectors are written to <output>/tfidf/vectors/, not <output> or even
<output>/vectors/. I suggest you look at examples/bin/build-reuters.sh.
There you will see that the output file spec of seq2sparse is:
-o ./examples/bin/work/reuters-out-seqdir-sparse
... and that the input file spec of kmeans follows the pattern above:
-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf/vectors/
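Applied to the commands quoted below, the same pattern would look roughly like this. It is only a sketch: the variable names are made up for illustration, and the output directory is the one given to seq2sparse with -o in the quoted message.

```shell
# Output directory passed to seq2sparse via -o (copied from the quoted command).
SEQ2SPARSE_OUT=/home/w007dhc/data/trecdata-vectors

# With -wt TFIDF the vectors end up under tfidf/vectors/,
# so this is the directory to hand to kmeans with -i.
KMEANS_IN="$SEQ2SPARSE_OUT/tfidf/vectors/"

echo "kmeans input: $KMEANS_IN"
```

If that directory is empty or missing, the seq2sparse step itself is the first thing to check.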
On 5/21/10 4:38 PM, Delroy Cameron wrote:
yeah sorry Jeff,
i neglected to say that i am trying to cluster a set of 1400 text documents
from a directory, and i'm not using the synthetic dataset. the input data,
i.e. data/trecdata, is a directory of raw text files.
i'll run the clustering on the synthetic dataset to see if there is
something wrong with the input vectors.
here are the commands i used to create the vectors:
./mahout seqdirectory \
  -i /data/trecdata \
  -o /data/trecdata-seqfiles \
  -c ascii \
  -chunk 64 \
  -prefix TREC
and then to create the sparse matrix:
./mahout seq2sparse \
  -s 2 \
  -a org.apache.lucene.analysis.standard.StandardAnalyzer \
  -chunk 100 \
  -i /home/w007dhc/data/trecdata-seqfiles/chunk-0 \
  -o /home/w007dhc/data/trecdata-vectors \
  -md 1 -x 75 -wt TFIDF -n 0 -w
-----
--cheers
Delroy