Hi Matt,
From your command arguments, it looks like you are running 0.3. Because
Mahout is changing so quickly, we recommend that you check out trunk and
use that instead. With a little tweaking (adding --charset ASCII to
seqdirectory) I was able to get as far as you did on trunk, but
seq2sparse is not what you want to use here.
The utilities you are using are intended for text preprocessing: they
count the words in each document, write the counts to term-vector
sequence files, and then run TF and/or TF-IDF processing on the results
to produce VectorWritable sequence files suitable for clustering. For
your problem, I suggest you instead look at the Synthetic Control
clustering examples, starting with Canopy. These use an InputDriver to
process text files containing space-delimited numbers, like your
data.dat file, and produce the VectorWritable sequence files directly.
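
To make the idea concrete, here is a minimal sketch in plain Python of what "space-delimited numbers to vectors" means for a file like data.dat. It is not Mahout's InputDriver and does not write sequence files; the function names parse_line and parse_points are hypothetical and exist only for this illustration.

```python
def parse_line(line):
    """Turn one line of whitespace-delimited numbers into a list of floats."""
    return [float(token) for token in line.split()]


def parse_points(text):
    """Parse every non-blank line of the text into one numeric point."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


# The contents of data.dat from the original message.
DATA = """22 21
19 20
18 22
1 3
3 2
"""

points = parse_points(DATA)
for point in points:
    print(point)
```

Each input line becomes one numeric vector, which is the form a clustering job needs. A text pipeline such as seq2sparse instead treats each file as a single document made of word tokens.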
I was able to run this on your data using trunk and it produced 3
clusters. You should be able to run the other synthetic control jobs on
it too:
Command line:
./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
    -i data \
    -o output \
    -t1 3 \
    -t2 2 \
    -ow \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
Clusters output:
C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
    Weight: Point:
    1.0: [22.000, 21.000]
C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
    Weight: Point:
    1.0: [19.000, 20.000]
    1.0: [18.000, 22.000]
C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
    Weight: Point:
    1.0: [1.000, 3.000]
    1.0: [3.000, 2.000]
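
As a rough sanity check on that grouping, the sketch below computes the Euclidean distance between every pair of input points and lists the pairs closer than the T1 value of 3 used on the command line. This is only a distance check in plain Python, not Mahout's canopy implementation, and the helper names euclidean and close_pairs are hypothetical.

```python
import math
from itertools import combinations

# The five points from data.dat and the T1 threshold from the command line.
POINTS = [(22, 21), (19, 20), (18, 22), (1, 3), (3, 2)]
T1 = 3.0


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def close_pairs(points, threshold):
    """Return every pair of points whose distance is below the threshold."""
    return [(a, b) for a, b in combinations(points, 2)
            if euclidean(a, b) < threshold]


for a, b in close_pairs(POINTS, T1):
    print(a, b, round(euclidean(a, b), 3))
```

Only two pairs fall within T1: (19, 20) with (18, 22), and (1, 3) with (3, 2). The point (22, 21) is more than 3 away from every other point, which is consistent with it ending up alone in C-0.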
Good hunting,
Jeff
On 9/29/10 2:26 PM, Matt Tanquary wrote:
I was able to run the tutorials, etc. Now I would like to generate my
own small test.
I have created a data.dat file with these contents:
22 21
19 20
18 22
1 3
3 2
Then I ran: mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir
This created kmeans/seqdir/chunk-0 in my dfs with the following content:
¼/%
/data.dat22 21
19 20
18 22
1 3
3 2
Next I ran: mahout seq2sparse -i kmeans/seqdir -o kmeans/input
This generated several things in kmeans/input including the
'tfidf/vectors' folder. Inside the vectors folder I get: part-00000
which contains:
øÏân
/data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable
/data.dat@@
It does not seem to have the numeric data at this point.
I am hoping someone can shed some light on how I can get my datapoint
file into the proper vector format for running mahout kmeans.
Just fyi, when I run kmeans against that file (mahout kmeans -i
kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2
-w) I get:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
which tells me it was unable to find even 1 vector in the given input folder.
Thanks for any comments you provide.
-M@