Re: kmeans vectors

Jeff Eastman Wed, 29 Sep 2010 12:30:20 -0700

 Hi Matt,

From your command arguments, it looks like you are running 0.3. Due to the rate of change in Mahout, we recommend you check out trunk and use that instead. With a little tweaking (added a --charset ASCII on seqdirectory) I was able to get as far as you did on trunk, but seq2sparse is not what you want to use.

The utilities you are using are intended for text preprocessing: getting documents word-counted into term-vector sequence files, then running TF and/or TF-IDF processing on the results to produce VectorWritable sequence files suitable for clustering. For your problem, I suggest you instead look at the Synthetic Control clustering examples, starting with Canopy. These use an InputDriver to process text files containing space-delimited numbers, like your data.dat file, and produce the VectorWritable sequence files directly.
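
As a rough illustration of that last step, the parsing amounts to splitting each line on whitespace and converting the tokens to floating-point values, one vector per line. The Python sketch below is not Mahout's InputDriver (which writes Hadoop sequence files); parse_vectors is a hypothetical helper name, and the sample text is the data.dat contents quoted further down.

```python
def parse_vectors(text):
    """Turn whitespace-delimited numeric lines into a list of float vectors."""
    vectors = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:  # skip blank lines
            continue
        vectors.append([float(t) for t in tokens])
    return vectors


# Contents of data.dat as given in the original message.
DATA = """22 21
19 20
18 22
1 3
3 2
"""

vectors = parse_vectors(DATA)
print(vectors)
```

Each input line becomes one dense two-element vector, which is the shape of input a clustering job expects, as opposed to the term-frequency vectors that seq2sparse builds from text.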

I was able to run this on your data using trunk and it produced 3 clusters. You should be able to run the other synthetic control jobs on it too:


CommandLine:
./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
-i data \
-o output \
-t1 3 \
-t2 2 \
-ow \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

Clusters output:
C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
    Weight:  Point:
    1.0: [22.000, 21.000]
C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
    Weight:  Point:
    1.0: [19.000, 20.000]
    1.0: [18.000, 22.000]
C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
    Weight:  Point:
    1.0: [1.000, 3.000]
    1.0: [3.000, 2.000]
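
To give a feel for how T1 and T2 interact, here is a small Python sketch of canopy generation. It is not Mahout's code. It assumes a greedy pass in which a point joins every canopy whose seed is closer than T1 and starts a new canopy unless some seed is closer than T2, followed by a second identical pass over the centroids of the first-pass canopies. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(members):
    """Component-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))


def canopy_pass(points, t1, t2):
    """One greedy canopy pass; returns the centroid of each canopy."""
    canopies = []  # each entry: (seed point, list of member points)
    for p in points:
        strongly_bound = False
        for seed, members in canopies:
            d = euclidean(p, seed)
            if d < t1:
                members.append(p)      # loosely bound: joins this canopy
            if d < t2:
                strongly_bound = True  # close enough not to seed a new canopy
        if not strongly_bound:
            canopies.append((p, [p]))
    return [centroid(members) for _, members in canopies]


points = [(22.0, 21.0), (19.0, 20.0), (18.0, 22.0), (1.0, 3.0), (3.0, 2.0)]
first = canopy_pass(points, t1=3.0, t2=2.0)    # pass over the raw points
centers = canopy_pass(first, t1=3.0, t2=2.0)   # pass over first-pass centroids
print(centers)
```

Under these assumptions the first pass yields five canopies and the second merges them into three, with centers (22, 21), (18.25, 21.5) and (2.5, 2.25), the same centers as in the output above. Larger T2 values merge more aggressively; T1 only controls how many points contribute to each centroid.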


Good hunting,
Jeff

On 9/29/10 2:26 PM, Matt Tanquary wrote:

I was able to run the tutorials, etc. Now I would like to generate my
own small test.

I have created a data.dat file and put these contents:
22 21
19 20
18 22
1 3
3 2

Then I ran: mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir

This created kmeans/seqdir/chunk-0 in my dfs with the following content:
¼/%
         /data.dat22 21
19 20
18 22
1 3
3 2

Next I ran:  mahout seq2sparse -i kmeans/seqdir -o kmeans/input

This generated several things in kmeans/input including the
'tfidf/vectors' folder. Inside the vectors folder I get: part-00000
which contains:
øÏân
         /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable
      /data.dat@@

It does not seem to have the numeric data at this point.

I am hoping someone can shed some light on how I can get my datapoint
file into the proper vector format for running mahout kmeans.

Just fyi, when I run kmeans against that file (mahout kmeans -i
kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2
-w) I get:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index:
1, Size: 1
         at java.util.ArrayList.RangeCheck(ArrayList.java:547)

which tells me it was unable to find even 1 vector in the given input folder.

Thanks for any comments you provide.
-M@
