Don't think so. Try "mvn clean install" and let me know what happens.
On 9/30/10 12:48 PM, Matt Tanquary wrote:
Hi Jeff,

Thanks for your reply. I just got trunk and started the install. It ended with this error:

    Error loading supplemental data models: Cannot create file-based resource.
    org.codehaus.plexus.resource.loader.FileResourceCreationException: Cannot create file-based resource.

A lot built, so I went ahead and tried your command-line example, but got:

    ERROR: Could not find mahout-examples-*.job in /mnt/install/tools/mahout or
    /mnt/install/tools/mahout/examples/target, please run 'mvn install' to create the .job file

I retrieved trunk as follows:

    svn co http://svn.apache.org/repos/asf/mahout/trunk

Then I ran 'mvn install' in the trunk folder. Any issues with trunk today?

Thanks,
Matt

On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman <[email protected]> wrote:

Hi Matt,

From your command arguments, it looks like you are running 0.3. Due to the rate of change in Mahout, we recommend you check out trunk and use that instead. With a little tweaking (adding --charset ASCII on seqdirectory) I was able to get as far as you did on trunk, but seq2sparse is not what you want to use. The utilities you are using are intended for text preprocessing: they word-count documents into term-vector sequence files, and then TF and/or TF-IDF processing on those results produces VectorWritable sequence files suitable for clustering.

For your problem, I suggest you instead look at the Synthetic Control clustering examples, starting with Canopy. These use an InputDriver to process text files containing space-delimited numbers, like your data.dat file, and produce the VectorWritable sequence files directly. I was able to run this on your data using trunk and it produced 3 clusters.
You should be able to run the other synthetic control jobs on it too.

CommandLine:

    ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
      -i data \
      -o output \
      -t1 3 \
      -t2 2 \
      -ow \
      -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

Clusters output:

    C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
        Weight:  Point:
        1.0: [22.000, 21.000]
    C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
        Weight:  Point:
        1.0: [19.000, 20.000]
        1.0: [18.000, 22.000]
    C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
        Weight:  Point:
        1.0: [1.000, 3.000]
        1.0: [3.000, 2.000]

Good hunting,
Jeff

On 9/29/10 2:26 PM, Matt Tanquary wrote:

I was able to run the tutorials, etc. Now I would like to generate my own small test. I have created a data.dat file with these contents:

    22 21
    19 20
    18 22
    1 3
    3 2

Then I ran:

    mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir

This created kmeans/seqdir/chunk-0 in my dfs with the following content:

    ź/% /data.dat22 21 19 20 18 22 1 3 3 2

Next I ran:

    mahout seq2sparse -i kmeans/seqdir -o kmeans/input

This generated several things in kmeans/input, including the 'tfidf/vectors' folder. Inside the vectors folder I get part-00000, which contains:

    řĎân /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable /data.dat@@

It does not seem to have the numeric data at this point. I am hoping someone can shed some light on how I can get my datapoint file into the proper vector format for running Mahout kmeans.

Just FYI, when I run kmeans against that file:

    mahout kmeans -i kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2 -w

I get:

    Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)

which tells me it was unable to find even one vector in the given input folder.

Thanks for any comments you provide.
-M@
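As a quick sanity check on why those t1/t2 thresholds split the five data.dat points into three canopies, the pairwise Euclidean distances can be computed directly. This is a standalone Python sketch, independent of Mahout; the point labels A–E are mine, not anything Mahout produces:

```python
import math

# The five 2-D points from data.dat (labels A-E are just for reference)
points = {
    "A": (22, 21),
    "B": (19, 20),
    "C": (18, 22),
    "D": (1, 3),
    "E": (3, 2),
}

def euclidean(p, q):
    """Plain Euclidean distance, matching EuclideanDistanceMeasure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (19,20)-(18,22) and (1,3)-(3,2) are each ~2.24 apart, inside t1=3,
# so each pair can share a canopy; (22,21) is more than t1=3 from
# every other point, so it ends up alone -- hence three clusters.
print(euclidean(points["B"], points["C"]))  # ~2.236
print(euclidean(points["D"], points["E"]))  # ~2.236
print(euclidean(points["A"], points["B"]))  # ~3.162
```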
