Don't think so. Try "mvn clean install" and let me know what happens.
On 9/30/10 12:48 PM, Matt Tanquary wrote:
Hi Jeff,

Thanks for your reply. I just got trunk and started the install. It ended with this error:

    Error loading supplemental data models: Cannot create file-based resource.
    org.codehaus.plexus.resource.loader.FileResourceCreationException: Cannot create file-based resource.

A lot built, so I went ahead and tried your command-line example, but got:

    ERROR: Could not find mahout-examples-*.job in /mnt/install/tools/mahout or
    /mnt/install/tools/mahout/examples/target, please run 'mvn install' to create the .job file

I retrieved trunk as follows:

    svn co http://svn.apache.org/repos/asf/mahout/trunk

Then I ran 'mvn install' in the trunk folder. Any issues with trunk today?

Thanks,
Matt

On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman <[email protected]> wrote:

Hi Matt,

From your command arguments, it looks like you are running 0.3. Due to the rate of change in Mahout, we recommend you check out trunk and use that instead. With a little tweaking (adding --charset ASCII on seqdirectory) I was able to get as far as you did on trunk, but seq2sparse is not what you want to use. The utilities you are using are intended for text preprocessing: they word-count documents into term-vector sequence files, and then TF and/or TF-IDF processing on those results produces VectorWritable sequence files suitable for clustering.

For your problem, I suggest you instead look at the Synthetic Control clustering examples, starting with Canopy. These use an InputDriver to process text files containing space-delimited numbers, like your data.dat file, and produce the VectorWritable sequence files directly. I was able to run this on your data using trunk and it produced 3 clusters.
You should be able to run the other synthetic control jobs on it too.

CommandLine:

    ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
      -i data \
      -o output \
      -t1 3 \
      -t2 2 \
      -ow \
      -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

Clusters output:

    C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
        Weight:  Point:
        1.0: [22.000, 21.000]
    C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
        Weight:  Point:
        1.0: [19.000, 20.000]
        1.0: [18.000, 22.000]
    C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
        Weight:  Point:
        1.0: [1.000, 3.000]
        1.0: [3.000, 2.000]

Good hunting,
Jeff

On 9/29/10 2:26 PM, Matt Tanquary wrote:

I was able to run the tutorials, etc. Now I would like to generate my own small test. I have created a data.dat file with these contents:

    22 21
    19 20
    18 22
    1 3
    3 2

Then I ran:

    mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir

This created kmeans/seqdir/chunk-0 in my dfs with the following content:

    ź/% /data.dat22 21 19 20 18 22 1 3 3 2

Next I ran:

    mahout seq2sparse -i kmeans/seqdir -o kmeans/input

This generated several things in kmeans/input, including the 'tfidf/vectors' folder. Inside the vectors folder I get part-00000, which contains:

    řĎân /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable /data.dat@@

It does not seem to have the numeric data at this point. I am hoping someone can shed some light on how I can get my datapoint file into the proper vector format for running Mahout kmeans.

Just FYI, when I run kmeans against that file:

    mahout kmeans -i kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2 -w

I get:

    Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)

which tells me it was unable to find even one vector in the given input folder.

Thanks for any comments you provide.
-M@
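As a quick sanity check on why those t1/t2 thresholds split the five data.dat points into three canopies, the pairwise Euclidean distances can be computed directly. This is a standalone Python sketch, independent of Mahout; the point labels A–E are mine, not anything Mahout produces:

```python
import math

# The five 2-D points from data.dat (labels A-E are just for reference)
points = {
    "A": (22, 21),
    "B": (19, 20),
    "C": (18, 22),
    "D": (1, 3),
    "E": (3, 2),
}

def euclidean(p, q):
    """Plain Euclidean distance, matching EuclideanDistanceMeasure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (19,20)-(18,22) and (1,3)-(3,2) are each ~2.24 apart, inside t1=3,
# so each pair can share a canopy; (22,21) is more than t1=3 from
# every other point, so it ends up alone -- hence three clusters.
print(euclidean(points["B"], points["C"]))  # ~2.236
print(euclidean(points["D"], points["E"]))  # ~2.236
print(euclidean(points["A"], points["B"]))  # ~3.162
```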
