Thanks, it was a permission issue. I had to change the group owner to the
current user's group, and it's now building. I had moved the build from one
server to another, which caused the user-sync problem.
2010/9/30 Jeff Eastman <[email protected]>:
> Don't think so. Try "mvn clean install" and let me know what happens.
>
> On 9/30/10 12:48 PM, Matt Tanquary wrote:
>>
>> Hi Jeff,
>>
>> Thanks for your reply. I just got trunk and started the install. It
>> ended with this error:
>>
>> Error loading supplemental data models: Cannot create file-based resource.
>> org.codehaus.plexus.resource.loader.FileResourceCreationException:
>> Cannot create file-based resource.
>>
>> A lot built, so I went ahead and tried your command-line example, but got:
>>
>> ERROR: Could not find mahout-examples-*.job in
>> /mnt/install/tools/mahout or
>> /mnt/install/tools/mahout/examples/target, please run 'mvn install' to
>> create the .job file
>>
>> I retrieved trunk as follows:
>> svn co http://svn.apache.org/repos/asf/mahout/trunk
>>
>> Then I ran 'mvn install' in the trunk folder.
>>
>> Any issues with trunk today?
>>
>> Thanks,
>> Matt
>>
>> On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman
>> <[email protected]> wrote:
>>>
>>> Hi Matt,
>>>
>>> From your command arguments, it looks like you are running 0.3. Due to
>>> the rate of change in Mahout, we recommend you check out trunk and use
>>> that instead. With a little tweaking (adding --charset ASCII on
>>> seqdirectory) I was able to get as far as you did on trunk, but
>>> seq2sparse is not what you want to use.
>>>
>>> The utilities you are using are intended for text preprocessing:
>>> word-counting documents into term-vector sequence files, then running
>>> TF and/or TF-IDF processing on the results to produce VectorWritable
>>> sequence files suitable for clustering. For your problem, I suggest you
>>> instead look at the Synthetic Control clustering examples, starting
>>> with Canopy. These use an InputDriver to process text files containing
>>> space-delimited numbers like your data.dat file and produce the
>>> VectorWritable sequence files directly.
>>>
>>> I was able to run this on your data using trunk and it produced 3
>>> clusters. You should be able to run the other synthetic control jobs
>>> on it too:
>>>
>>> Command line:
>>> ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
>>>   -i data \
>>>   -o output \
>>>   -t1 3 \
>>>   -t2 2 \
>>>   -ow \
>>>   -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
>>>
>>> Clusters output:
>>> C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
>>>   Weight:  Point:
>>>   1.0: [22.000, 21.000]
>>> C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
>>>   Weight:  Point:
>>>   1.0: [19.000, 20.000]
>>>   1.0: [18.000, 22.000]
>>> C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
>>>   Weight:  Point:
>>>   1.0: [1.000, 3.000]
>>>   1.0: [3.000, 2.000]
>>>
>>> Good hunting,
>>> Jeff
>>>
>>> On 9/29/10 2:26 PM, Matt Tanquary wrote:
>>>>
>>>> I was able to run the tutorials, etc. Now I would like to generate my
>>>> own small test.
>>>>
>>>> I created a data.dat file with these contents:
>>>> 22 21
>>>> 19 20
>>>> 18 22
>>>> 1 3
>>>> 3 2
>>>>
>>>> Then I ran: mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir
>>>>
>>>> This created kmeans/seqdir/chunk-0 in my DFS with the following content:
>>>> ź/%
>>>> /data.dat22 21
>>>> 19 20
>>>> 18 22
>>>> 1 3
>>>> 3 2
>>>>
>>>> Next I ran: mahout seq2sparse -i kmeans/seqdir -o kmeans/input
>>>>
>>>> This generated several things in kmeans/input, including the
>>>> 'tfidf/vectors' folder. Inside the vectors folder I get part-00000,
>>>> which contains:
>>>> řĎân
>>>> /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable
>>>> /data.dat@@
>>>>
>>>> It does not seem to have the numeric data at this point.
>>>>
>>>> I am hoping someone can shed some light on how I can get my datapoint
>>>> file into the proper vector format for running mahout kmeans.
>>>>
>>>> Just fyi, when I run kmeans against that file (mahout kmeans -i
>>>> kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2
>>>> -w) I get:
>>>>
>>>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>>>>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>>
>>>> which tells me it was unable to find even 1 vector in the given input
>>>> folder.
>>>>
>>>> Thanks for any comments you provide.
>>>> -M@
>>>
>>

--
Have you thanked a teacher today? ---> http://www.liftateacher.org
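For anyone following this thread later: the canopy step Jeff runs above can be
illustrated with a short standalone sketch. This is the textbook canopy
algorithm (pick a point as a center, collect everything within T1, retire
candidates within T2), not Mahout's actual implementation, and the T2 value
here (2.5 instead of the 2 used in the command line above) is an assumption
chosen purely so this simplified version reproduces the same three groups on
the five data.dat points:

```python
import math

def canopy(points, t1, t2):
    """Textbook canopy clustering: repeatedly take a remaining point as a
    canopy center, make every point within t1 a member of that canopy, and
    drop points within t2 from the candidate list (t2 < t1)."""
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)
        members = [p for p in points if math.dist(center, p) < t1]
        candidates = [p for p in candidates if math.dist(center, p) >= t2]
        canopies.append((center, members))
    return canopies

# The five points from data.dat in the thread.
points = [(22, 21), (19, 20), (18, 22), (1, 3), (3, 2)]
for center, members in canopy(points, t1=3.0, t2=2.5):
    print(center, members)
```

This yields three canopies, {(22,21)}, {(19,20),(18,22)}, and {(1,3),(3,2)},
matching the three clusters reported above; Mahout's own centers and radii
come from its weighted canopy bookkeeping, which this sketch does not model.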
