Re: fkmeans or Cluster Dumper not working?

Lance Norskog Wed, 27 Jul 2011 01:40:24 -0700

The fix got checked in this afternoon. The problem is that a line in
the shell script surrounds mahout-examples-*.job with quotes. Inside
quotes the shell does not glob-expand the wildcard, so the script
never finds the actual job file.

Look in the bin/mahout shell script, around line 127.
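
For anyone who wants to see the quoting effect in isolation, here is a
minimal sketch. It uses a throwaway temp directory and a stand-in copy of
the job file name, so it only illustrates shell quoting and is not the
actual code in bin/mahout (the variable names are made up):

```shell
# Hypothetical demo directory; the file name is only a stand-in for the job file.
demo_dir=$(mktemp -d)
touch "$demo_dir/mahout-examples-0.6-SNAPSHOT-job.jar"

# Quoted: the '*' stays a literal character, so no existing file matches.
quoted="$demo_dir/mahout-examples-*-job.jar"
if [ -e "$quoted" ]; then quoted_found=yes; else quoted_found=no; fi

# Unquoted wildcard: the shell expands it to the real file name.
for f in "$demo_dir"/mahout-examples-*-job.jar; do
  unquoted="$f"
done
if [ -e "$unquoted" ]; then unquoted_found=yes; else unquoted_found=no; fi

echo "quoted: $quoted_found, unquoted: $unquoted_found"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

Only the unquoted form resolves to a file that exists, which is why
removing the quotes around the wildcard lets the script find the job file.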

On 7/27/11, Jeffrey <[email protected]> wrote:
> erm, is there any workaround to the problem?
>
>
> ----- Original Message -----
>> From: Jeff Eastman <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc:
>> Sent: Tuesday, July 26, 2011 1:12 PM
>> Subject: RE: fkmeans or Cluster Dumper not working?
>>
>> Also makes sense that fuzzyk centroids would be completely dense, since
>> every point is a member of every cluster. My reducer heaps are 4G.
>>
>> -----Original Message-----
>> From: Jeff Eastman [mailto:[email protected]]
>> Sent: Monday, July 25, 2011 2:32 PM
>> To: [email protected]; Jeffrey
>> Subject: RE: fkmeans or Cluster Dumper not working?
>>
>> I'm able to run fuzzyk on your data set with k=10 and k=50 without
>> problems. I also ran it fine with k=100 just to push it a bit harder.
>> Runs took longer as k increased, as expected (39s, 2m50s, 5m57s), as
>> did the clustering (11s, 45s, 1m11s). The cluster dumper is throwing
>> an OutOfMemoryError (OME) with your data points, and probably also
>> with the larger cluster volumes, suggesting it needs a larger -Xmx
>> value since it is running locally and is not influenced by the
>> cluster vm parameters.
>>
>> I will try some more and keep you updated.
>>
>> The cluster dumper is throwing an OME trying to inhale all your data
>> points. It
>> is running locally
>>
>> -----Original Message-----
>> From: Jeffrey [mailto:[email protected]]
>> Sent: Sunday, July 24, 2011 12:51 AM
>> To: [email protected]
>> Subject: Re: fkmeans or Cluster Dumper not working?
>>
>> Erm, is there any update? Is the problem reproducible?
>>
>> Best wishes,
>> Jeffrey04
>>
>>
>>
>>> ________________________________
>>> From: Jeffrey <[email protected]>
>>> To: Jeff Eastman <[email protected]>;
>> "[email protected]" <[email protected]>
>>> Sent: Friday, July 22, 2011 12:40 AM
>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>
>>>
>>> Hi Jeff,
>>>
>>>
>>> lol, this is probably my last reply before i fall asleep (GMT+8 here).
>>>
>>>
>>> First thing first, data file is here: http://coolsilon.com/image-tag.mvc
>>>
>>>
>>> Q: What is the cardinality of your vector data?
>>> about 1000+ rows (resources) * 14 000+ columns (tags)
>>> Q: Is it sparse or dense?
>>> sparse (assuming sparse = each vector contains mostly 0)
>>> Q: How many vectors are you trying to cluster?
>>> all of them? (1000+ rows)
>>> Q: What is the exact error you see when fkmeans fails with k=10? With
>>> k=50?
>>> i think i posted the exception when k=50, but will post them again here
>>>
>>>
>>> k=10: fkmeans actually works, but cluster dumper throws an exception.
>>> However, if I take out --pointsDir, it works (output looks ok, but
>>> without all the points).
>>>
>>>
>>>     $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>     ...
>>>     $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>>     Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>     HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>     MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>     11/07/22 00:14:50 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusters/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>     Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>             at java.lang.Object.clone(Native Method)
>>>             at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
>>>             at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
>>>             at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94)
>>>             at org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55)
>>>             at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>>>             at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>>>             at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>>>             at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>>>             at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>>>             at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>>>             at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>>>             at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>>>             at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255)
>>>             at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>             at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>             at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>             at java.lang.reflect.Method.invoke(Method.java:616)
>>>             at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>             at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>             at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>             at java.lang.reflect.Method.invoke(Method.java:616)
>>>             at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>     $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>>     Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>     HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>     MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>     11/07/22 00:19:04 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>     11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 ms
>>>
>>>
>>> k=50: fkmeans shows an exception after map 100% reduce 0%, and retries
>>> (map 0% reduce 0%) after the exception
>>>
>>>
>>>     $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>>     Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>     HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>     MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>     11/07/22 00:21:07 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=sensei/clusters/clusters-0, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=false, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=50, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>>     11/07/22 00:21:09 INFO common.HadoopUtil: Deleting sensei/clusters
>>>     11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>     11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>     11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new compressor
>>>     11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new decompressor
>>>     11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 vectors to sensei/clusters/clusters-0/part-randomSeed
>>>     11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>>     11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths to process : 1
>>>     11/07/22 00:21:26 INFO mapred.JobClient: Running job: job_201107211512_0029
>>>     11/07/22 00:21:27 INFO mapred.JobClient:  map 0% reduce 0%
>>>     11/07/22 00:22:08 INFO mapred.JobClient:  map 1% reduce 0%
>>>     11/07/22 00:22:20 INFO mapred.JobClient:  map 2% reduce 0%
>>>     11/07/22 00:22:33 INFO mapred.JobClient:  map 3% reduce 0%
>>>     11/07/22 00:22:42 INFO mapred.JobClient:  map 4% reduce 0%
>>>     11/07/22 00:22:50 INFO mapred.JobClient:  map 5% reduce 0%
>>>     11/07/22 00:23:00 INFO mapred.JobClient:  map 6% reduce 0%
>>>     11/07/22 00:23:09 INFO mapred.JobClient:  map 7% reduce 0%
>>>     11/07/22 00:23:18 INFO mapred.JobClient:  map 8% reduce 0%
>>>     11/07/22 00:23:27 INFO mapred.JobClient:  map 9% reduce 0%
>>>     11/07/22 00:23:33 INFO mapred.JobClient:  map 10% reduce 0%
>>>     11/07/22 00:23:42 INFO mapred.JobClient:  map 11% reduce 0%
>>>     11/07/22 00:23:45 INFO mapred.JobClient:  map 12% reduce 0%
>>>     11/07/22 00:23:54 INFO mapred.JobClient:  map 13% reduce 0%
>>>     11/07/22 00:24:03 INFO mapred.JobClient:  map 14% reduce 0%
>>>     11/07/22 00:24:09 INFO mapred.JobClient:  map 15% reduce 0%
>>>     11/07/22 00:24:15 INFO mapred.JobClient:  map 16% reduce 0%
>>>     11/07/22 00:24:24 INFO mapred.JobClient:  map 17% reduce 0%
>>>     11/07/22 00:24:30 INFO mapred.JobClient:  map 18% reduce 0%
>>>     11/07/22 00:24:42 INFO mapred.JobClient:  map 19% reduce 0%
>>>     11/07/22 00:24:51 INFO mapred.JobClient:  map 20% reduce 0%
>>>     11/07/22 00:24:57 INFO mapred.JobClient:  map 21% reduce 0%
>>>     11/07/22 00:25:06 INFO mapred.JobClient:  map 22% reduce 0%
>>>     11/07/22 00:25:09 INFO mapred.JobClient:  map 23% reduce 0%
>>>     11/07/22 00:25:19 INFO mapred.JobClient:  map 24% reduce 0%
>>>     11/07/22 00:25:25 INFO mapred.JobClient:  map 25% reduce 0%
>>>     11/07/22 00:25:31 INFO mapred.JobClient:  map 26% reduce 0%
>>>     11/07/22 00:25:37 INFO mapred.JobClient:  map 27% reduce 0%
>>>     11/07/22 00:25:43 INFO mapred.JobClient:  map 28% reduce 0%
>>>     11/07/22 00:25:51 INFO mapred.JobClient:  map 29% reduce 0%
>>>     11/07/22 00:25:58 INFO mapred.JobClient:  map 30% reduce 0%
>>>     11/07/22 00:26:04 INFO mapred.JobClient:  map 31% reduce 0%
>>>     11/07/22 00:26:10 INFO mapred.JobClient:  map 32% reduce 0%
>>>     11/07/22 00:26:19 INFO mapred.JobClient:  map 33% reduce 0%
>>>     11/07/22 00:26:25 INFO mapred.JobClient:  map 34% reduce 0%
>>>     11/07/22 00:26:34 INFO mapred.JobClient:  map 35% reduce 0%
>>>     11/07/22 00:26:40 INFO mapred.JobClient:  map 36% reduce 0%
>>>     11/07/22 00:26:49 INFO mapred.JobClient:  map 37% reduce 0%
>>>     11/07/22 00:26:55 INFO mapred.JobClient:  map 38% reduce 0%
>>>     11/07/22 00:27:04 INFO mapred.JobClient:  map 39% reduce 0%
>>>     11/07/22 00:27:14 INFO mapred.JobClient:  map 40% reduce 0%
>>>     11/07/22 00:27:23 INFO mapred.JobClient:  map 41% reduce 0%
>>>     11/07/22 00:27:28 INFO mapred.JobClient:  map 42% reduce 0%
>>>     11/07/22 00:27:34 INFO mapred.JobClient:  map 43% reduce 0%
>>>     11/07/22 00:27:40 INFO mapred.JobClient:  map 44% reduce 0%
>>>     11/07/22 00:27:49 INFO mapred.JobClient:  map 45% reduce 0%
>>>     11/07/22 00:27:56 INFO mapred.JobClient:  map 46% reduce 0%
>>>     11/07/22 00:28:05 INFO mapred.JobClient:  map 47% reduce 0%
>>>     11/07/22 00:28:11 INFO mapred.JobClient:  map 48% reduce 0%
>>>     11/07/22 00:28:20 INFO mapred.JobClient:  map 49% reduce 0%
>>>     11/07/22 00:28:26 INFO mapred.JobClient:  map 50% reduce 0%
>>>     11/07/22 00:28:35 INFO mapred.JobClient:  map 51% reduce 0%
>>>     11/07/22 00:28:41 INFO mapred.JobClient:  map 52% reduce 0%
>>>     11/07/22 00:28:47 INFO mapred.JobClient:  map 53% reduce 0%
>>>     11/07/22 00:28:53 INFO mapred.JobClient:  map 54% reduce 0%
>>>     11/07/22 00:29:02 INFO mapred.JobClient:  map 55% reduce 0%
>>>     11/07/22 00:29:08 INFO mapred.JobClient:  map 56% reduce 0%
>>>     11/07/22 00:29:17 INFO mapred.JobClient:  map 57% reduce 0%
>>>     11/07/22 00:29:26 INFO mapred.JobClient:  map 58% reduce 0%
>>>     11/07/22 00:29:32 INFO mapred.JobClient:  map 59% reduce 0%
>>>     11/07/22 00:29:41 INFO mapred.JobClient:  map 60% reduce 0%
>>>     11/07/22 00:29:50 INFO mapred.JobClient:  map 61% reduce 0%
>>>     11/07/22 00:29:53 INFO mapred.JobClient:  map 62% reduce 0%
>>>     11/07/22 00:29:59 INFO mapred.JobClient:  map 63% reduce 0%
>>>     11/07/22 00:30:09 INFO mapred.JobClient:  map 64% reduce 0%
>>>     11/07/22 00:30:15 INFO mapred.JobClient:  map 65% reduce 0%
>>>     11/07/22 00:30:23 INFO mapred.JobClient:  map 66% reduce 0%
>>>     11/07/22 00:30:35 INFO mapred.JobClient:  map 67% reduce 0%
>>>     11/07/22 00:30:41 INFO mapred.JobClient:  map 68% reduce 0%
>>>     11/07/22 00:30:50 INFO mapred.JobClient:  map 69% reduce 0%
>>>     11/07/22 00:30:56 INFO mapred.JobClient:  map 70% reduce 0%
>>>     11/07/22 00:31:05 INFO mapred.JobClient:  map 71% reduce 0%
>>>     11/07/22 00:31:15 INFO mapred.JobClient:  map 72% reduce 0%
>>>     11/07/22 00:31:24 INFO mapred.JobClient:  map 73% reduce 0%
>>>     11/07/22 00:31:30 INFO mapred.JobClient:  map 74% reduce 0%
>>>     11/07/22 00:31:39 INFO mapred.JobClient:  map 75% reduce 0%
>>>     11/07/22 00:31:42 INFO mapred.JobClient:  map 76% reduce 0%
>>>     11/07/22 00:31:50 INFO mapred.JobClient:  map 77% reduce 0%
>>>     11/07/22 00:31:59 INFO mapred.JobClient:  map 78% reduce 0%
>>>     11/07/22 00:32:11 INFO mapred.JobClient:  map 79% reduce 0%
>>>     11/07/22 00:32:28 INFO mapred.JobClient:  map 80% reduce 0%
>>>     11/07/22 00:32:37 INFO mapred.JobClient:  map 81% reduce 0%
>>>     11/07/22 00:32:40 INFO mapred.JobClient:  map 82% reduce 0%
>>>     11/07/22 00:32:49 INFO mapred.JobClient:  map 83% reduce 0%
>>>     11/07/22 00:32:58 INFO mapred.JobClient:  map 84% reduce 0%
>>>     11/07/22 00:33:04 INFO mapred.JobClient:  map 85% reduce 0%
>>>     11/07/22 00:33:13 INFO mapred.JobClient:  map 86% reduce 0%
>>>     11/07/22 00:33:19 INFO mapred.JobClient:  map 87% reduce 0%
>>>     11/07/22 00:33:32 INFO mapred.JobClient:  map 88% reduce 0%
>>>     11/07/22 00:33:38 INFO mapred.JobClient:  map 89% reduce 0%
>>>     11/07/22 00:33:47 INFO mapred.JobClient:  map 90% reduce 0%
>>>     11/07/22 00:33:52 INFO mapred.JobClient:  map 91% reduce 0%
>>>     11/07/22 00:34:01 INFO mapred.JobClient:  map 92% reduce 0%
>>>     11/07/22 00:34:10 INFO mapred.JobClient:  map 93% reduce 0%
>>>     11/07/22 00:34:13 INFO mapred.JobClient:  map 94% reduce 0%
>>>     11/07/22 00:34:25 INFO mapred.JobClient:  map 95% reduce 0%
>>>     11/07/22 00:34:31 INFO mapred.JobClient:  map 96% reduce 0%
>>>     11/07/22 00:34:40 INFO mapred.JobClient:  map 97% reduce 0%
>>>     11/07/22 00:34:47 INFO mapred.JobClient:  map 98% reduce 0%
>>>     11/07/22 00:34:56 INFO mapred.JobClient:  map 99% reduce 0%
>>>     11/07/22 00:35:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>     11/07/22 00:35:07 INFO mapred.JobClient: Task Id : attempt_201107211512_0029_m_000000_0, Status : FAILED
>>>     org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>>             at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>             at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>             at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>             at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>             at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>             at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>             at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>             at java.security.AccessController.doPrivileged(Native Method)
>>>             at javax.security.auth.Subject.doAs(Subject.java:416)
>>>             at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>             at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>
>>>
>>>     11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
>>>     ...
>>>
>>>
>>> Q: What are the Hadoop heap settings you are using for your job?
>>> I am new to hadoop, not sure where to get those, but got these from
>>> localhost:50070, is it right?
>>> 147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB /
>>> 966.69 MB (3%)
>>>
>>>
>>> p/s: i keep forgetting to include my operating environment, sorry. I
>>> basically run this in a guest operating system (in a virtualbox virtual
>>> machine), assigned 1 CPU core and 1.5GB of memory. The host operating
>>> system is OS X 10.6.8 running on an alubook (macbook late 2008 model)
>>> with 4GB of memory.
>>>
>>>
>>>     $ cat /etc/*-release
>>>     DISTRIB_ID=Ubuntu
>>>     DISTRIB_RELEASE=11.04
>>>     DISTRIB_CODENAME=natty
>>>     DISTRIB_DESCRIPTION="Ubuntu 11.04"
>>>     $ uname -a
>>>     Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC 2011 i686 i686 i386 GNU/Linux
>>>
>>>
>>> Best wishes,
>>> Jeffrey04
>>>
>>>> ________________________________
>>>> From: Jeff Eastman <[email protected]>
>>>> To: "[email protected]" <[email protected]>;
>> Jeffrey <[email protected]>
>>>> Sent: Thursday, July 21, 2011 11:54 PM
>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>>
>>>> Excellent, so this appears to be localized to fuzzyk. Unfortunately, the
>>>> Apache mail server strips off attachments so you'd need another mechanism
>>>> (a JIRA?) to upload your data if it is not too large. Some more questions
>>>> in the interim:
>>>>
>>>> - What is the cardinality of your vector data?
>>>> - Is it sparse or dense?
>>>> - How many vectors are you trying to cluster?
>>>> - What is the exact error you see when fkmeans fails with k=10? With
>> k=50?
>>>> - What are the Hadoop heap settings you are using for your job?
>>>>
>>>> -----Original Message-----
>>>> From: Jeffrey [mailto:[email protected]]
>>>> Sent: Thursday, July 21, 2011 11:21 AM
>>>> To: [email protected]
>>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>>
>>>> Hi Jeff,
>>>>
>>>> Q: Did you change your invocation to specify a different -c directory
>>>> (e.g. clusters-0)?
>>>> A: Yes :)
>>>>
>>>> Q: Did you add the -cl argument?
>>>> A: Yes :)
>>>>
>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>>>
>>>> Q: What is the new CLI invocation for clusterdump?
>>>> A:
>>>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>>>
>>>>
>>>> Q: Did this work for -k 10? What happens with -k 50?
>>>> A: works for k=5 (but i don't see the points), but not k=10. fkmeans
>>>> fails when k=50, so i can't dump when k=50
>>>>
>>>> Q: Have you tried kmeans?
>>>> A: Yes (all tested on 0.6-snapshot)
>>>>
>>>> k=5: no problem :)
>>>> k=10: no problem :)
>>>> k=50: no problem :)
>>>>
>>>> p/s: attached with the test data i used (in mvc format), let me know if
>>>> you guys prefer raw data in arff format
>>>>
>>>> Best wishes,
>>>> Jeffrey04
>>>>
>>>>
>>>>
>>>>> ________________________________
>>>>> From: Jeff Eastman <[email protected]>
>>>>> To: "[email protected]"
>> <[email protected]>; Jeffrey <[email protected]>
>>>>> Sent: Thursday, July 21, 2011 9:36 PM
>>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>
>>>>> You are correct, the wiki for fkmeans did not mention the -cl argument.
>>>>> I've added that just now. I think this is what Frank means in his
>>>>> comment but you do *not* have to write any custom code to get the
>>>>> cluster dumper to do what you want, just use the -cl argument and
>>>>> specify clusteredPoints as the -p input to clusterdump.
>>>>>
>>>>> Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show
>>>>> how to invoke the clustering and cluster dumper from Java at least.
>>>>>
>>>>> Did you change your invocation to specify a different -c directory
>>>>> (e.g. clusters-0)?
>>>>> Did you add the -cl argument?
>>>>> What is the new CLI invocation for clusterdump?
>>>>> Did this work for -k 10? What happens with -k 50?
>>>>> Have you tried kmeans?
>>>>>
>>>>> I can help you better if you will give me answers to my questions
>>>>>
>>>>> -----Original Message-----
>>>>> From: Jeffrey [mailto:[email protected]]
>>>>> Sent: Thursday, July 21, 2011 4:30 AM
>>>>> To: [email protected]
>>>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>>>
>>>>> Hi again,
>>>>>
>>>>> Let me update on what's working and what's not working.
>>>>>
>>>>> Works:
>>>>> fkmeans clustering (10 clusters) - thanks Jeff for the -cl tip
>>>>> fkmeans clustering (5 clusters)
>>>>> clusterdump (5 clusters) - so points are not included in the
>>>>> clusterdump and I need to write a program for it?
>>>>>
>>>>> Not Working:
>>>>> fkmeans clustering (50 clusters) - same error
>>>>> clusterdump (10 clusters) - same error
>>>>>
>>>>>
>>>>> So it seems that to attach points to the cluster dumper output like the
>>>>> synthetic control example does, i would have to write some code, as
>>>>> pointed out by @Frank_Scholten?
>>>>> https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>>>>
>>>>> Best wishes,
>>>>> Jeffrey04
>>>>>
>>>>>> ________________________________
>>>>>> From: Jeff Eastman <[email protected]>
>>>>>> To: "[email protected]"
>> <[email protected]>; Jeffrey <[email protected]>
>>>>>> Sent: Wednesday, July 20, 2011 11:53 PM
>>>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>>
>>>>>> Hi Jeffrey,
>>>>>>
>>>>>> It is always difficult to debug remotely, but here are some
>> suggestions:
>>>>>> - First, you are specifying both an input clusters directory
>>>>>> --clusters and --numClusters clusters, so the job is sampling 10 points
>>>>>> from your input data set and writing them to clusteredPoints as the
>>>>>> prior clusters for the first iteration. You should pick a different
>>>>>> name for this directory, as the clusteredPoints directory is used by
>>>>>> the -cl (--clustering) option (which you did not supply) to write out
>>>>>> the clustered (classified) input vectors. When you subsequently
>>>>>> supplied clusteredPoints to the clusterdumper it was expecting a
>>>>>> different format and that caused the exception you saw. Change your
>>>>>> --clusters directory (clusters-0 is good) and add a -cl argument and
>>>>>> things should go more smoothly. The -cl option is not the default and
>>>>>> so no clustering of the input points is performed without this (many
>>>>>> people get caught by this and perhaps the default should be changed,
>>>>>> but clustering can be expensive and so it is not performed without
>>>>>> request).
>>>>>> - If you still have problems, try again with k-means. The similarity
>>>>>> to fkmeans is good and it will eliminate fkmeans itself if you see the
>>>>>> same problems with k-means.
>>>>>> - I don't see why changing the -k argument from 10 to 50 should cause
>>>>>> any problems, unless your vectors are very large and you are getting
>>>>>> an OME in the reducer. Since the reducer is calculating centroid
>>>>>> vectors for the next iteration these will become more dense and memory
>>>>>> will increase substantially.
>>>>>> - I can't figure out what might be causing your second exception. It
>>>>>> is bombing inside of Hadoop file IO and this causes me to suspect
>>>>>> command argument problems.
>>>>>>
>>>>>> Hope this helps,
>>>>>> Jeff
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Jeffrey [mailto:[email protected]]
>>>>>> Sent: Wednesday, July 20, 2011 2:41 AM
>>>>>> To: [email protected]
>>>>>> Subject: fkmeans or Cluster Dumper not working?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to generate clusters using the fkmeans command line tool
>>>>>> from my test data. Not sure if this is correct, as it only runs one
>>>>>> iteration (output from 0.6-snapshot, gotta use some workaround to some
>>>>>> weird bug -
>>>>>> http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
>>>>>> )
>>>>>>
>>>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
>>>>>> Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>> MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>> 11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>>>>> 11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>>>>> 11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>>>>> 11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>>> 11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>>>> 11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>>>>> 11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>>>>> 11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
>>>>>> 11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>>>>> 11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>>>>>> 11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>>>>> 11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>> 11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
>>>>>> 11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
>>>>>> 11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
>>>>>> 11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
>>>>>> 11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
>>>>>> 11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
>>>>>> 11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
>>>>>> 11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
>>>>>> 11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
>>>>>> 11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
>>>>>> 11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
>>>>>> 11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
>>>>>> 11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
>>>>>> 11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
>>>>>> 11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
>>>>>> 11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
>>>>>> 11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
>>>>>> 11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
>>>>>> 11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
>>>>>> 11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
>>>>>> 11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
>>>>>> 11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
>>>>>> 11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
>>>>>> 11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
>>>>>> 11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
>>>>>> 11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
>>>>>> 11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
>>>>>> 11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
>>>>>> 11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
>>>>>> 11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
>>>>>> 11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
>>>>>> 11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
>>>>>> 11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
>>>>>> 11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
>>>>>> 11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
>>>>>> 11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
>>>>>> 11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
>>>>>> 11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
>>>>>> 11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
>>>>>> 11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
>>>>>> 11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
>>>>>> 11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>> 11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized bytes=2246233
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>>>>>> 11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096
>> ms
>>>>>>
>>>>>> If I increase the --numClusters argument (e.g. to 50), the job
>>>>>> throws an exception after
>>>>>> 11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>>
>>>>>> and then retries (also reproducible with 0.6-snapshot):
>>>>>>
>>>>>> ...
>>>>>> 11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>> 11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>>>>>         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>>>>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>>>>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>>>>         at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>>>>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>>         at javax.security.auth.Subject.doAs(Subject.java:416)
>>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>>>
>>>>>> 11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>> ...
>>>>>>
>>>>>> Then I ran the cluster dumper to dump information about the
>>>>>> clusters. This command works if I only care about the cluster
>>>>>> centroids (both 0.5 release and 0.6-snapshot):
>>>>>>
>>>>>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>>>>> Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>> MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>> 11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>>> 11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>>>>>
>>>>>> But if I want to see the degree of membership of each point, I
>>>>>> get another exception (reproducible with both the 0.5 release and
>>>>>> 0.6-snapshot):
>>>>>>
>>>>>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>>>> Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>> MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>> 11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>>> 11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>>> 11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>>>> 11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>>>>>> Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>>
>>>>>> Erm, would writing a short program that calls the API be a better
>>>>>> choice here (btw, I can't seem to find the latest API doc)? Or did
>>>>>> I do anything wrong here (yes, Java is not my main language, and I
>>>>>> am very new to Mahout.. and h)?
>>>>>>
>>>>>> The data is converted from an ARFF file with about 1000 rows
>>>>>> (resources) and 14k columns (tags), and it is just a subset of my
>>>>>> data. (I actually made a mistake, so it is now generating resource
>>>>>> clusters instead of tag clusters, but I am only doing this as a
>>>>>> proof of concept to see whether Mahout is good enough for the
>>>>>> task.)
>>>>>>
>>>>>> Best wishes,
>>>>>> Jeffrey04
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
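
A rough back-of-the-envelope check on the quoted DiskErrorException, offered as one plausible reading rather than a diagnosis: the successful k=10 run reports "Map output bytes=2499995001", and "Could not find any valid local directory" is what Hadoop reports when no local directory has room for the map output. The sketch below assumes map output grows roughly linearly with the number of clusters (an assumption, not something the logs confirm), and /tmp is only a placeholder for wherever your mapred.local.dir points.

```shell
#!/bin/sh
# Figures taken from the successful k=10 run in the quoted log.
map_output_bytes=2499995001
k_ok=10
k_fail=50

# Assumption: map output scales roughly linearly with the number of clusters.
estimate=$(( map_output_bytes / k_ok * k_fail ))
estimate_gb=$(( estimate / 1000000000 ))
echo "estimated map output at k=$k_fail: about ${estimate_gb} GB"

# Compare against the free space on the local spill directory.
# /tmp is a placeholder; substitute the directory from mapred.local.dir.
df -k /tmp
```

If the estimate is larger than the free space df reports, pointing mapred.local.dir at a bigger disk would be the first thing to try.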

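For anyone who wants to see the quoting problem in bin/mahout in isolation, here is a minimal sketch. It runs in a scratch directory; the file name is borrowed from the MAHOUT-JOB line in the quoted log, and the pattern and variable names are illustrative, not the actual lines of the script.

```shell
#!/bin/sh
# Scratch directory standing in for examples/target (hypothetical layout).
tmp=$(mktemp -d)
touch "$tmp/mahout-examples-0.6-SNAPSHOT-job.jar"

# Quoted: the wildcard stays a literal '*', so the test looks for a file
# whose name really contains an asterisk and finds nothing.
quoted="$tmp/mahout-examples-*-job.jar"
if [ -e "$quoted" ]; then quoted_found=yes; else quoted_found=no; fi

# Unquoted wildcard: the shell expands the pattern to the real file name.
unquoted_found=no
for f in "$tmp"/mahout-examples-*-job.jar; do
  if [ -e "$f" ]; then unquoted_found=yes; job=$f; fi
done

echo "quoted: $quoted_found, unquoted: $unquoted_found"
rm -rf "$tmp"
```

Only the directory part is quoted in the loop, which keeps paths with spaces safe while still letting the wildcard expand.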

-- 
Lance Norskog
[email protected]
