Re: fkmeans or Cluster Dumper not working?

Jeffrey Thu, 21 Jul 2011 09:41:25 -0700

Hi Jeff,

lol, this is probably my last reply before I fall asleep (GMT+8 here).

First things first, the data file is here: http://coolsilon.com/image-tag.mvc

Q: What is the cardinality of your vector data?
A: About 1,000+ rows (resources) * 14,000+ columns (tags), so each vector has 
14,000+ dimensions.
Q: Is it sparse or dense?
A: Sparse (assuming sparse means each vector contains mostly zeros).
Q: How many vectors are you trying to cluster?
A: All of them (1,000+ rows).
Q: What is the exact error you see when fkmeans fails with k=10? With k=50?
A: I think I already posted the exception for k=50, but I will post both again 
here.

With k=10, fkmeans actually works, but the cluster dumper throws an exception. 
If I take out --pointsDir, it works (the output looks OK, but without the 
points).

    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
    ...
    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
    Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
    MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
    11/07/22 00:14:50 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusters/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.lang.Object.clone(Native Method)
            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
            at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94)
            at org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55)
            at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
            at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
            at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
            at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
            at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
            at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
            at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
            at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
            at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255)
            at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
            at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
            at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:616)
            at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
            at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
            at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:616)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
    Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
    MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
    11/07/22 00:19:04 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
    11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 ms
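
A rough estimate of why the run with --pointsDir might exhaust the heap, under assumptions that this thread does not confirm: with --emitMostLikely false and a threshold of 0, each of the 1,113 input vectors (the "Map input records" figure from the earlier log) is written to clusteredPoints once per cluster; the vectors are read back as dense vectors (the trace goes through DenseVector) at 8 bytes per dimension; and the cluster dumper keeps all of them in memory at once. The 14,000 dimensions are the rounded column count, and the 1,000 MB heap is Hadoop's stock client default, which may differ on this setup.

```shell
# Back-of-envelope heap estimate for clusterdump --pointsDir.
# Every input below is an assumption taken from, or rounded from, the thread.
points=1113        # "Map input records" reported by the earlier fkmeans run
clusters=10        # --numClusters
dims=14000         # approximate vector cardinality (14,000+ tag columns)
bytes_per_value=8  # one double per dimension in a dense vector

entries=$(( points * clusters ))                     # one weighted copy per (point, cluster)
total_bytes=$(( entries * dims * bytes_per_value ))  # payload only, no object overhead
total_mib=$(( total_bytes / 1024 / 1024 ))

default_heap_mib=1000  # assumed client heap (Hadoop's stock default)

echo "entries:  $entries"
echo "estimate: $total_mib MiB"
echo "heap:     $default_heap_mib MiB"
```

Under these assumptions the vector payload alone comes to roughly 1.2 GB, which would not fit in a 1,000 MB heap. That would be consistent with the dump succeeding once --pointsDir is removed.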

With k=50, fkmeans throws an exception once the job reaches map 100% reduce 0%, 
and the task is then retried from map 0% reduce 0%.

    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
    Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
    MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
    11/07/22 00:21:07 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=sensei/clusters/clusters-0, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=false, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=50, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
    11/07/22 00:21:09 INFO common.HadoopUtil: Deleting sensei/clusters
    11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new compressor
    11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new decompressor
    11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 vectors to sensei/clusters/clusters-0/part-randomSeed
    11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
    11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths to process : 1
    11/07/22 00:21:26 INFO mapred.JobClient: Running job: job_201107211512_0029
    11/07/22 00:21:27 INFO mapred.JobClient:  map 0% reduce 0%
    11/07/22 00:22:08 INFO mapred.JobClient:  map 1% reduce 0%
    11/07/22 00:22:20 INFO mapred.JobClient:  map 2% reduce 0%
    11/07/22 00:22:33 INFO mapred.JobClient:  map 3% reduce 0%
    11/07/22 00:22:42 INFO mapred.JobClient:  map 4% reduce 0%
    11/07/22 00:22:50 INFO mapred.JobClient:  map 5% reduce 0%
    11/07/22 00:23:00 INFO mapred.JobClient:  map 6% reduce 0%
    11/07/22 00:23:09 INFO mapred.JobClient:  map 7% reduce 0%
    11/07/22 00:23:18 INFO mapred.JobClient:  map 8% reduce 0%
    11/07/22 00:23:27 INFO mapred.JobClient:  map 9% reduce 0%
    11/07/22 00:23:33 INFO mapred.JobClient:  map 10% reduce 0%
    11/07/22 00:23:42 INFO mapred.JobClient:  map 11% reduce 0%
    11/07/22 00:23:45 INFO mapred.JobClient:  map 12% reduce 0%
    11/07/22 00:23:54 INFO mapred.JobClient:  map 13% reduce 0%
    11/07/22 00:24:03 INFO mapred.JobClient:  map 14% reduce 0%
    11/07/22 00:24:09 INFO mapred.JobClient:  map 15% reduce 0%
    11/07/22 00:24:15 INFO mapred.JobClient:  map 16% reduce 0%
    11/07/22 00:24:24 INFO mapred.JobClient:  map 17% reduce 0%
    11/07/22 00:24:30 INFO mapred.JobClient:  map 18% reduce 0%
    11/07/22 00:24:42 INFO mapred.JobClient:  map 19% reduce 0%
    11/07/22 00:24:51 INFO mapred.JobClient:  map 20% reduce 0%
    11/07/22 00:24:57 INFO mapred.JobClient:  map 21% reduce 0%
    11/07/22 00:25:06 INFO mapred.JobClient:  map 22% reduce 0%
    11/07/22 00:25:09 INFO mapred.JobClient:  map 23% reduce 0%
    11/07/22 00:25:19 INFO mapred.JobClient:  map 24% reduce 0%
    11/07/22 00:25:25 INFO mapred.JobClient:  map 25% reduce 0%
    11/07/22 00:25:31 INFO mapred.JobClient:  map 26% reduce 0%
    11/07/22 00:25:37 INFO mapred.JobClient:  map 27% reduce 0%
    11/07/22 00:25:43 INFO mapred.JobClient:  map 28% reduce 0%
    11/07/22 00:25:51 INFO mapred.JobClient:  map 29% reduce 0%
    11/07/22 00:25:58 INFO mapred.JobClient:  map 30% reduce 0%
    11/07/22 00:26:04 INFO mapred.JobClient:  map 31% reduce 0%
    11/07/22 00:26:10 INFO mapred.JobClient:  map 32% reduce 0%
    11/07/22 00:26:19 INFO mapred.JobClient:  map 33% reduce 0%
    11/07/22 00:26:25 INFO mapred.JobClient:  map 34% reduce 0%
    11/07/22 00:26:34 INFO mapred.JobClient:  map 35% reduce 0%
    11/07/22 00:26:40 INFO mapred.JobClient:  map 36% reduce 0%
    11/07/22 00:26:49 INFO mapred.JobClient:  map 37% reduce 0%
    11/07/22 00:26:55 INFO mapred.JobClient:  map 38% reduce 0%
    11/07/22 00:27:04 INFO mapred.JobClient:  map 39% reduce 0%
    11/07/22 00:27:14 INFO mapred.JobClient:  map 40% reduce 0%
    11/07/22 00:27:23 INFO mapred.JobClient:  map 41% reduce 0%
    11/07/22 00:27:28 INFO mapred.JobClient:  map 42% reduce 0%
    11/07/22 00:27:34 INFO mapred.JobClient:  map 43% reduce 0%
    11/07/22 00:27:40 INFO mapred.JobClient:  map 44% reduce 0%
    11/07/22 00:27:49 INFO mapred.JobClient:  map 45% reduce 0%
    11/07/22 00:27:56 INFO mapred.JobClient:  map 46% reduce 0%
    11/07/22 00:28:05 INFO mapred.JobClient:  map 47% reduce 0%
    11/07/22 00:28:11 INFO mapred.JobClient:  map 48% reduce 0%
    11/07/22 00:28:20 INFO mapred.JobClient:  map 49% reduce 0%
    11/07/22 00:28:26 INFO mapred.JobClient:  map 50% reduce 0%
    11/07/22 00:28:35 INFO mapred.JobClient:  map 51% reduce 0%
    11/07/22 00:28:41 INFO mapred.JobClient:  map 52% reduce 0%
    11/07/22 00:28:47 INFO mapred.JobClient:  map 53% reduce 0%
    11/07/22 00:28:53 INFO mapred.JobClient:  map 54% reduce 0%
    11/07/22 00:29:02 INFO mapred.JobClient:  map 55% reduce 0%
    11/07/22 00:29:08 INFO mapred.JobClient:  map 56% reduce 0%
    11/07/22 00:29:17 INFO mapred.JobClient:  map 57% reduce 0%
    11/07/22 00:29:26 INFO mapred.JobClient:  map 58% reduce 0%
    11/07/22 00:29:32 INFO mapred.JobClient:  map 59% reduce 0%
    11/07/22 00:29:41 INFO mapred.JobClient:  map 60% reduce 0%
    11/07/22 00:29:50 INFO mapred.JobClient:  map 61% reduce 0%
    11/07/22 00:29:53 INFO mapred.JobClient:  map 62% reduce 0%
    11/07/22 00:29:59 INFO mapred.JobClient:  map 63% reduce 0%
    11/07/22 00:30:09 INFO mapred.JobClient:  map 64% reduce 0%
    11/07/22 00:30:15 INFO mapred.JobClient:  map 65% reduce 0%
    11/07/22 00:30:23 INFO mapred.JobClient:  map 66% reduce 0%
    11/07/22 00:30:35 INFO mapred.JobClient:  map 67% reduce 0%
    11/07/22 00:30:41 INFO mapred.JobClient:  map 68% reduce 0%
    11/07/22 00:30:50 INFO mapred.JobClient:  map 69% reduce 0%
    11/07/22 00:30:56 INFO mapred.JobClient:  map 70% reduce 0%
    11/07/22 00:31:05 INFO mapred.JobClient:  map 71% reduce 0%
    11/07/22 00:31:15 INFO mapred.JobClient:  map 72% reduce 0%
    11/07/22 00:31:24 INFO mapred.JobClient:  map 73% reduce 0%
    11/07/22 00:31:30 INFO mapred.JobClient:  map 74% reduce 0%
    11/07/22 00:31:39 INFO mapred.JobClient:  map 75% reduce 0%
    11/07/22 00:31:42 INFO mapred.JobClient:  map 76% reduce 0%
    11/07/22 00:31:50 INFO mapred.JobClient:  map 77% reduce 0%
    11/07/22 00:31:59 INFO mapred.JobClient:  map 78% reduce 0%
    11/07/22 00:32:11 INFO mapred.JobClient:  map 79% reduce 0%
    11/07/22 00:32:28 INFO mapred.JobClient:  map 80% reduce 0%
    11/07/22 00:32:37 INFO mapred.JobClient:  map 81% reduce 0%
    11/07/22 00:32:40 INFO mapred.JobClient:  map 82% reduce 0%
    11/07/22 00:32:49 INFO mapred.JobClient:  map 83% reduce 0%
    11/07/22 00:32:58 INFO mapred.JobClient:  map 84% reduce 0%
    11/07/22 00:33:04 INFO mapred.JobClient:  map 85% reduce 0%
    11/07/22 00:33:13 INFO mapred.JobClient:  map 86% reduce 0%
    11/07/22 00:33:19 INFO mapred.JobClient:  map 87% reduce 0%
    11/07/22 00:33:32 INFO mapred.JobClient:  map 88% reduce 0%
    11/07/22 00:33:38 INFO mapred.JobClient:  map 89% reduce 0%
    11/07/22 00:33:47 INFO mapred.JobClient:  map 90% reduce 0%
    11/07/22 00:33:52 INFO mapred.JobClient:  map 91% reduce 0%
    11/07/22 00:34:01 INFO mapred.JobClient:  map 92% reduce 0%
    11/07/22 00:34:10 INFO mapred.JobClient:  map 93% reduce 0%
    11/07/22 00:34:13 INFO mapred.JobClient:  map 94% reduce 0%
    11/07/22 00:34:25 INFO mapred.JobClient:  map 95% reduce 0%
    11/07/22 00:34:31 INFO mapred.JobClient:  map 96% reduce 0%
    11/07/22 00:34:40 INFO mapred.JobClient:  map 97% reduce 0%
    11/07/22 00:34:47 INFO mapred.JobClient:  map 98% reduce 0%
    11/07/22 00:34:56 INFO mapred.JobClient:  map 99% reduce 0%
    11/07/22 00:35:02 INFO mapred.JobClient:  map 100% reduce 0%
    11/07/22 00:35:07 INFO mapred.JobClient: Task Id : attempt_201107211512_0029_m_000000_0, Status : FAILED
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
            at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
            at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
            at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
            at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
            at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
            at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:416)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
            at org.apache.hadoop.mapred.Child.main(Child.java:253)

    11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
    ...
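
This DiskErrorException is raised by Hadoop's LocalDirAllocator, which typically reports it when none of the configured local directories can take the file being written. A full or nearly full local disk is a common cause, though nothing in this thread confirms that it is the cause here. A minimal sketch for checking free space, assuming the stock Hadoop 0.20 layout in which map spill files go under /tmp/hadoop-<user>/mapred/local; `free_mib` is a hypothetical helper, not part of Hadoop or Mahout:

```shell
# Print the free space, in MiB, on the filesystem that holds a directory.
# free_mib is an illustrative helper, not part of Hadoop or Mahout.
free_mib() {
  # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
  # Column 4 of that line is the available space.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Assumed parent of the default hadoop.tmp.dir; change this if
# mapred.local.dir or hadoop.tmp.dir is set to something else.
local_root="/tmp"

echo "free space under $local_root: $(free_mib "$local_root") MiB"
```

If the reported figure is small compared with the intermediate data a k=50 run writes, freeing space or pointing mapred.local.dir at a larger disk would be the first thing to try.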

Q: What are the Hadoop heap settings you are using for your job?
A: I am new to Hadoop and not sure where to find those, but I got the following 
from localhost:50070. Is that the right place?
147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB / 
966.69 MB (3%)
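
For reference, the page at localhost:50070 is the NameNode status page, so the heap figure above describes the NameNode JVM rather than the map and reduce tasks. In Hadoop 0.20.x the task heap is normally set through the mapred.child.java.opts property (stock default -Xmx200m), usually in conf/mapred-site.xml. A small sketch for reading that property, assuming each property has a value element on the same line as its name or on a following line, as in the stock configuration files; `hadoop_prop` is a hypothetical helper, not a Hadoop command:

```shell
# Print the <value> of a property from a Hadoop *-site.xml file.
# Prints nothing if the property is not set in that file.
# hadoop_prop is an illustrative helper, not a Hadoop command.
hadoop_prop() {
  awk -v name="$2" '
    # Remember that the requested <name> element has been seen.
    index($0, "<name>" name "</name>") { found = 1 }
    # The first <value> element at or after that point belongs to it.
    found && match($0, /<value>.*<\/value>/) {
      # Strip the surrounding <value> and </value> tags (7 and 8 characters).
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}
```

A call such as `hadoop_prop "$HADOOP_CONF_DIR/mapred-site.xml" mapred.child.java.opts` prints the configured task JVM options; empty output would suggest the default is in effect.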

P.S. I keep forgetting to include my operating environment, sorry. I run this 
in a guest operating system (a VirtualBox virtual machine) that is assigned 
1 CPU core and 1.5 GB of memory. The host operating system is OS X 10.6.8, 
running on an aluminium MacBook (late 2008 model) with 4 GB of memory.

    $ cat /etc/*-release
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=11.04
    DISTRIB_CODENAME=natty
    DISTRIB_DESCRIPTION="Ubuntu 11.04"
    $ uname -a
    Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC 2011 i686 i686 i386 GNU/Linux

Best wishes,
Jeffrey04

>________________________________
>From: Jeff Eastman <[email protected]>
>To: "[email protected]" <[email protected]>; Jeffrey 
><[email protected]>
>Sent: Thursday, July 21, 2011 11:54 PM
>Subject: RE: fkmeans or Cluster Dumper not working?
>
>Excellent, so this appears to be localized to fuzzyk. Unfortunately, the 
>Apache mail server strips off attachments so you'd need another mechanism (a 
>JIRA?) to upload your data if it is not too large. Some more questions in the 
>interim:
>
>- What is the cardinality of your vector data?
>- Is it sparse or dense?
>- How many vectors are you trying to cluster?
>- What is the exact error you see when fkmeans fails with k=10? With k=50?
>- What are the Hadoop heap settings you are using for your job?
>
>-----Original Message-----
>From: Jeffrey [mailto:[email protected]]
>Sent: Thursday, July 21, 2011 11:21 AM
>To: [email protected]
>Subject: Re: fkmeans or Cluster Dumper not working?
>
>Hi Jeff,
>
>Q: Did you change your invocation to specify a different -c directory (e.g. 
>clusters-0)?
>A: Yes :)
>
>Q: Did you add the -cl argument?
>A: Yes :)
>
>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>
>Q: What is the new CLI invocation for clusterdump?
>A:
>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>
>
>Q: Did this work for -k 10? What happens with -k 50?
>A: Works for k=5 (but I don't see the points), but not for k=10. fkmeans fails 
>when k=50, so I can't dump the clusters for k=50.
>
>Q: Have you tried kmeans?
>A: Yes (all tested on 0.6-snapshot)
>
>k=5: no problem :)
>k=10: no problem :)
>k=50: no problem :)
>
>p/s: attached with the test data i used (in mvc format), let me know if you 
>guys prefer raw data in arff format
>
>Best wishes,
>Jeffrey04
>
>
>
>>________________________________
>>From: Jeff Eastman <[email protected]>
>>To: "[email protected]" <[email protected]>; Jeffrey 
>><[email protected]>
>>Sent: Thursday, July 21, 2011 9:36 PM
>>Subject: RE: fkmeans or Cluster Dumper not working?
>>
>>You are correct, the wiki for fkmeans did not mention the -cl argument. I've 
>>added that just now. I think this is what Frank means in his comment but you 
>>do *not* have to write any custom code to get the cluster dumper to do what 
>>you want, just use the -cl argument and specify clusteredPoints as the -p 
>>input to clusterdump.
>>
>>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how 
>>to invoke the clustering and cluster dumper from Java at least.
>>
>>Did you change your invocation to specify a different -c directory (e.g. 
>>clusters-0)?
>>Did you add the -cl argument?
>>What is the new CLI invocation for clusterdump?
>>Did this work for -k 10? What happens with -k 50?
>>Have you tried kmeans?
>>
>>I can help you better if you will give me answers to my questions
>>
>>-----Original Message-----
>>From: Jeffrey [mailto:[email protected]]
>>Sent: Thursday, July 21, 2011 4:30 AM
>>To: [email protected]
>>Subject: Re: fkmeans or Cluster Dumper not working?
>>
>>Hi again,
>>
>>Let me update on what's working and what's not working.
>>
>>Works:
>>fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
>>fkmeans clustering (5 clusters)
>>clusterdump (5 clusters) - so points are not included in the clusterdump and 
>>I need to write a program for it?
>>
>>Not Working:
>>fkmeans clustering (50 clusters) - same error
>>clusterdump (10 clusters) - same error
>>
>>
>>So it seems that, to attach points to the cluster dumper output as the 
>>synthetic control example does, I would have to write some code, as pointed 
>>out by @Frank_Scholten? 
>>https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>
>>Best wishes,
>>Jeffrey04
>>
>>>________________________________
>>>From: Jeff Eastman <[email protected]>
>>>To: "[email protected]" <[email protected]>; Jeffrey 
>>><[email protected]>
>>>Sent: Wednesday, July 20, 2011 11:53 PM
>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>
>>>Hi Jeffrey,
>>>
>>>It is always difficult to debug remotely, but here are some suggestions:
>>>- First, you are specifying both an input clusters directory --clusters and 
>>>--numClusters clusters so the job is sampling 10 points from your input data 
>>>set and writing them to clusteredPoints as the prior clusters for the first 
>>>iteration. You should pick a different name for this directory, as the 
>>>clusteredPoints directory is used by the -cl (--clustering) option (which 
>>>you did not supply) to write out the clustered (classified) input vectors. 
>>>When you subsequently supplied clusteredPoints to the clusterdumper it was 
>>>expecting a different format and that caused the exception you saw. Change 
>>>your --clusters directory (clusters-0 is good) and add a -cl argument and 
>>>things should go more smoothly. The -cl option is not the default and so no 
>>>clustering of the input points is performed without this (Many people get 
>>>caught by this and perhaps the default should be changed, but clustering can 
>>>be expensive and so it is not performed without request).
>>>- If you still have problems, try again with k-means. The similarity to 
>>>fkmeans is good and it will eliminate fkmeans itself if you see the same 
>>>problems with k-means
>>>- I don't see why changing the -k argument from 10 to 50 should cause any 
>>>problems, unless your vectors are very large and you are getting an 
>>>OutOfMemoryError in the reducer. Since the reducer is calculating centroid 
>>>vectors for the next iteration, these will become more dense and memory use 
>>>will increase substantially.
>>>- I can't figure out what might be causing your second exception. It is 
>>>bombing inside of Hadoop file IO and this causes me to suspect command 
>>>argument problems.
>>>
>>>Hope this helps,
>>>Jeff
>>>
>>>
>>>-----Original Message-----
>>>From: Jeffrey [mailto:[email protected]]
>>>Sent: Wednesday, July 20, 2011 2:41 AM
>>>To: [email protected]
>>>Subject: fkmeans or Cluster Dumper not working?
>>>
>>>Hi,
>>>
>>>I am trying to generate clusters using the fkmeans command line tool from my 
>>>test data. Not sure if this is correct, as it only runs one iteration 
>>>(output from 0.6-snapshot, gotta use some workaround to some weird bug - 
>>>http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
>>> )
>>>
>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
>>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>>11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
>>>11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
>>>11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
>>>11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
>>>11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
>>>11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
>>>11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
>>>11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
>>>11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
>>>11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
>>>11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
>>>11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
>>>11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
>>>11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
>>>11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
>>>11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
>>>11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
>>>11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
>>>11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
>>>11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
>>>11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
>>>11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
>>>11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
>>>11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
>>>11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
>>>11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
>>>11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
>>>11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
>>>11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
>>>11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
>>>11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
>>>11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
>>>11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
>>>11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
>>>11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
>>>11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
>>>11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
>>>11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
>>>11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
>>>11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
>>>11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
>>>11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
>>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>>11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>>>11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>>>11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>>>11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>>>11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized bytes=2246233
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>>>11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>>
>>>If I increase the --numClusters argument (e.g. to 50), the job throws an 
>>>exception after reaching
>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>
>>>and the task is then retried (also reproducible using 0.6-snapshot)
>>>
>>>...
>>>11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
>>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>>        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>        at java.security.AccessController.doPrivileged(Native Method)
>>>        at javax.security.auth.Subject.doAs(Subject.java:416)
>>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>
>>>11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
>>>...
>>>
>>>Then I ran the cluster dumper to dump information about the clusters. This 
>>>command works if I only care about the cluster centroids (with both the 0.5 
>>>release and 0.6-snapshot):
>>>
>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>>
>>>But if I want to see the degree of membership of each point, I get another 
>>>exception (yes, reproducible with both the 0.5 release and 0.6-snapshot):
>>>
>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>>>Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>>        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>>erm, would writing a short program to call the API (btw, can't seem to find 
>>>the latest API doc?) be a better choice here? Or did I do anything wrong 
>>>here (yes, Java is not my main language, and I am very new to Mahout.. and 
>>>h)?
>>>
>>>the data is converted from an arff file with about 1000 rows (resource) and 
>>>14k columns (tag), and it is just a subset of my data. (actually made a 
>>>mistake so it is now generating resource clusters instead of tag clusters, 
>>>but I am just doing this as a proof of concept whether mahout is good enough 
>>>for the task)
>>>
>>>Best wishes,
>>>Jeffrey04
