No worries :)


>________________________________
>From: Jeff Eastman <[email protected]>
>To: "[email protected]" <[email protected]>; Jeffrey 
><[email protected]>
>Sent: Tuesday, July 26, 2011 12:30 AM
>Subject: RE: fkmeans or Cluster Dumper not working?
>
>Sorry, I was traveling over the weekend. I will take a look at your data asap.
>
>-----Original Message-----
>From: Jeffrey [mailto:[email protected]]
>Sent: Sunday, July 24, 2011 3:51 AM
>To: [email protected]
>Subject: Re: fkmeans or Cluster Dumper not working?
>
>Erm, is there any update? Is the problem reproducible?
>
>Best wishes,
>Jeffrey04
>
>
>
>>________________________________
>>From: Jeffrey <[email protected]>
>>To: Jeff Eastman <[email protected]>; "[email protected]" 
>><[email protected]>
>>Sent: Friday, July 22, 2011 12:40 AM
>>Subject: Re: fkmeans or Cluster Dumper not working?
>>
>>
>>Hi Jeff,
>>
>>
>>lol, this is probably my last reply before I fall asleep (GMT+8 here).
>>
>>
>>First things first: the data file is here: http://coolsilon.com/image-tag.mvc
>>
>>
>>Q: What is the cardinality of your vector data?
>>A: about 1000+ rows (resources) * 14,000+ columns (tags)
>>Q: Is it sparse or dense?
>>A: sparse (assuming sparse means each vector contains mostly zeros)
>>Q: How many vectors are you trying to cluster?
>>A: all of them? (1000+ rows)
>>Q: What is the exact error you see when fkmeans fails with k=10? With k=50?
>>A: I think I posted the exception for k=50 earlier, but I will post it again 
>>here
>>
>>
>>With k=10, fkmeans actually works, but the cluster dumper throws an 
>>exception; however, if I take out --pointsDir, it works (the output looks 
>>OK, but without all the points).
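>>
>>(A guess on my part: the stack trace below shows the OOME coming from 
>>DenseVector.clone inside ClusterDumper.readPoints, which reads all of the 
>>clustered points into memory, so ~1000 points at ~14,000 doubles each would 
>>need well over 100 MB of heap if they are materialized as dense vectors. A 
>>minimal Java sketch of the dense/sparse difference, using the Mahout math 
>>API; VectorFootprint is just an illustrative name:)
>>
>>    import org.apache.mahout.math.DenseVector;
>>    import org.apache.mahout.math.RandomAccessSparseVector;
>>    import org.apache.mahout.math.Vector;
>>
>>    public class VectorFootprint {
>>      public static void main(String[] args) {
>>        int cardinality = 14000;
>>        // Dense: backed by a double[14000], roughly 112 KB per vector,
>>        // no matter how many entries are actually non-zero.
>>        Vector dense = new DenseVector(cardinality);
>>        dense.set(42, 1.0);
>>        // Sparse: stores only the non-zero entries, so a point with a
>>        // handful of tags stays tiny.
>>        Vector sparse = new RandomAccessSparseVector(cardinality);
>>        sparse.set(42, 1.0);
>>        System.out.println(dense.getClass() + " vs " + sparse.getClass());
>>      }
>>    }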
>>
>>
>>    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>--overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>    ...
>>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 
>>--pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>    Running on hadoop, using 
>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>    MAHOUT-JOB: 
>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>    11/07/22 00:14:50 INFO common.AbstractJob: Command line arguments: 
>>{--dictionaryType=text, --endPhase=2147483647, 
>>--output=image-tag-clusters.txt, --pointsDir=sensei/clusters/clusteredPoints, 
>>--seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>            at java.lang.Object.clone(Native Method)
>>            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
>>            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
>>            at 
>>org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94)
>>            at 
>>org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55)
>>            at 
>>org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>>            at 
>>org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>>            at 
>>org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>>            at 
>>org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>>            at 
>>com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>>            at 
>>com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>>            at 
>>com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>>            at 
>>com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>>            at 
>>org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255)
>>            at 
>>org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>            at 
>>org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>            at 
>>org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>            at 
>>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>            at 
>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>            at java.lang.reflect.Method.invoke(Method.java:616)
>>            at 
>>org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>            at 
>>org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>            at 
>>org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>            at 
>>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>            at 
>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>            at java.lang.reflect.Method.invoke(Method.java:616)
>>            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
>>image-tag-clusters.txt
>>    Running on hadoop, using 
>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>    MAHOUT-JOB: 
>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>    11/07/22 00:19:04 INFO common.AbstractJob: Command line arguments: 
>>{--dictionaryType=text, --endPhase=2147483647, 
>>--output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, 
>>--startPhase=0, --tempDir=temp}
>>    11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 ms
>>
>>
>>With k=50, fkmeans throws an exception after map 100% reduce 0%, and then 
>>retries (map 0% reduce 0%) after the exception:
>>
>>
>>    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>--overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>    Running on hadoop, using 
>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>    MAHOUT-JOB: 
>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>    11/07/22 00:21:07 INFO common.AbstractJob: Command line arguments: 
>>{--clustering=null, --clusters=sensei/clusters/clusters-0, 
>>--convergenceDelta=0.5, 
>>--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>> --emitMostLikely=false, --endPhase=2147483647, 
>>--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, 
>>--numClusters=50, --output=sensei/clusters, --overwrite=null, --startPhase=0, 
>>--tempDir=temp, --threshold=0}
>>    11/07/22 00:21:09 INFO common.HadoopUtil: Deleting sensei/clusters
>>    11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the native-hadoop 
>>library
>>    11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded & 
>>initialized native-zlib library
>>    11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new compressor
>>    11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new decompressor
>>    11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 vectors to 
>>sensei/clusters/clusters-0/part-randomSeed
>>    11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means 
>>Iteration 1
>>    11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths to 
>>process : 1
>>    11/07/22 00:21:26 INFO mapred.JobClient: Running job: 
>>job_201107211512_0029
>>    11/07/22 00:21:27 INFO mapred.JobClient:  map 0% reduce 0%
>>    11/07/22 00:22:08 INFO mapred.JobClient:  map 1% reduce 0%
>>    ...
>>    11/07/22 00:34:56 INFO mapred.JobClient:  map 99% reduce 0%
>>    11/07/22 00:35:02 INFO mapred.JobClient:  map 100% reduce 0%
>>    11/07/22 00:35:07 INFO mapred.JobClient: Task Id : 
>>attempt_201107211512_0029_m_000000_0, Status : FAILED
>>    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
>>valid local directory for output/file.out
>>            at 
>>org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>            at 
>>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>            at 
>>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>            at 
>>org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>            at 
>>org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>            at 
>>org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>            at 
>>org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>            at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>            at java.security.AccessController.doPrivileged(Native Method)
>>            at javax.security.auth.Subject.doAs(Subject.java:416)
>>            at 
>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>            at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>
>>
>>    11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
>>    ...
>>
>>
>>Q: What are the Hadoop heap settings you are using for your job?
>>A: I am new to Hadoop and not sure where to find those, but I got this from 
>>localhost:50070; is it right?
>>147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB / 
>>966.69 MB (3%)
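>>
>>(If I understand correctly, that 966 MB figure from localhost:50070 is the 
>>NameNode's own heap, not the job heap. Assuming the stock Hadoop 0.20 and 
>>Mahout launcher scripts -- I haven't verified all of these myself -- the 
>>relevant knobs would be:)
>>
>>    # Heap (in MB) for the client JVM that bin/mahout launches; clusterdump
>>    # runs there, so this is the heap the OOME above is hitting.
>>    $ export MAHOUT_HEAPSIZE=2000
>>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 \
>>          --pointsDir sensei/clusters/clusteredPoints \
>>          --output image-tag-clusters.txt
>>
>>    # Heap for each map/reduce task JVM: mapred.child.java.opts in
>>    # conf/mapred-site.xml (stock default is -Xmx200m), e.g.:
>>    #   <property>
>>    #     <name>mapred.child.java.opts</name>
>>    #     <value>-Xmx1024m</value>
>>    #   </property>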
>>
>>
>>P.S.: I keep forgetting to include my operating environment, sorry. I 
>>basically run this in a guest operating system (in a VirtualBox virtual 
>>machine) assigned 1 CPU core and 1.5 GB of memory. The host operating 
>>system is OS X 10.6.8 running on alubook (a late-2008 MacBook) with 4 GB of 
>>memory.
>>
>>
>>    $ cat /etc/*-release
>>    DISTRIB_ID=Ubuntu
>>    DISTRIB_RELEASE=11.04
>>    DISTRIB_CODENAME=natty
>>    DISTRIB_DESCRIPTION="Ubuntu 11.04"
>>    $ uname -a
>>    Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC 
>>2011 i686 i686 i386 GNU/Linux
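>>
>>(And since the k=50 failure is a DiskChecker$DiskErrorException on 
>>output/file.out, I suspect the map spill is filling the VM's small disk. A 
>>quick way to check, assuming the default layout where mapred.local.dir 
>>lives under hadoop.tmp.dir:)
>>
>>    # free space on the partition holding Hadoop's local/tmp dirs
>>    $ df -h /tmp
>>    # confirm where the tasktracker actually spills map output
>>    $ grep -A1 "mapred.local.dir\|hadoop.tmp.dir" \
>>          /home/jeffrey04/Applications/hadoop-0.20.203.0/conf/*.xml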
>>
>>
>>Best wishes,
>>Jeffrey04
>>
>>>________________________________
>>>From: Jeff Eastman <[email protected]>
>>>To: "[email protected]" <[email protected]>; Jeffrey 
>>><[email protected]>
>>>Sent: Thursday, July 21, 2011 11:54 PM
>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>
>>>Excellent, so this appears to be localized to fuzzyk. Unfortunately, the 
>>>Apache mail server strips off attachments so you'd need another mechanism (a 
>>>JIRA?) to upload your data if it is not too large. Some more questions in 
>>>the interim:
>>>
>>>- What is the cardinality of your vector data?
>>>- Is it sparse or dense?
>>>- How many vectors are you trying to cluster?
>>>- What is the exact error you see when fkmeans fails with k=10? With k=50?
>>>- What are the Hadoop heap settings you are using for your job?
>>>
>>>-----Original Message-----
>>>From: Jeffrey [mailto:[email protected]]
>>>Sent: Thursday, July 21, 2011 11:21 AM
>>>To: [email protected]
>>>Subject: Re: fkmeans or Cluster Dumper not working?
>>>
>>>Hi Jeff,
>>>
>>>Q: Did you change your invocation to specify a different -c directory (e.g. 
>>>clusters-0)?
>>>A: Yes :)
>>>
>>>Q: Did you add the -cl argument?
>>>A: Yes :)
>>>
>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>>--overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>>--overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering 
>>>--overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>>
>>>Q: What is the new CLI invocation for clusterdump?
>>>A:
>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir 
>>>sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>>
>>>
>>>Q: Did this work for -k 10? What happens with -k 50?
>>>A: works for k=5 (but I don't see the points), but not for k=10; fkmeans 
>>>itself fails when k=50, so I can't dump at k=50
>>>
>>>Q: Have you tried kmeans?
>>>A: Yes (all tested on 0.6-snapshot)
>>>
>>>k=5: no problem :)
>>>k=10: no problem :)
>>>k=50: no problem :)
>>>
>>>P.S.: attached is the test data I used (in MVC format); let me know if you 
>>>prefer the raw data in ARFF format
>>>
>>>Best wishes,
>>>Jeffrey04
>>>
>>>
>>>
>>>>________________________________
>>>>From: Jeff Eastman <[email protected]>
>>>>To: "[email protected]" <[email protected]>; Jeffrey 
>>>><[email protected]>
>>>>Sent: Thursday, July 21, 2011 9:36 PM
>>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>>
>>>>You are correct, the wiki page for fkmeans did not mention the -cl 
>>>>argument; I've added it just now. I think this is what Frank means in his 
>>>>comment, but you do *not* have to write any custom code to get the cluster 
>>>>dumper to do what you want: just use the -cl argument and specify 
>>>>clusteredPoints as the -p input to clusterdump.
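>>>>
>>>>For example (a sketch assembled from the commands earlier in this thread; 
>>>>substitute clusters-N with your final iteration directory):
>>>>
>>>>    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc \
>>>>        --output sensei/clusters --clusters sensei/clusters/clusters-0 \
>>>>        --clustering --overwrite --numClusters 10 --maxIter 10 --m 5
>>>>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-N \
>>>>        --pointsDir sensei/clusters/clusteredPoints \
>>>>        --output image-tag-clusters.txt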
>>>>
>>>>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how 
>>>>to invoke the clustering and cluster dumper from Java at least.
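>>>>
>>>>In Java, a minimal sketch along the lines of those tests (the ClusterDumper 
>>>>constructor and printClusters call are as I recall them from the test code; 
>>>>double-check against the 0.6-SNAPSHOT sources):
>>>>
>>>>    import org.apache.hadoop.fs.Path;
>>>>    import org.apache.mahout.utils.clustering.ClusterDumper;
>>>>
>>>>    public class DumpClusters {
>>>>      public static void main(String[] args) throws Exception {
>>>>        // Final clusters from the last fkmeans iteration.
>>>>        Path seqFileDir = new Path("sensei/clusters/clusters-1");
>>>>        // Points written by the -cl (--clustering) step.
>>>>        Path pointsDir = new Path("sensei/clusters/clusteredPoints");
>>>>        ClusterDumper dumper = new ClusterDumper(seqFileDir, pointsDir);
>>>>        dumper.printClusters(null); // null = no term dictionary
>>>>      }
>>>>    }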
>>>>
>>>>Did you change your invocation to specify a different -c directory (e.g. 
>>>>clusters-0)?
>>>>Did you add the -cl argument?
>>>>What is the new CLI invocation for clusterdump?
>>>>Did this work for -k 10? What happens with -k 50?
>>>>Have you tried kmeans?
>>>>
>>>>I can help you better if you give me answers to my questions.
>>>>
>>>>-----Original Message-----
>>>>From: Jeffrey [mailto:[email protected]]
>>>>Sent: Thursday, July 21, 2011 4:30 AM
>>>>To: [email protected]
>>>>Subject: Re: fkmeans or Cluster Dumper not working?
>>>>
>>>>Hi again,
>>>>
>>>>Let me give an update on what's working and what's not.
>>>>
>>>>Works:
>>>>fkmeans clustering (10 clusters) - thanks Jeff for the -cl tip
>>>>fkmeans clustering (5 clusters)
>>>>clusterdump (5 clusters) - so points are not included in the clusterdump 
>>>>and I need to write a program for it?
>>>>
>>>>Not Working:
>>>>fkmeans clustering (50 clusters) - same error
>>>>clusterdump (10 clusters) - same error
>>>>
>>>>
>>>>So, to attach points to the cluster dumper output like the synthetic 
>>>>control example does, I would have to write some code, as pointed out by 
>>>>@Frank_Scholten? 
>>>>https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>>>
>>>>Best wishes,
>>>>Jeffrey04
>>>>
>>>>>________________________________
>>>>>From: Jeff Eastman <[email protected]>
>>>>>To: "[email protected]" <[email protected]>; Jeffrey 
>>>>><[email protected]>
>>>>>Sent: Wednesday, July 20, 2011 11:53 PM
>>>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>
>>>>>Hi Jeffrey,
>>>>>
>>>>>It is always difficult to debug remotely, but here are some suggestions:
>>>>>- First, you are specifying both an input clusters directory (--clusters) 
>>>>>and --numClusters, so the job is sampling 10 points from your input data 
>>>>>set and writing them to clusteredPoints as the prior clusters for the 
>>>>>first iteration. You should pick a different name for this directory, as 
>>>>>the clusteredPoints directory is used by the -cl (--clustering) option 
>>>>>(which you did not supply) to write out the clustered (classified) input 
>>>>>vectors. When you subsequently supplied clusteredPoints to the 
>>>>>clusterdumper, it was expecting a different format and that caused the 
>>>>>exception you saw. Change your --clusters directory (clusters-0 is good) 
>>>>>and add a -cl argument, and things should go more smoothly. The -cl 
>>>>>option is not the default, so no clustering of the input points is 
>>>>>performed without it. (Many people get caught by this and perhaps the 
>>>>>default should be changed, but clustering can be expensive and so it is 
>>>>>not performed without request.)
>>>>>- If you still have problems, try again with k-means. It is similar 
>>>>>enough to fkmeans that, if you see the same problems with k-means, it 
>>>>>will rule out fkmeans itself.
>>>>>- I don't see why changing the -k argument from 10 to 50 should cause any 
>>>>>problems, unless your vectors are very large and you are getting an OOME 
>>>>>in the reducer. Since the reducer is calculating centroid vectors for the 
>>>>>next iteration, these will become more dense and memory will increase 
>>>>>substantially (see the rough sketch after this list).
>>>>>- I can't figure out what might be causing your second exception. It is 
>>>>>bombing inside of Hadoop file IO, and this causes me to suspect command 
>>>>>argument problems.
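>>>>>
>>>>>A rough back-of-the-envelope sketch, using the Map output counters from 
>>>>>your k=10 log further down this thread (the 8-bytes-per-double dense 
>>>>>representation is an assumption):
>>>>>
>>>>>    dense vector of ~14,000 doubles ~= 14,000 * 8 B ~= 112 KB
>>>>>    k=10: Map output bytes=2499995001 / Map output records=11130
>>>>>          (1113 points * 10 clusters) ~= 225 KB per record
>>>>>    k=50: 1113 points * 50 clusters = 55,650 records * ~225 KB ~= 12.5 GB
>>>>>          of map output to spill to local disk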
>>>>>
>>>>>Hope this helps,
>>>>>Jeff
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Jeffrey [mailto:[email protected]]
>>>>>Sent: Wednesday, July 20, 2011 2:41 AM
>>>>>To: [email protected]
>>>>>Subject: fkmeans or Cluster Dumper not working?
>>>>>
>>>>>Hi,
>>>>>
>>>>>I am trying to generate clusters from my test data using the fkmeans 
>>>>>command line tool. Not sure if this is correct, as it only runs one 
>>>>>iteration (output from 0.6-snapshot; I have to use a workaround for a 
>>>>>weird bug - 
>>>>>http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
>>>>> )
>>>>>
>>>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>>>>sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 
>>>>>--numClusters 10 --overwrite --m 5
>>>>>Running on hadoop, using 
>>>>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>MAHOUT-JOB: 
>>>>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: 
>>>>>{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, 
>>>>>--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>>>> --emitMostLikely=true, --endPhase=2147483647, 
>>>>>--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, 
>>>>>--method=mapreduce, --numClusters=10, --output=sensei/clusters, 
>>>>>--overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>>>>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop 
>>>>>library
>>>>>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>>>>>native-zlib library
>>>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>>>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to 
>>>>>sensei/clusteredPoints/part-randomSeed
>>>>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means 
>>>>>Iteration 1
>>>>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process 
>>>>>: 1
>>>>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>>>>11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
>>>>>...
>>>>>11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
>>>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
>>>>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: 
>>>>>job_201107201152_0021
>>>>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all 
>>>>>reduces waiting after reserving slots (ms)=0
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps 
>>>>>waiting after reserving slots (ms)=0
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized 
>>>>>bytes=2246233
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>>>>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>>>>
>>>>>If I increase the --numClusters argument (e.g. to 50), it returns an 
>>>>>exception after
>>>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>
>>>>>and then retries (also reproducible using 0.6-snapshot):
>>>>>
>>>>>...
>>>>>11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : 
>>>>>attempt_201107201152_0022_m_000000_0, Status : FAILED
>>>>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
>>>>>valid local directory for output/file.out
>>>>>        at 
>>>>>org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>>>        at 
>>>>>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>>>        at 
>>>>>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>>>        at 
>>>>>org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>>>        at 
>>>>>org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>>>        at 
>>>>>org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>>>        at 
>>>>>org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>>        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>>        at java.security.AccessController.doPrivileged(Native Method)
>>>>>        at javax.security.auth.Subject.doAs(Subject.java:416)
>>>>>        at 
>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>        at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>>
>>>>>11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>...
>>>>>
>>>>>Then I ran the cluster dumper to dump information about the clusters. 
>>>>>This command works if I only care about the cluster centroids (both the 
>>>>>0.5 release and 0.6-snapshot):
>>>>>
>>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
>>>>>image-tag-clusters.txt
>>>>>Running on hadoop, using 
>>>>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>MAHOUT-JOB: 
>>>>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: 
>>>>>{--dictionaryType=text, --endPhase=2147483647, 
>>>>>--output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, 
>>>>>--startPhase=0, --tempDir=temp}
>>>>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>>>>
>>>>>but if I want to see the degree of membership of each point, I get 
>>>>>another exception (yes, reproducible on both the 0.5 release and 
>>>>>0.6-snapshot):
>>>>>
>>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output 
>>>>>image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>>>Running on hadoop, using 
>>>>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>MAHOUT-JOB: 
>>>>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: 
>>>>>{--dictionaryType=text, --endPhase=2147483647, 
>>>>>--output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, 
>>>>>--seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop 
>>>>>library
>>>>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>>>>>native-zlib library
>>>>>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>>>>>Exception in thread "main" java.lang.ClassCastException: 
>>>>>org.apache.hadoop.io.Text cannot be cast to 
>>>>>org.apache.hadoop.io.IntWritable
>>>>>        at 
>>>>>org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>>>        at 
>>>>>org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>>        at 
>>>>>org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>>        at 
>>>>>org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>        at 
>>>>>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>        at 
>>>>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>        at 
>>>>>org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>        at 
>>>>>org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>        at 
>>>>>org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>        at 
>>>>>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>        at 
>>>>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>
>>>>>Erm, would writing a short program to call the API (by the way, I can't 
>>>>>seem to find the latest API doc?) be a better choice here? Or did I do 
>>>>>something wrong (yes, Java is not my main language, and I am very new to 
>>>>>Mahout... and Hadoop)?
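>>>>>
>>>>>(If it does come to writing code, here is the kind of minimal sketch I 
>>>>>have in mind: stream the clusteredPoints sequence file one record at a 
>>>>>time instead of buffering everything. The IntWritable key type is what 
>>>>>ClusterDumper expects per the exception above; the part-m-00000 file 
>>>>>name is an assumption:)
>>>>>
>>>>>    import org.apache.hadoop.conf.Configuration;
>>>>>    import org.apache.hadoop.fs.FileSystem;
>>>>>    import org.apache.hadoop.fs.Path;
>>>>>    import org.apache.hadoop.io.IntWritable;
>>>>>    import org.apache.hadoop.io.SequenceFile;
>>>>>    import org.apache.mahout.clustering.WeightedVectorWritable;
>>>>>
>>>>>    public class PrintClusteredPoints {
>>>>>      public static void main(String[] args) throws Exception {
>>>>>        Configuration conf = new Configuration();
>>>>>        Path path = new Path("sensei/clusteredPoints/part-m-00000");
>>>>>        SequenceFile.Reader reader =
>>>>>            new SequenceFile.Reader(FileSystem.get(conf), path, conf);
>>>>>        IntWritable clusterId = new IntWritable();  // key: cluster id
>>>>>        WeightedVectorWritable point = new WeightedVectorWritable();
>>>>>        while (reader.next(clusterId, point)) {
>>>>>          // One record at a time, so the whole set never sits on the heap.
>>>>>          System.out.println(clusterId.get() + "\t" + point);
>>>>>        }
>>>>>        reader.close();
>>>>>      }
>>>>>    }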
>>>>>
>>>>>The data is converted from an ARFF file with about 1000 rows (resources) 
>>>>>and 14k columns (tags), and it is just a subset of my data. (I actually 
>>>>>made a mistake, so it is now generating resource clusters instead of tag 
>>>>>clusters, but I am just doing this as a proof of concept to see whether 
>>>>>Mahout is good enough for the task.)
>>>>>
>>>>>Best wishes,
>>>>>Jeffrey04