Hi Lance,

got the shell script working already, thanks :)

I'm actually still looking for a workaround to the original problem. If dumping 
takes that many resources, is there a way to do it so that it won't end up in 
an OME (or at least reduce the chance)?
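
For example, would streaming the clustered points with seqdumper (instead of 
letting clusterdump load everything into memory at once) be a sane direction? 
Something like this -- though I'm going from memory on the flag names and the 
part file name, so I'd double check with bin/mahout seqdumper --help first:

    $ bin/mahout seqdumper --seqFile sensei/clusters/clusteredPoints/part-m-00000 --output image-tag-points.txt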

Best wishes,
Jeffrey04


----- Original Message -----
> From: Lance Norskog <[email protected]>
> To: [email protected]; Jeffrey <[email protected]>
> Cc: 
> Sent: Wednesday, July 27, 2011 4:15 PM
> Subject: Re: fkmeans or Cluster Dumper not working?
> 
> The fix got checked in this afternoon. The problem is that a line in
> the shell script surrounds mahout-examples-*.job with quotes. This
> prevents the shell from glob-expanding the wildcard to find the actual job file.
> 
> Look in the bin/mahout shell script, around line 127.
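> 
> For illustration only (the exact variable name and line in the script may
> differ), the effect of the quoting is roughly this:
> 
>     # the pattern inside quotes is never expanded, so the variable ends up
>     # holding a literal '*' instead of the real file name:
>     MAHOUT_JOB="$MAHOUT_HOME/examples/target/mahout-examples-*.job"
>     # one way to let the glob resolve to the actual job jar:
>     MAHOUT_JOB=$(ls $MAHOUT_HOME/examples/target/mahout-examples-*.job | head -n 1)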
> 
> On 7/27/11, Jeffrey <[email protected]> wrote:
>>  erm, is there any workaround to the problem?
>> 
>> 
>>  ----- Original Message -----
>>>  From: Jeff Eastman <[email protected]>
>>>  To: "[email protected]" <[email protected]>
>>>  Cc:
>>>  Sent: Tuesday, July 26, 2011 1:12 PM
>>>  Subject: RE: fkmeans or Cluster Dumper not working?
>>> 
>>>  Also makes sense that fuzzyk centroids would be completely dense, since
>>>  every
>>>  point is a member of every cluster. My reducer heaps are 4G.
>>> 
>>>  -----Original Message-----
>>>  From: Jeff Eastman [mailto:[email protected]]
>>>  Sent: Monday, July 25, 2011 2:32 PM
>>>  To: [email protected]; Jeffrey
>>>  Subject: RE: fkmeans or Cluster Dumper not working?
>>> 
>>>  I'm able to run fuzzyk on your data set with k=10 and k=50 without
>>>  problems.
>>>  I also ran it fine with k=100 just to push it a bit harder. Runs took
>>>  longer as
>>>  k increased as expected (39s, 2m50s, 5m57s) as did the clustering (11s,
>>>  45s,
>>>  1m11s). The cluster dumper is throwing an OME with your data points and
>>>  probably
>>>  also with the larger cluster volumes, suggesting it needs a larger -Xmx
>>>  value
>>>  since it is running locally and not influenced by the cluster vm
>>>  parameters.
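>>> 
>>>  For example -- just a sketch, and worth checking which of these variables
>>>  your bin/mahout and bin/hadoop scripts actually honor -- something like
>>> 
>>>      $ export HADOOP_HEAPSIZE=2000
>>>      $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>> 
>>>  (or MAHOUT_HEAPSIZE when not going through bin/hadoop) should give the
>>>  local dumper JVM a bigger heap than the default.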
>>> 
>>>  I will try some more and keep you updated.
>>> 
>>>  The cluster dumper is throwing an OME trying to inhale all your data
>>>  points. It
>>>  is running locally
>>> 
>>>  -----Original Message-----
>>>  From: Jeffrey [mailto:[email protected]]
>>>  Sent: Sunday, July 24, 2011 12:51 AM
>>>  To: [email protected]
>>>  Subject: Re: fkmeans or Cluster Dumper not working?
>>> 
>>>  Erm, is there any update? is the problem reproducible?
>>> 
>>>  Best wishes,
>>>  Jeffrey04
>>> 
>>> 
>>> 
>>>>  ________________________________
>>>>  From: Jeffrey <[email protected]>
>>>>  To: Jeff Eastman <[email protected]>;
>>>  "[email protected]" <[email protected]>
>>>>  Sent: Friday, July 22, 2011 12:40 AM
>>>>  Subject: Re: fkmeans or Cluster Dumper not working?
>>>> 
>>>> 
>>>>  Hi Jeff,
>>>> 
>>>> 
>>>>  lol, this is probably my last reply before I fall asleep (GMT+8 here).
>>>> 
>>>> 
>>>>  First things first, the data file is here: http://coolsilon.com/image-tag.mvc
>>>> 
>>>> 
>>>>  Q: What is the cardinality of your vector data?
>>>>  about 1000+ rows (resources) * 14 000+ columns (tags)
>>>>  Q: Is it sparse or dense?
>>>>  sparse (assuming sparse = each vector contains mostly 0)
>>>>  Q: How many vectors are you trying to cluster?
>>>>  all of them? (1000+ rows)
>>>>  Q: What is the exact error you see when fkmeans fails with k=10? 
> With
>>>>  k=50?
>>>>  i think i posted the exception when k=50, but will post them again 
> here
>>>> 
>>>> 
>>>>  k=10: fkmeans actually works, but the cluster dumper throws an exception;
>>>>  however, if I take out --pointsDir, then it works (the output looks OK,
>>>>  but without all the points)
>>>> 
>>>> 
>>>>      $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output
>>>  sensei/clusters --clusters sensei/clusters/clusters-0 --clustering
>>>  --overwrite
>>>  --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>>      ...
>>>>      $ bin/mahout clusterdump --seqFileDir 
> sensei/clusters/clusters-1
>>>  --pointsDir sensei/clusters/clusteredPoints --output
>>>  image-tag-clusters.txt
>>>  Running on hadoop, using
>>>  HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>      
> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>      MAHOUT-JOB:
>>> 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>      11/07/22 00:14:50 INFO common.AbstractJob: Command line 
> arguments:
>>>  {--dictionaryType=text, --endPhase=2147483647,
>>>  --output=image-tag-clusters.txt,
>>>  --pointsDir=sensei/clusters/clusteredPoints,
>>>  --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, 
> --tempDir=temp}
>>>>      Exception in thread "main" 
> java.lang.OutOfMemoryError: Java
>>>  heap space
>>>>              at java.lang.Object.clone(Native Method)
>>>>              at
>>>  org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
>>>>              at
>>>  org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
>>>>              at
>>> 
> org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94)
>>>>              at
>>> 
> org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55)
>>>>              at
>>> 
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>>>>              at
>>>  org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>>>>              at
>>> 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>>>>              at
>>> 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>>>>              at
>>> 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>>>>              at
>>> 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>>>>              at
>>>  com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>>>>              at
>>> 
> com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>>>>              at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255)
>>>>              at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>              at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>              at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>  Method)
>>>>              at
>>> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>              at
>>> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>              at java.lang.reflect.Method.invoke(Method.java:616)
>>>>              at
>>> 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>              at
>>>  org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>              at
>>>  org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>  Method)
>>>>              at
>>> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>              at
>>> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>              at java.lang.reflect.Method.invoke(Method.java:616)
>>>>              at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>      $ bin/mahout clusterdump --seqFileDir 
> sensei/clusters/clusters-1
>>>  --output image-tag-clusters.txt Running on hadoop, using
>>>  HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>      
> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>      MAHOUT-JOB:
>>> 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>      11/07/22 00:19:04 INFO common.AbstractJob: Command line 
> arguments:
>>>  {--dictionaryType=text, --endPhase=2147483647,
>>>  --output=image-tag-clusters.txt,
>>>  --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, 
> --tempDir=temp}
>>>>      11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 
> ms
>>>> 
>>>> 
>>>>  k=50, fkmeans shows exception after map 100% reduce 0%, and would 
> retry
>>>>  (map
>>>  0% reduce 0%) after the exception
>>>> 
>>>> 
>>>>      $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output
>>>  sensei/clusters --clusters sensei/clusters/clusters-0 --clustering
>>>  --overwrite
>>>  --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>>>      Running on hadoop, using
>>>  HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>      
> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>      MAHOUT-JOB:
>>> 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>      11/07/22 00:21:07 INFO common.AbstractJob: Command line 
> arguments:
>>>  {--clustering=null, --clusters=sensei/clusters/clusters-0,
>>>  --convergenceDelta=0.5,
>>> 
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>> 
>>>  --emitMostLikely=false, --endPhase=2147483647,
>>>  --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10,
>>>  --method=mapreduce,
>>>  --numClusters=50, --output=sensei/clusters, --overwrite=null,
>>>  --startPhase=0,
>>>  --tempDir=temp, --threshold=0}
>>>>      11/07/22 00:21:09 INFO common.HadoopUtil: Deleting 
> sensei/clusters
>>>>      11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the
>>>>  native-hadoop
>>>  library
>>>>      11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded 
> &
>>>  initialized native-zlib library
>>>>      11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new 
> compressor
>>>>      11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new 
> decompressor
>>>>      11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 
> vectors
>>>>  to
>>>  sensei/clusters/clusters-0/part-randomSeed
>>>>      11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy 
> K-Means
>>>  Iteration 1
>>>>      11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths 
> to
>>>  process : 1
>>>>      11/07/22 00:21:26 INFO mapred.JobClient: Running job:
>>>  job_201107211512_0029
>>>>      11/07/22 00:21:27 INFO mapred.JobClient:  map 0% reduce 0%
>>>>      11/07/22 00:22:08 INFO mapred.JobClient:  map 1% reduce 0%
>>>>      11/07/22 00:22:20 INFO mapred.JobClient:  map 2% reduce 0%
>>>>      11/07/22 00:22:33 INFO mapred.JobClient:  map 3% reduce 0%
>>>>      11/07/22 00:22:42 INFO mapred.JobClient:  map 4% reduce 0%
>>>>      11/07/22 00:22:50 INFO mapred.JobClient:  map 5% reduce 0%
>>>>      11/07/22 00:23:00 INFO mapred.JobClient:  map 6% reduce 0%
>>>>      11/07/22 00:23:09 INFO mapred.JobClient:  map 7% reduce 0%
>>>>      11/07/22 00:23:18 INFO mapred.JobClient:  map 8% reduce 0%
>>>>      11/07/22 00:23:27 INFO mapred.JobClient:  map 9% reduce 0%
>>>>      11/07/22 00:23:33 INFO mapred.JobClient:  map 10% reduce 0%
>>>>      11/07/22 00:23:42 INFO mapred.JobClient:  map 11% reduce 0%
>>>>      11/07/22 00:23:45 INFO mapred.JobClient:  map 12% reduce 0%
>>>>      11/07/22 00:23:54 INFO mapred.JobClient:  map 13% reduce 0%
>>>>      11/07/22 00:24:03 INFO mapred.JobClient:  map 14% reduce 0%
>>>>      11/07/22 00:24:09 INFO mapred.JobClient:  map 15% reduce 0%
>>>>      11/07/22 00:24:15 INFO mapred.JobClient:  map 16% reduce 0%
>>>>      11/07/22 00:24:24 INFO mapred.JobClient:  map 17% reduce 0%
>>>>      11/07/22 00:24:30 INFO mapred.JobClient:  map 18% reduce 0%
>>>>      11/07/22 00:24:42 INFO mapred.JobClient:  map 19% reduce 0%
>>>>      11/07/22 00:24:51 INFO mapred.JobClient:  map 20% reduce 0%
>>>>      11/07/22 00:24:57 INFO mapred.JobClient:  map 21% reduce 0%
>>>>      11/07/22 00:25:06 INFO mapred.JobClient:  map 22% reduce 0%
>>>>      11/07/22 00:25:09 INFO mapred.JobClient:  map 23% reduce 0%
>>>>      11/07/22 00:25:19 INFO mapred.JobClient:  map 24% reduce 0%
>>>>      11/07/22 00:25:25 INFO mapred.JobClient:  map 25% reduce 0%
>>>>      11/07/22 00:25:31 INFO mapred.JobClient:  map 26% reduce 0%
>>>>      11/07/22 00:25:37 INFO mapred.JobClient:  map 27% reduce 0%
>>>>      11/07/22 00:25:43 INFO mapred.JobClient:  map 28% reduce 0%
>>>>      11/07/22 00:25:51 INFO mapred.JobClient:  map 29% reduce 0%
>>>>      11/07/22 00:25:58 INFO mapred.JobClient:  map 30% reduce 0%
>>>>      11/07/22 00:26:04 INFO mapred.JobClient:  map 31% reduce 0%
>>>>      11/07/22 00:26:10 INFO mapred.JobClient:  map 32% reduce 0%
>>>>      11/07/22 00:26:19 INFO mapred.JobClient:  map 33% reduce 0%
>>>>      11/07/22 00:26:25 INFO mapred.JobClient:  map 34% reduce 0%
>>>>      11/07/22 00:26:34 INFO mapred.JobClient:  map 35% reduce 0%
>>>>      11/07/22 00:26:40 INFO mapred.JobClient:  map 36% reduce 0%
>>>>      11/07/22 00:26:49 INFO mapred.JobClient:  map 37% reduce 0%
>>>>      11/07/22 00:26:55 INFO mapred.JobClient:  map 38% reduce 0%
>>>>      11/07/22 00:27:04 INFO mapred.JobClient:  map 39% reduce 0%
>>>>      11/07/22 00:27:14 INFO mapred.JobClient:  map 40% reduce 0%
>>>>      11/07/22 00:27:23 INFO mapred.JobClient:  map 41% reduce 0%
>>>>      11/07/22 00:27:28 INFO mapred.JobClient:  map 42% reduce 0%
>>>>      11/07/22 00:27:34 INFO mapred.JobClient:  map 43% reduce 0%
>>>>      11/07/22 00:27:40 INFO mapred.JobClient:  map 44% reduce 0%
>>>>      11/07/22 00:27:49 INFO mapred.JobClient:  map 45% reduce 0%
>>>>      11/07/22 00:27:56 INFO mapred.JobClient:  map 46% reduce 0%
>>>>      11/07/22 00:28:05 INFO mapred.JobClient:  map 47% reduce 0%
>>>>      11/07/22 00:28:11 INFO mapred.JobClient:  map 48% reduce 0%
>>>>      11/07/22 00:28:20 INFO mapred.JobClient:  map 49% reduce 0%
>>>>      11/07/22 00:28:26 INFO mapred.JobClient:  map 50% reduce 0%
>>>>      11/07/22 00:28:35 INFO mapred.JobClient:  map 51% reduce 0%
>>>>      11/07/22 00:28:41 INFO mapred.JobClient:  map 52% reduce 0%
>>>>      11/07/22 00:28:47 INFO mapred.JobClient:  map 53% reduce 0%
>>>>      11/07/22 00:28:53 INFO mapred.JobClient:  map 54% reduce 0%
>>>>      11/07/22 00:29:02 INFO mapred.JobClient:  map 55% reduce 0%
>>>>      11/07/22 00:29:08 INFO mapred.JobClient:  map 56% reduce 0%
>>>>      11/07/22 00:29:17 INFO mapred.JobClient:  map 57% reduce 0%
>>>>      11/07/22 00:29:26 INFO mapred.JobClient:  map 58% reduce 0%
>>>>      11/07/22 00:29:32 INFO mapred.JobClient:  map 59% reduce 0%
>>>>      11/07/22 00:29:41 INFO mapred.JobClient:  map 60% reduce 0%
>>>>      11/07/22 00:29:50 INFO mapred.JobClient:  map 61% reduce 0%
>>>>      11/07/22 00:29:53 INFO mapred.JobClient:  map 62% reduce 0%
>>>>      11/07/22 00:29:59 INFO mapred.JobClient:  map 63% reduce 0%
>>>>      11/07/22 00:30:09 INFO mapred.JobClient:  map 64% reduce 0%
>>>>      11/07/22 00:30:15 INFO mapred.JobClient:  map 65% reduce 0%
>>>>      11/07/22 00:30:23 INFO mapred.JobClient:  map 66% reduce 0%
>>>>      11/07/22 00:30:35 INFO mapred.JobClient:  map 67% reduce 0%
>>>>      11/07/22 00:30:41 INFO mapred.JobClient:  map 68% reduce 0%
>>>>      11/07/22 00:30:50 INFO mapred.JobClient:  map 69% reduce 0%
>>>>      11/07/22 00:30:56 INFO mapred.JobClient:  map 70% reduce 0%
>>>>      11/07/22 00:31:05 INFO mapred.JobClient:  map 71% reduce 0%
>>>>      11/07/22 00:31:15 INFO mapred.JobClient:  map 72% reduce 0%
>>>>      11/07/22 00:31:24 INFO mapred.JobClient:  map 73% reduce 0%
>>>>      11/07/22 00:31:30 INFO mapred.JobClient:  map 74% reduce 0%
>>>>      11/07/22 00:31:39 INFO mapred.JobClient:  map 75% reduce 0%
>>>>      11/07/22 00:31:42 INFO mapred.JobClient:  map 76% reduce 0%
>>>>      11/07/22 00:31:50 INFO mapred.JobClient:  map 77% reduce 0%
>>>>      11/07/22 00:31:59 INFO mapred.JobClient:  map 78% reduce 0%
>>>>      11/07/22 00:32:11 INFO mapred.JobClient:  map 79% reduce 0%
>>>>      11/07/22 00:32:28 INFO mapred.JobClient:  map 80% reduce 0%
>>>>      11/07/22 00:32:37 INFO mapred.JobClient:  map 81% reduce 0%
>>>>      11/07/22 00:32:40 INFO mapred.JobClient:  map 82% reduce 0%
>>>>      11/07/22 00:32:49 INFO mapred.JobClient:  map 83% reduce 0%
>>>>      11/07/22 00:32:58 INFO mapred.JobClient:  map 84% reduce 0%
>>>>      11/07/22 00:33:04 INFO mapred.JobClient:  map 85% reduce 0%
>>>>      11/07/22 00:33:13 INFO mapred.JobClient:  map 86% reduce 0%
>>>>      11/07/22 00:33:19 INFO mapred.JobClient:  map 87% reduce 0%
>>>>      11/07/22 00:33:32 INFO mapred.JobClient:  map 88% reduce 0%
>>>>      11/07/22 00:33:38 INFO mapred.JobClient:  map 89% reduce 0%
>>>>      11/07/22 00:33:47 INFO mapred.JobClient:  map 90% reduce 0%
>>>>      11/07/22 00:33:52 INFO mapred.JobClient:  map 91% reduce 0%
>>>>      11/07/22 00:34:01 INFO mapred.JobClient:  map 92% reduce 0%
>>>>      11/07/22 00:34:10 INFO mapred.JobClient:  map 93% reduce 0%
>>>>      11/07/22 00:34:13 INFO mapred.JobClient:  map 94% reduce 0%
>>>>      11/07/22 00:34:25 INFO mapred.JobClient:  map 95% reduce 0%
>>>>      11/07/22 00:34:31 INFO mapred.JobClient:  map 96% reduce 0%
>>>>      11/07/22 00:34:40 INFO mapred.JobClient:  map 97% reduce 0%
>>>>      11/07/22 00:34:47 INFO mapred.JobClient:  map 98% reduce 0%
>>>>      11/07/22 00:34:56 INFO mapred.JobClient:  map 99% reduce 0%
>>>>      11/07/22 00:35:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>      11/07/22 00:35:07 INFO mapred.JobClient: Task Id :
>>>  attempt_201107211512_0029_m_000000_0, Status : FAILED
>>>>      org.apache.hadoop.util.DiskChecker$DiskErrorException: Could 
> not find
>>>> 
>>>  any valid local directory for output/file.out
>>>>              at
>>> 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>>              at
>>> 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>>              at
>>> 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>>              at
>>> 
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>>              at
>>> 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>>              at
>>> 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>>              at
>>> 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>>              at
>>>  org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>>              at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>              at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>              at java.security.AccessController.doPrivileged(Native 
> Method)
>>>>              at javax.security.auth.Subject.doAs(Subject.java:416)
>>>>              at
>>> 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>              at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>> 
>>>> 
>>>>      11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
>>>>      ...
>>>> 
>>>> 
>>>>  Q: What are the Hadoop heap settings you are using for your job?
>>>>  I am new to Hadoop and not sure where to find those, but I got this from
>>>>  localhost:50070; is it right?
>>>>  147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB /
>>>>  966.69 MB (3%)
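>>>> 
>>>>  (or do you mean the per-task JVM heap? if so, I guess that would be the
>>>>  mapred.child.java.opts property in mapred-site.xml, e.g.
>>>> 
>>>>      $ grep -A 2 mapred.child.java.opts $HADOOP_HOME/conf/mapred-site.xml
>>>> 
>>>>  I don't think I've set it anywhere, so the tasks are presumably running
>>>>  with the stock default, which I believe is only -Xmx200m)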
>>>> 
>>>> 
>>>>  p/s: I keep forgetting to include my operating environment, sorry. I
>>>  basically run this in a guest operating system (in a VirtualBox virtual
>>>  machine) assigned 1 CPU core and 1.5GB of memory. The host operating
>>>  system is OS X 10.6.8 running on an alubook (MacBook, late 2008 model) with
>>>  4GB of memory.
>>>> 
>>>> 
>>>>      $ cat /etc/*-release
>>>>      DISTRIB_ID=Ubuntu
>>>>      DISTRIB_RELEASE=11.04
>>>>      DISTRIB_CODENAME=natty
>>>>      DISTRIB_DESCRIPTION="Ubuntu 11.04"
>>>>      $ uname -a
>>>>      Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 
> 15:05:41 UTC
>>>> 
>>>  2011 i686 i686 i386 GNU/Linux
>>>> 
>>>> 
>>>>  Best wishes,
>>>>  Jeffrey04
>>>> 
>>>>>  ________________________________
>>>>>  From: Jeff Eastman <[email protected]>
>>>>>  To: "[email protected]" 
> <[email protected]>;
>>>  Jeffrey <[email protected]>
>>>>>  Sent: Thursday, July 21, 2011 11:54 PM
>>>>>  Subject: RE: fkmeans or Cluster Dumper not working?
>>>>> 
>>>>>  Excellent, so this appears to be localized to fuzzyk. 
> Unfortunately, the
>>>>> 
>>>  Apache mail server strips off attachments so you'd need another 
> mechanism
>>>  (a
>>>  JIRA?) to upload your data if it is not too large. Some more questions 
> in
>>>  the
>>>  interim:
>>>>> 
>>>>>  - What is the cardinality of your vector data?
>>>>>  - Is it sparse or dense?
>>>>>  - How many vectors are you trying to cluster?
>>>>>  - What is the exact error you see when fkmeans fails with k=10? 
> With
>>>  k=50?
>>>>>  - What are the Hadoop heap settings you are using for your job?
>>>>> 
>>>>>  -----Original Message-----
>>>>>  From: Jeffrey [mailto:[email protected]]
>>>>>  Sent: Thursday, July 21, 2011 11:21 AM
>>>>>  To: [email protected]
>>>>>  Subject: Re: fkmeans or Cluster Dumper not
>>>  working?
>>>>> 
>>>>>  Hi Jeff,
>>>>> 
>>>>>  Q: Did you change your invocation to specify a different -c 
> directory
>>>  (e.g. clusters-0)?
>>>>>  A: Yes :)
>>>>> 
>>>>>  Q: Did you add the -cl argument?
>>>>>  A: Yes :)
>>>>> 
>>>>>  $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output
>>>  sensei/clusters --clusters sensei/clusters/clusters-0 --clustering
>>>  --overwrite
>>>  --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>>>>>  $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output
>>>  sensei/clusters --clusters sensei/clusters/clusters-0 --clustering
>>>  --overwrite
>>>  --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>>>  $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output
>>>  sensei/clusters --clusters sensei/clusters/clusters-0 --clustering
>>>  --overwrite
>>>  --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>>>> 
>>>>>  Q: What is the new CLI invocation for clusterdump?
>>>>>  A:
>>>>>  $ bin/mahout clusterdump --seqFileDir 
> sensei/clusters/clusters-4
>>>  --pointsDir
>>>  sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>>>> 
>>>>> 
>>>>>  Q: Did this work for -k 10? What happens with -k 50?
>>>>>  A: works for k=5 (but I don't see the points), but not for k=10; fkmeans
>>>  fails when k=50, so I can't dump when k=50
>>>>> 
>>>>>  Q: Have you tried kmeans?
>>>>>  A: Yes (all tested on 0.6-snapshot)
>>>>> 
>>>>>  k=5: no problem :)
>>>>>  k=10: no problem :)
>>>>>  k=50: no problem :)
>>>>> 
>>>>>  p/s: attached with the test data i used (in mvc format), let me 
> know if
>>>  you guys prefer raw data in arff format
>>>>> 
>>>>>  Best wishes,
>>>>>  Jeffrey04
>>>>> 
>>>>> 
>>>>> 
>>>>>>  ________________________________
>>>>>>  From: Jeff Eastman <[email protected]>
>>>>>>  To: "[email protected]"
>>>  <[email protected]>; Jeffrey <[email protected]>
>>>>>>  Sent: Thursday, July 21, 2011 9:36 PM
>>>>>>  Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>> 
>>>>>>  You are correct, the wiki for fkmeans did not mention the 
> -cl
>>>  argument. I've added that just now. I think this is what Frank 
> means in
>>>  his
>>>  comment but you do *not* have to write any custom code to get the 
> cluster
>>>  dumper
>>>  to do what you want, just use the -cl argument and specify 
> clusteredPoints
>>>  as
>>>  the -p input to clusterdump.
>>>>>> 
>>>>>>  Check out TestClusterDumper.testKmeans and 
> .testFuzzyKmeans. These
>>>  show how to invoke the clustering and cluster dumper from Java at 
> least.
>>>>>> 
>>>>>>  Did you change your invocation to specify a different -c 
> directory
>>>  (e.g. clusters-0)?
>>>>>>  Did you add the -cl argument?
>>>>>>  What is the new CLI invocation for clusterdump?
>>>>>>  Did this work for -k 10? What happens with -k
>>>  50?
>>>>>>  Have you tried kmeans?
>>>>>> 
>>>>>>  I can help you better if you will give me answers to my 
> questions
>>>>>> 
>>>>>>  -----Original Message-----
>>>>>>  From: Jeffrey [mailto:[email protected]]
>>>>>>  Sent: Thursday, July 21, 2011 4:30 AM
>>>>>>  To: [email protected]
>>>>>>  Subject: Re: fkmeans or Cluster Dumper not working?
>>>>>> 
>>>>>>  Hi again,
>>>>>> 
>>>>>>  Let me update on what's working and what's not 
> working.
>>>>>> 
>>>>>>  Works:
>>>>>>  fkmeans clustering (10 clusters) - thanks Jeff for the --cl 
> tip
>>>>>>  fkmeans clustering (5 clusters)
>>>>>>  clusterdump (5 clusters) - so points are not included in 
> the
>>>  clusterdump and I need to write a program for it?
>>>>>> 
>>>>>>  Not Working:
>>>>>>  fkmeans clustering (50 clusters) - same error
>>>>>>  clusterdump (10
>>>  clusters) - same error
>>>>>> 
>>>>>> 
>>>>>>  so it seems that, to attach points to the cluster dumper output like the
>>>  synthetic control example does, I would have to write some code, as
>>>  pointed out by @Frank_Scholten?
>>>  https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>>>>> 
>>>>>>  Best wishes,
>>>>>>  Jeffrey04
>>>>>> 
>>>>>>>  ________________________________
>>>>>>>  From: Jeff Eastman <[email protected]>
>>>>>>>  To: "[email protected]"
>>>  <[email protected]>; Jeffrey <[email protected]>
>>>>>>>  Sent: Wednesday, July 20, 2011 11:53 PM
>>>>>>>  Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>>> 
>>>>>>>  Hi Jeffrey,
>>>>>>> 
>>>>>>>  It is always difficult to debug remotely, but here are 
> some
>>>  suggestions:
>>>>>>>  - First, you are specifying both an input clusters 
> directory
>>>  --clusters and --numClusters clusters so the job is sampling 10 points
>>>  from your
>>>  input data set and writing them to clusteredPoints as the prior 
> clusters
>>>  for the
>>>  first iteration. You should pick a different name for this directory, 
> as
>>>  the
>>>  clusteredPoints directory is used by the -cl (--clustering) option 
> (which
>>>  you
>>>  did not supply) to write out the clustered (classified) input vectors.
>>>  When you
>>>  subsequently supplied clusteredPoints to the clusterdumper it was
>>>  expecting a
>>>  different format and that caused the exception you saw. Change your
>>>  --clusters
>>>  directory (clusters-0 is good)
>>>  and add a -cl argument and things should go more smoothly. The -cl 
> option
>>>  is not
>>>  the default and so no clustering of the input points is performed 
> without
>>>  this
>>>  (Many people get caught by this and perhaps the default should be 
> changed,
>>>  but
>>>  clustering can be expensive and so it is not performed without 
> request).
>>>>>>>  - If you still have problems, try again with k-means. 
> The
>>>  similarity to fkmeans is good and it will eliminate fkmeans itself if 
> you
>>>  see
>>>  the same problems with k-means
>>>>>>>  - I don't see why changing the -k argument from 10 
> to 50
>>>  should cause any problems, unless your vectors are very large and you 
> are
>>>  getting an OME in the reducer. Since the reducer is calculating 
> centroid
>>>  vectors
>>>  for the next iteration these will become more dense and memory will
>>>  increase
>>>  substantially.
>>>>>>>  - I can't figure out what might be causing your 
> second
>>>  exception. It is bombing inside of Hadoop file IO and this causes me to
>>>  suspect
>>>  command argument
>>>  problems.
>>>>>>> 
>>>>>>>  Hope this helps,
>>>>>>>  Jeff
>>>>>>> 
>>>>>>> 
>>>>>>>  -----Original Message-----
>>>>>>>  From: Jeffrey [mailto:[email protected]]
>>>>>>>  Sent: Wednesday, July 20, 2011 2:41 AM
>>>>>>>  To: [email protected]
>>>>>>>  Subject: fkmeans or Cluster Dumper not working?
>>>>>>> 
>>>>>>>  Hi,
>>>>>>> 
>>>>>>>  I am trying to generate clusters using the fkmeans 
> command line
>>>  tool from my test data. Not sure if this is correct, as it only runs 
> one
>>>  iteration (output from 0.6-snapshot, gotta use some workaround to some
>>>  weird bug
>>>  -
>>> 
> http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
>>> 
>>>  )
>>>>>>> 
>>>>>>>  $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc 
> --output
>>>  sensei/clusters --clusters sensei/clusteredPoints --maxIter 10
>>>  --numClusters 10
>>>  --overwrite --m 5
>>>>>>>  Running on hadoop, using
>>> 
> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/confMAHOUT-JOB:
>>> 
>>> 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar11/07/20
>>> 
>>>  14:05:18 INFO common.AbstractJob: Command line arguments:
>>>  {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5,
>>> 
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>> 
>>>  --emitMostLikely=true, --endPhase=2147483647,
>>>  --input=sensei/image-tag.arff.mvc,
>>>  --m=5, --maxIter=10, --method=mapreduce, --numClusters=10,
>>>  --output=sensei/clusters, --overwrite=null, --startPhase=0,
>>>  --tempDir=temp,
>>>  --threshold=0}11/07/20 14:05:20 INFO common.HadoopUtil: Deleting
>>>  sensei/clusters11/07/20
>>>  14:05:20 INFO common.HadoopUtil: Deleting 
> sensei/clusteredPoints11/07/20
>>>  14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop
>>>  library11/07/20
>>>  14:05:20 INFO zlib.ZlibFactory: Successfully
>>>>>>>  loaded & initialized native-zlib library11/07/20 
> 14:05:20
>>>  INFO compress.CodecPool: Got brand-new compressor11/07/20 14:05:20 INFO
>>>  compress.CodecPool: Got brand-new decompressor
>>>>>>>  11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: 
> Wrote 10
>>>  vectors to sensei/clusteredPoints/part-randomSeed
>>>>>>>  11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: 
> Fuzzy
>>>  K-Means Iteration 1
>>>>>>>  11/07/20 14:05:30 INFO input.FileInputFormat: Total 
> input paths
>>>  to process : 1
>>>>>>>  11/07/20 14:05:30 INFO mapred.JobClient: Running job:
>>>  job_201107201152_0021
>>>>>>>  11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 
> 0%
>>>>>>>  11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 
> 0%
>>>>>>>  11/07/20 14:05:57 INFO
>>>  mapred.JobClient:  map 5% reduce 0%
>>>>>>>  11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 
> 0%
>>>>>>>  11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 
> 0%
>>>>>>>  11/07/20 14:06:07 INFO mapred.JobClient:  map 10% 
> reduce 0%
>>>>>>>  11/07/20 14:06:10 INFO mapred.JobClient:  map 13% 
> reduce 0%
>>>>>>>  11/07/20 14:06:13 INFO mapred.JobClient:  map 15% 
> reduce 0%
>>>>>>>  11/07/20 14:06:16 INFO mapred.JobClient:  map 17% 
> reduce 0%
>>>>>>>  11/07/20 14:06:19 INFO mapred.JobClient:  map 19% 
> reduce 0%
>>>>>>>  11/07/20 14:06:22 INFO mapred.JobClient:  map 23% 
> reduce 0%
>>>>>>>  11/07/20 14:06:25 INFO mapred.JobClient:  map 25% 
> reduce 0%
>>>>>>>  11/07/20 14:06:28 INFO mapred.JobClient:  map 27% 
> reduce 0%
>>>>>>>  11/07/20 14:06:31 INFO mapred.JobClient:  map 30% 
> reduce 0%
>>>>>>>  11/07/20 14:06:34 INFO mapred.JobClient:  map 33% 
> reduce
>>>  0%
>>>>>>>  11/07/20 14:06:37 INFO mapred.JobClient:  map 36% 
> reduce 0%
>>>>>>>  11/07/20 14:06:40 INFO mapred.JobClient:  map 37% 
> reduce 0%
>>>>>>>  11/07/20 14:06:43 INFO mapred.JobClient:  map 40% 
> reduce 0%
>>>>>>>  11/07/20 14:06:46 INFO mapred.JobClient:  map 43% 
> reduce 0%
>>>>>>>  11/07/20 14:06:49 INFO mapred.JobClient:  map 46% 
> reduce 0%
>>>>>>>  11/07/20 14:06:52 INFO mapred.JobClient:  map 48% 
> reduce 0%
>>>>>>>  11/07/20 14:06:55 INFO mapred.JobClient:  map 50% 
> reduce 0%
>>>>>>>  11/07/20 14:06:57 INFO mapred.JobClient:  map 53% 
> reduce 0%
>>>>>>>  11/07/20 14:07:00 INFO mapred.JobClient:  map 56% 
> reduce 0%
>>>>>>>  11/07/20 14:07:03 INFO mapred.JobClient:  map 58% 
> reduce 0%
>>>>>>>  11/07/20 14:07:06 INFO mapred.JobClient:  map 60% 
> reduce 0%
>>>>>>>  11/07/20 14:07:09 INFO mapred.JobClient:  map 63% 
> reduce 0%
>>>>>>>  11/07/20 14:07:13 INFO
>>>  mapred.JobClient:  map 65% reduce 0%
>>>>>>>  11/07/20 14:07:16 INFO mapred.JobClient:  map 67% 
> reduce 0%
>>>>>>>  11/07/20 14:07:19 INFO mapred.JobClient:  map 70% 
> reduce 0%
>>>>>>>  11/07/20 14:07:22 INFO mapred.JobClient:  map 73% 
> reduce 0%
>>>>>>>  11/07/20 14:07:25 INFO mapred.JobClient:  map 75% 
> reduce 0%
>>>>>>>  11/07/20 14:07:28 INFO mapred.JobClient:  map 77% 
> reduce 0%
>>>>>>>  11/07/20 14:07:31 INFO mapred.JobClient:  map 80% 
> reduce 0%
>>>>>>>  11/07/20 14:07:34 INFO mapred.JobClient:  map 83% 
> reduce 0%
>>>>>>>  11/07/20 14:07:37 INFO mapred.JobClient:  map 85% 
> reduce 0%
>>>>>>>  11/07/20 14:07:40 INFO mapred.JobClient:  map 87% 
> reduce 0%
>>>>>>>  11/07/20 14:07:43 INFO mapred.JobClient:  map 89% 
> reduce 0%
>>>>>>>  11/07/20 14:07:46 INFO mapred.JobClient:  map 92% 
> reduce 0%
>>>>>>>  11/07/20 14:07:49 INFO mapred.JobClient:  map 95% 
> reduce
>>>  0%
>>>>>>>  11/07/20 14:07:55 INFO mapred.JobClient:  map 98% 
> reduce 0%
>>>>>>>  11/07/20 14:07:59 INFO mapred.JobClient:  map 99% 
> reduce 0%
>>>>>>>  11/07/20 14:08:02 INFO mapred.JobClient:  map 100% 
> reduce 0%
>>>>>>>  11/07/20 14:08:23 INFO mapred.JobClient:  map 100% 
> reduce 100%
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient: Job complete:
>>>  job_201107201152_0021
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Launched 
> reduce
>>>  tasks=1
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:
>>>  SLOTS_MILLIS_MAPS=149314
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Total time 
> spent by
>>>  all reduces waiting after reserving slots (ms)=0
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Total time 
> spent by
>>>  all maps waiting after
>>>  reserving slots (ms)=0
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Launched 
> map
>>>  tasks=1
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Data-local 
> map
>>>  tasks=1
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:
>>>  SLOTS_MILLIS_REDUCES=15618
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:   File Output 
> Format
>>>  Counters
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Bytes
>>>  Written=2247222
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Converged
>>>  Clusters=10
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:   
> FileSystemCounters
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:
>>>  FILE_BYTES_READ=130281382
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:
>>>  HDFS_BYTES_READ=254494
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:
>>>  FILE_BYTES_WRITTEN=132572666
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:
>>>  HDFS_BYTES_WRITTEN=2247222
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:   File Input 
> Format
>>>  Counters
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Bytes 
> Read=247443
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce 
> Framework
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Reduce 
> input
>>>  groups=10
>>>>>>>  11/07/20 14:08:31 INFO mapred.JobClient:     Map output
>>>  materialized bytes=2246233
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Combine 
> output
>>>  records=330
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Map input
>>>  records=1113
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Reduce 
> shuffle
>>>  bytes=2246233
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Reduce 
> output
>>>  records=10
>>>>>>>  11/07/20 14:08:32 INFO
>>>  mapred.JobClient:     Spilled Records=590
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Map output
>>>  bytes=2499995001
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Combine 
> input
>>>  records=11450
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Map output
>>>  records=11130
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     
> SPLIT_RAW_BYTES=127
>>>>>>>  11/07/20 14:08:32 INFO mapred.JobClient:     Reduce 
> input
>>>  records=10
>>>>>>>  11/07/20 14:08:32 INFO driver.MahoutDriver: Program 
> took 194096
>>>  ms
>>>>>>> 
>>>>>>>  if I increase the --numClusters argument (e.g. 50), 
> then it will
>>>  return exception after
>>>>>>>  11/07/20 14:08:02 INFO mapred.JobClient:  map 100% 
> reduce 0%
>>>>>>> 
>>>>>>>  and would retry again (also reproducible using 
> 0.6-snapshot)
>>>>>>> 
>>>>>>>  ...
>>>>>>>  11/07/20 14:22:25 INFO mapred.JobClient:  map 100% 
> reduce
>>>  0%
>>>>>>>  11/07/20 14:22:30 INFO mapred.JobClient: Task Id :
>>>  attempt_201107201152_0022_m_000000_0, Status : FAILED
>>>>>>>  org.apache.hadoop.util.DiskChecker$DiskErrorException: 
> Could not
>>>  find any valid local directory for output/file.out
>>>>>>>          at
>>> 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>>>>>          at
>>> 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>>>>>          at
>>> 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>>>>>          at
>>> 
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>>>>>          at
>>> 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>>>>>          at
>>> 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>>>>>          at
>>> 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>>>>>          at
>>>  org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>>>>>          at
>>>  org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>>>>          at 
> org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>>>>          at 
> java.security.AccessController.doPrivileged(Native
>>>  Method)
>>>>>>>          at 
> javax.security.auth.Subject.doAs(Subject.java:416)
>>>>>>>          at
>>> 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>>>          at 
> org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>>>> 
>>>>>>>  11/07/20 14:22:32 INFO
>>>  mapred.JobClient:  map 0% reduce 0%
>>>>>>>  ...
>>>>>>> 
>>>>>>>  Then I ran cluster dumper to dump information about the
>>>  clusters, this command would work if I only care about the cluster
>>>  centroids
>>>  (both 0.5 release and 0.6-snapshot)
>>>>>>> 
>>>>>>>  $ bin/mahout clusterdump --seqFileDir 
> sensei/clusters/clusters-1
>>>  --output image-tag-clusters.txt
>>>>>>>  Running on hadoop, using
>>>  HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>>> 
> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>>>  MAHOUT-JOB:
>>> 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>>>  11/07/20 14:33:45 INFO common.AbstractJob: Command line
>>>  arguments: {--dictionaryType=text, --endPhase=2147483647,
>>>  --output=image-tag-clusters.txt, 
> --seqFileDir=sensei/clusters/clusters-1,
>>>  --startPhase=0, --tempDir=temp}
>>>>>>>  11/07/20 14:33:56 INFO driver.MahoutDriver: Program 
> took 11761
>>>  ms
>>>>>>> 
>>>>>>>  but if I want to see the degree of membership of each point, I
>>>  get another exception (yes, reproducible for both 0.5 release and
>>>  0.6-snapshot)
>>>>>>> 
>>>>>>>  $ bin/mahout clusterdump --seqFileDir 
> sensei/clusters/clusters-1
>>>  --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>>>>>  Running on hadoop, using
>>>  HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>>> 
> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>>>  MAHOUT-JOB:
>>> 
> /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>>>  11/07/20 14:35:08 INFO common.AbstractJob: Command line
>>>  arguments: {--dictionaryType=text, --endPhase=2147483647,
>>>  --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints,
>>>  --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, 
> --tempDir=temp}
>>>>>>>  11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded 
> the
>>>  native-hadoop
>>>  library
>>>>>>>  11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully 
> loaded
>>>  & initialized native-zlib library
>>>>>>>  11/07/20 14:35:10 INFO compress.CodecPool: Got 
> brand-new
>>>  decompressor
>>>>>>>  Exception in thread "main"
>>>  java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast 
> to
>>>  org.apache.hadoop.io.IntWritable
>>>>>>>          at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>>>>>          at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>>>>          at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>>>>          at
>>> 
> org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>>>>          at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>  Method)
>>>>>>> 
>>>     at
>>> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>          at
>>> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>          at 
> java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>>          at
>>> 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>>          at
>>>  org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>>          at
>>>  org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>>>>          at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>  Method)
>>>>>>>          at
>>> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>          at
>>> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>          at 
> java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>>          at 
> org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>>> 
>>>>>>>  erm, would writing a short program to call the API (btw, I can't seem
>>>  to find the latest API doc?) be a better choice here? Or did I do
>>>  anything wrong here (yes, Java is not my main language, and I am very
>>>  new to Mahout... and Hadoop)?
>>>>>>> 
>>>>>>>  the data is converted from an arff file with about 1000 
> rows
>>>  (resource) and 14k columns (tag), and it is just a subset of my data.
>>>  (actually
>>>  made a mistake so it is now generating resource clusters instead of tag
>>>  clusters, but I am just doing this as a proof of concept whether mahout 
> is
>>>  good
>>>  enough for the task)
>>>>>>> 
>>>>>>>  Best
>>>  wishes,
>>>>>>>  Jeffrey04
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
> 
> -- 
> Lance Norskog
> [email protected]
>
