I've made a little bit of progress here, but not much.  Here's what I ran:

elastic-mapreduce -j <JOB> \
  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job \
  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver \
  --arg --input --arg s3n://news-vecs/part-out.vec \
  --arg --clusters --arg s3n://news-vecs/kmeans/clusters/ \
  --arg --k --arg 10 \
  --arg --output --arg s3n://news-vecs/out/ \
  --arg --distanceMeasure --arg org.apache.mahout.common.distance.CosineDistanceMeasure \
  --arg --convergenceDelta --arg 0.001 \
  --arg --overwrite \
  --arg --maxIter --arg 50 \
  --arg --clustering \
  -v --debug
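One variation that might be worth trying, assuming this version of the elastic-mapreduce Ruby client supports it, is the single comma-separated --args form instead of repeated --arg flags, which sidesteps any per-token shell-quoting surprises:

```shell
# Hypothetical alternative invocation: pass the whole argument list as one
# comma-separated --args value (supported by some versions of the
# elastic-mapreduce Ruby client; the client splits on commas).
# Note: the driver's usage lists the short flag as -k (long form --numClusters).
elastic-mapreduce -j <JOB> \
  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job \
  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver \
  --args "--input,s3n://news-vecs/part-out.vec,--clusters,s3n://news-vecs/kmeans/clusters/,-k,10,--output,s3n://news-vecs/out/,--distanceMeasure,org.apache.mahout.common.distance.CosineDistanceMeasure,--convergenceDelta,0.001,--overwrite,--maxIter,50,--clustering"
```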

In the controller log, I see:
2010-09-11T23:49:16.958Z INFO Fetching jar file.
2010-09-11T23:49:20.723Z INFO Working dir /mnt/var/lib/hadoop/steps/1
2010-09-11T23:49:20.723Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java -cp 
/home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.20-core.jar:/home/hadoop/hadoop-0.20-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/*
 -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/1 
-Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop 
-Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/1/tmp 
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 
org.apache.hadoop.util.RunJar 
/mnt/var/lib/hadoop/steps/1/mahout-core-0.4-SNAPSHOT.job 
org.apache.mahout.clustering.kmeans.KMeansDriver --input 
s3n://news-vecs/part-out.vec --clusters s3n://news-vecs/kmeans/clusters/ --k 10 
--output s3n://news-vecs/out/ --distanceMeasure 
org.apache.mahout.common.distance.CosineDistanceMeasure --convergenceDelta 
0.001 --overwrite --maxIter 50 --clustering
2010-09-11T23:49:23.302Z INFO Execution ended with ret val 0
2010-09-11T23:49:25.415Z INFO Step created jobs: 
2010-09-11T23:49:25.416Z INFO Step succeeded

But then, in the stdout log, I see:
<snip>
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>             comma separated archives to be unarchived
                               on the compute machines.
 -conf <configuration file>    specify an application configuration file
 -D <property=value>           use value for given property
 -files <paths>                comma separated files to be copied to the
                               map reduce cluster
 -fs <local|namenode:port>     specify a namenode
 -jt <local|jobtracker:port>   specify a job tracker
 -libjars <paths>              comma separated jar files to include in the
                               classpath.
Job-Specific Options:                                                           
  --input (-i) input                           Path to job input directory.     
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --clusters (-c) clusters                     The input centroids, as Vectors. 
                                               Must be a SequenceFile of        
                                               Writable, Cluster/Canopy.  If k  
                                               is also specified, then a random 
                                               set of vectors will be selected  
                                               and written out to this path     
                                               first                            
  --numClusters (-k) k                         The k in k-Means.  If specified, 
                                               then a random selection of k     
                                               Vectors will be chosen as the    
                                               Centroid and written to the      
                                               clusters input path.             
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --maxRed (-r) maxRed                         The number of reduce tasks.      
                                               Defaults to 2                    
  --clustering (-cl)                           If present, run clustering after 
                                               the iterations have taken place  
  --method (-xm) method                        The execution method to use:     
                                               sequential or mapreduce. Default 
                                               is mapreduce                     
  --help (-h)                                  Print out help                   
  --tempDir tempDir                            Intermediate output directory    
  --startPhase startPhase                      First phase to run               
  --endPhase endPhase                          Last phase to run                
</snip>

Which, of course, shows that the driver isn't receiving the arguments.  Perhaps 
it's the s3n:// paths?  (Though I also notice the usage output lists the flag as 
--numClusters (-k), while I passed --k, so that could be the real culprit.)  I'm 
going to try running from ssh.

-Grant



On Sep 2, 2010, at 1:04 PM, Drew Farris wrote:

> Were there specific issues you ran into? I suspect the documentation
> on the wiki is out of date.
> 
> Drew
> 
> On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll <[email protected]> wrote:
>> Has anyone successfully run any of the clustering algorithms on Amazon's 
>> Elastic Map Reduce?  If so, please share the steps.
>> 
>> Thanks,
>> Grant

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
