The example should work; I tested it yesterday. The simplest way to execute it is to first build Mahout:

$ mvn -DskipTests clean install

Then download the MovieLens 1M dataset from http://www.grouplens.org/node/73 and unzip it. After that, go to examples/bin and point the script at the ratings.dat file from the MovieLens dataset:

$ export MAHOUT_LOCAL=true
$ bash factorize-movielens-1M.sh /path/to/ratings.dat

Note that setting MAHOUT_LOCAL=true matters here: without it, Mahout submits the jobs to Hadoop, which then looks for the input in HDFS rather than in your local /tmp directory.

Best,
Sebastian

On 19.01.2013 00:20, Kamal Ali wrote:
> I'm a newbie trying to get some Mahout command-line examples to work.
>
> I tried executing factorize-movielens-1M.sh but get an error "input path
> does not exist: /tmp/mahout-work-kali/movielens/ratings.csv",
> even after I manually created /tmp/mahout-work-kali/ and all its descendant
> directories and chmod'd them to 777,
> and even after I modified factorize-movielens-1M.sh to do an "ls -l" on
> ratings.csv, which shows that /tmp/mahout-work-kali/movielens/ratings.csv
> exists.
>
> [The input file u1.base already has "::" instead of \t as delimiters.]
>
> I'm wondering whether the error is really something else being mis-reported:
> some intermediate script/program may just be getting a non-zero
> return status and falling back on a stock error message.
>
> I am on a 64-bit Mac, JDK 1.7. My ssh keys were generated as user "kali".
>
> Has anyone had success running factorize-movielens-1M.sh?
>
> Does this factorize*.sh only run in Mahout local mode?
>
> Is factorize-movielens-1M.sh cruddy and old, and should some other way
> be used?
>
> I'm primarily interested in getting the ALS methods to work.
> If someone knows where in the Mahout distribution one can find the
> latest or most-tested ALS implementation (and the Maven command to run it),
> please let me know.
>
> THANK YOU!
> kamal.
>
> My hadoop-env.sh is at the end of this email.
> ================================================
> ./factorize-movielens-1M.sh $grouplens/ml-100k/u1.base  # grouplens
> points to a directory containing the file u1.base
> creating work directory at /tmp/mahout-work-kali
> kamal: doing ls -l on movie lens dir:
> total 1544
> drwxrwxrwx 3 kali wheel 102 Jan 18 12:20 dataset
> -rwxrwxrwx 1 kali wheel 786544 Jan 18 13:46 ratings.csv
> kamal: doing wc -l on ratings.csv
> 80000 /tmp/mahout-work-kali/movielens/ratings.csv
> Converting ratings...
> after sed
> -rwxrwxrwx 1 kali wheel 786544 Jan 18 13:47 /tmp/mahout-work-kali/movielens/ratings.csv
> kamal: doing head on ratings.csv
> 1,1,5
> 1,2,3
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /Users/kali/hadoop/hadoop-1.0.4/bin/hadoop and
> HADOOP_CONF_DIR=/Users/kali/hadoop/hadoop-1.0.4/conf
> MAHOUT-JOB: /users/kali/mahout/mahout0.7/examples/target/mahout-examples-0.7-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 13/01/18 13:47:24 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[/tmp/mahout-work-kali/movielens/ratings.csv],
> --output=[/tmp/mahout-work-kali/dataset], --probePercentage=[0.1],
> --startPhase=[0], --tempDir=[/tmp/mahout-work-kali/dataset/tmp],
> --trainingPercentage=[0.9]}
> 2013-01-18 13:47:24.918 java[53562:1703] Unable to load realm info from
> SCDynamicStore
> 13/01/18 13:47:25 INFO mapred.JobClient: Cleaning up the staging area
> hdfs://localhost:9000/tmp/hadoop-kali/mapred/staging/kali/.staging/job_201301151900_0035
> 13/01/18 13:47:25 ERROR security.UserGroupInformation:
> PriviledgedActionException as:kali
> cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
> path does not exist: /tmp/mahout-work-kali/movielens/ratings.csv
> Exception in thread "main"
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> does not exist: /tmp/mahout-work-kali/movielens/ratings.csv
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
> at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
> at org.apache.mahout.cf.taste.hadoop.als.DatasetSplitter.run(DatasetSplitter.java:90)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.mahout.cf.taste.hadoop.als.DatasetSplitter.main(DatasetSplitter.java:64)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> after splitDataset
> -rwxrwxrwx 1 kali wheel 786544 Jan 18 13:47 /tmp/mahout-work-kali/movielens/ratings.csv
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /Users/kali/hadoop/hadoop-1.0.4/bin/hadoop and
> HADOOP_CONF_DIR=/Users/kali/hadoop/hadoop-1.0.4/conf
> MAHOUT-JOB: /users/kali/mahout/mahout0.7/examples/target/mahout-examples-0.7-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 13/01/18 13:47:31 INFO common.AbstractJob: Command line arguments:
> {--alpha=[40], --endPhase=[2147483647], --implicitFeedback=[false],
> --input=[/tmp/mahout-work-kali/dataset/trainingSet/], --lambda=[0.065],
> --numFeatures=[20], --numIterations=[10],
> --output=[/tmp/mahout-work-kali/als/out], --startPhase=[0],
> --tempDir=[/tmp/mahout-work-kali/als/tmp]}
> 2013-01-18 13:47:31.259 java[53605:1703] Unable to load realm info from
> SCDynamicStore
> 13/01/18 13:47:32 INFO mapred.JobClient: Cleaning up the staging area
> hdfs://localhost:9000/tmp/hadoop-kali/mapred/staging/kali/.staging/job_201301151900_0036
> 13/01/18 13:47:32 ERROR security.UserGroupInformation:
> PriviledgedActionException as:kali
> cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
> path does not exist: /tmp/mahout-work-kali/dataset/trainingSet
> Exception in thread "main"
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> does not exist: /tmp/mahout-work-kali/dataset/trainingSet
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
> at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
> at org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.run(ParallelALSFactorizationJob.java:137)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.main(ParallelALSFactorizationJob.java:98)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /Users/kali/hadoop/hadoop-1.0.4/bin/hadoop and
> HADOOP_CONF_DIR=/Users/kali/hadoop/hadoop-1.0.4/conf
> MAHOUT-JOB: /users/kali/mahout/mahout0.7/examples/target/mahout-examples-0.7-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 13/01/18 13:47:38 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[/tmp/mahout-work-kali/dataset/probeSet/],
> --itemFeatures=[/tmp/mahout-work-kali/als/out/M/],
> --output=[/tmp/mahout-work-kali/als/rmse/], --startPhase=[0],
> --tempDir=[/tmp/mahout-work-kali/als/tmp],
> --userFeatures=[/tmp/mahout-work-kali/als/out/U/]}
> 2013-01-18 13:47:38.142 java[53645:1703] Unable to load realm info from
> SCDynamicStore
> 13/01/18 13:47:38 INFO mapred.JobClient: Cleaning up the staging area
> hdfs://localhost:9000/tmp/hadoop-kali/mapred/staging/kali/.staging/job_201301151900_0037
> 13/01/18 13:47:38 ERROR security.UserGroupInformation:
> PriviledgedActionException as:kali
> cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
> path does not exist: /tmp/mahout-work-kali/dataset/probeSet
> Exception in thread "main"
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> does not exist: /tmp/mahout-work-kali/dataset/probeSet
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
> at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
> at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator.run(FactorizationEvaluator.java:91)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator.main(FactorizationEvaluator.java:68)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /Users/kali/hadoop/hadoop-1.0.4/bin/hadoop and
> HADOOP_CONF_DIR=/Users/kali/hadoop/hadoop-1.0.4/conf
> MAHOUT-JOB: /users/kali/mahout/mahout0.7/examples/target/mahout-examples-0.7-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 13/01/18 13:47:44 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[/tmp/mahout-work-kali/als/out/userRatings/],
> --itemFeatures=[/tmp/mahout-work-kali/als/out/M/], --maxRating=[5],
> --numRecommendations=[6],
> --output=[/tmp/mahout-work-kali/recommendations/], --startPhase=[0],
> --tempDir=[temp], --userFeatures=[/tmp/mahout-work-kali/als/out/U/]}
> 2013-01-18 13:47:44.859 java[53687:1703] Unable to load realm info from
> SCDynamicStore
> 13/01/18 13:47:45 INFO mapred.JobClient: Cleaning up the staging area
> hdfs://localhost:9000/tmp/hadoop-kali/mapred/staging/kali/.staging/job_201301151900_0038
> 13/01/18 13:47:45 ERROR security.UserGroupInformation:
> PriviledgedActionException as:kali
> cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
> path does not exist: /tmp/mahout-work-kali/als/out/userRatings
> Exception in thread "main"
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> does not exist: /tmp/mahout-work-kali/als/out/userRatings
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
> at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
> at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
> at org.apache.mahout.cf.taste.hadoop.als.RecommenderJob.run(RecommenderJob.java:95)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.mahout.cf.taste.hadoop.als.RecommenderJob.main(RecommenderJob.java:69)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> RMSE is:
>
> cat: /tmp/mahout-work-kali/als/rmse/rmse.txt: No such file or directory
>
>
> Sample recommendations:
>
> cat: /tmp/mahout-work-kali/recommendations/part-m-00000: No such file or directory
>
>
> ==================================================
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME. All others are
> # optional. When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use. Required.
>
> export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_10.jdk/Contents/Home/jre
>
> # Extra Java CLASSPATH elements. Optional.
> # export HADOOP_CLASSPATH=
>
> # The maximum amount of heap to use, in MB. Default is 1000.
> # export HADOOP_HEAPSIZE=2000
>
> # Extra Java runtime options. Empty by default.
> # export HADOOP_OPTS=-server
>
> # Command specific options appended to HADOOP_OPTS when specified
> export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
> export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
> export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
> export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
> export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
> # export HADOOP_TASKTRACKER_OPTS=
> # The following applies to multiple commands (fs, dfs, fsck, distcp etc)
> # export HADOOP_CLIENT_OPTS
>
> # Extra ssh options. Empty by default.
> # export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
>
> # Where log files are stored. $HADOOP_HOME/logs by default.
> # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
>
> # File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
> # export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
>
> # host:path where hadoop code should be rsync'd from. Unset by default.
> # export HADOOP_MASTER=master:/home/$USER/src/hadoop
>
> # Seconds to sleep between slave commands. Unset by default. This
> # can be useful in large clusters, where, e.g., slave rsyncs can
> # otherwise arrive faster than the master can service them.
> # export HADOOP_SLAVE_SLEEP=0.1
>
> # The directory where pid files are stored. /tmp by default.
> # export HADOOP_PID_DIR=/var/hadoop/pids
>
> # A string representing this instance of hadoop. $USER by default.
> # export HADOOP_IDENT_STRING=$USER
>
> # The scheduling priority for daemon processes. See 'man nice'.
> # export HADOOP_NICENESS=10
>
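P.S. The "Converting ratings..." step in the log above is just a delimiter rewrite, and you can sanity-check it in isolation. A minimal sketch, assuming a sed/cut pipeline like the one the script uses (the exact flags in factorize-movielens-1M.sh may differ, and the file names here are hypothetical):

```shell
# Two sample lines in MovieLens format: userID::movieID::rating::timestamp
printf '1::1193::5::978300760\n1::661::3::978302109\n' > /tmp/ratings-sample.dat

# Replace every "::" with "," and keep only the first three fields,
# producing the userID,movieID,rating CSV the Mahout jobs expect
sed 's/::/,/g' /tmp/ratings-sample.dat | cut -d, -f1-3 > /tmp/ratings-sample.csv

cat /tmp/ratings-sample.csv
# 1,1193,5
# 1,661,3
```

Since Kamal's `head` output (`1,1,5` / `1,2,3`) shows the conversion already worked, the "input path does not exist" error is most likely not a formatting problem but a filesystem one: with MAHOUT_LOCAL unset, the jobs were submitted to Hadoop and looked for the path in HDFS, while the script had only written the file to the local /tmp.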
