[if you got this mail twice, please accept my apologies, but I didn't
receive it myself after posting]
Hi all,
I am trying to get the ItemSimilarityJob to work for me. I have a
standalone Hadoop setup and I am running the job like this:
~$ hadoop jar /usr/local/mahout/core/target/mahout-core-0.8-SNAPSHOT-job.jar \
    org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
    --input /user/ubuntu --output /output \
    --similarityClassname SIMILARITY_COOCCURRENCE
In the HDFS folder /user/ubuntu, I have a fairly large (about 2 GB) file
with records of the form:
[userID],[objectID],1
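To make the record format concrete, here is a minimal sketch in Python (not part of the Mahout job) that checks whether a line has this shape. The function name parse_record and the sample IDs are made up for illustration, and the code assumes both IDs are plain integers.

```python
def parse_record(line):
    """Parse one 'userID,objectID,1' line into (user_id, object_id, value).

    Assumption: both IDs are plain integers; the third field is the
    preference value, which is always 1 in this data set.
    """
    parts = line.strip().split(",")
    if len(parts) != 3:
        raise ValueError("expected 3 comma-separated fields: %r" % line)
    user_id = int(parts[0])    # raises ValueError for non-numeric IDs
    object_id = int(parts[1])
    value = float(parts[2])
    return user_id, object_id, value


if __name__ == "__main__":
    # Hypothetical sample record, not taken from the real input file.
    print(parse_record("12,345,1"))
```

Running something like this over a small sample of the input is a cheap way to rule out malformed lines before looking at the job itself.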
I can see Hadoop start the job and map and reduce for a while, but at
some point it fails with the following log output:
13/05/23 17:21:10 INFO mapred.LocalJobRunner:
13/05/23 17:21:10 INFO mapred.MapTask: Finished spill 7
13/05/23 17:21:10 INFO mapred.MapTask: Starting flush of map output
13/05/23 17:21:12 INFO mapred.MapTask: Finished spill 8
13/05/23 17:21:12 INFO mapred.Merger: Merging 9 sorted segments
13/05/23 17:21:12 INFO mapred.Merger: Down to the last merge-pass, with 9 segments left of total size: 8563929 bytes
13/05/23 17:21:13 INFO mapred.LocalJobRunner:
13/05/23 17:21:13 INFO mapred.JobClient: map 100% reduce 0%
13/05/23 17:21:16 INFO mapred.LocalJobRunner:
13/05/23 17:21:19 INFO mapred.LocalJobRunner:
13/05/23 17:21:22 INFO mapred.LocalJobRunner:
13/05/23 17:21:25 INFO mapred.LocalJobRunner:
13/05/23 17:21:25 INFO mapred.Task: Task:attempt_local_0002_m_000029_0 is done. And is in the process of commiting
13/05/23 17:21:28 INFO mapred.LocalJobRunner:
13/05/23 17:21:28 INFO mapred.LocalJobRunner:
13/05/23 17:21:28 INFO mapred.Task: Task 'attempt_local_0002_m_000029_0' done.
13/05/23 17:21:28 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@39c07f3a
13/05/23 17:21:28 INFO mapred.MapTask: io.sort.mb = 100
13/05/23 17:21:28 INFO mapred.MapTask: data buffer = 79691776/99614720
13/05/23 17:21:28 INFO mapred.MapTask: record buffer = 262144/327680
13/05/23 17:21:28 WARN mapred.LocalJobRunner: job_local_0002
java.io.FileNotFoundException: File does not exist: /user/ubuntu/temp
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
13/05/23 17:21:29 INFO mapred.JobClient: Job complete: job_local_0002
13/05/23 17:21:29 INFO mapred.JobClient: Counters: 17
13/05/23 17:21:29 INFO mapred.JobClient: FileSystemCounters
13/05/23 17:21:29 INFO mapred.JobClient: FILE_BYTES_READ=28341894805
13/05/23 17:21:29 INFO mapred.JobClient: HDFS_BYTES_READ=90730532541
13/05/23 17:21:29 INFO mapred.JobClient: FILE_BYTES_WRITTEN=35182198348
13/05/23 17:21:29 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=841343250
13/05/23 17:21:29 INFO mapred.JobClient: File Input Format Counters
13/05/23 17:21:29 INFO mapred.JobClient: Bytes Read=1985044985
13/05/23 17:21:29 INFO mapred.JobClient: Map-Reduce Framework
13/05/23 17:21:29 INFO mapred.JobClient: Map output materialized bytes=436635087
13/05/23 17:21:29 INFO mapred.JobClient: Combine output records=0
13/05/23 17:21:29 INFO mapred.JobClient: Map input records=117838687
13/05/23 17:21:29 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
13/05/23 17:21:29 INFO mapred.JobClient: Spilled Records=289154604
13/05/23 17:21:29 INFO mapred.JobClient: Map output bytes=1300200048
13/05/23 17:21:29 INFO mapred.JobClient: CPU time spent (ms)=0
13/05/23 17:21:29 INFO mapred.JobClient: Total committed heap usage (bytes)=3796746240
13/05/23 17:21:29 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
13/05/23 17:21:29 INFO mapred.JobClient: Combine input records=0
13/05/23 17:21:29 INFO mapred.JobClient: Map output records=117838687
13/05/23 17:21:29 INFO mapred.JobClient: SPLIT_RAW_BYTES=3240
Exception in thread "main" java.io.FileNotFoundException: File does not exist: /user/ubuntu/temp/prepareRatingMatrix/numUsers.bin
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
    at org.apache.mahout.common.HadoopUtil.readInt(HadoopUtil.java:290)
    at org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.run(ItemSimilarityJob.java:146)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.main(ItemSimilarityJob.java:94)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
At this point, HDFS contains the following (my input file is
/user/ubuntu/input.txt):
drwxr-xr-x - ubuntu supergroup 0 2013-05-23 13:16 /user
drwxr-xr-x - ubuntu supergroup 0 2013-05-23 16:21 /user/ubuntu
-rw-r--r-- 3 ubuntu supergroup 1984926172 2013-05-23 13:06 /user/ubuntu/input.txt
drwxr-xr-x - ubuntu supergroup 0 2013-05-23 16:21 /user/ubuntu/temp
drwxr-xr-x - ubuntu supergroup 0 2013-05-23 16:47 /user/ubuntu/temp/prepareRatingMatrix
drwxr-xr-x - ubuntu supergroup 0 2013-05-23 16:47 /user/ubuntu/temp/prepareRatingMatrix/itemIDIndex
-rw-r--r-- 3 ubuntu supergroup 0 2013-05-23 16:47 /user/ubuntu/temp/prepareRatingMatrix/itemIDIndex/_SUCCESS
-rw-r--r-- 3 ubuntu supergroup 28044775 2013-05-23 16:47 /user/ubuntu/temp/prepareRatingMatrix/itemIDIndex/part-r-00000
drwxr-xr-x - ubuntu supergroup 0 2013-05-23 17:21 /user/ubuntu/temp/prepareRatingMatrix/userVectors
I have retried with identical results, and I have also tried a different
similarityClassname with the same outcome. Where am I going wrong here?
Thanks in advance for any pointers,
Teun