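It says it right there: text files *with the preference data*. Your stack traces show the job trying to parse each input line as numeric IDs (`Long.parseLong` fails on your sentence), so a file of free text cannot work as input. This is a collaborative filtering tool, which is quite different from computing document similarity.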
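To make the difference concrete, here is a minimal sketch of the kind of line a preference-data mapper can digest. It assumes the usual `userID,itemID[,preference]` layout with a comma or tab delimiter; the class and method names are hypothetical and this is not Mahout's actual code, only an illustration of why a line of prose ends in the exceptions quoted below.

```java
import java.util.regex.Pattern;

public class PreferenceLineSketch {

    // Assumption: fields are separated by a comma or a tab.
    private static final Pattern DELIMITER = Pattern.compile("[,\t]");

    /** Parses {userID, itemID} from one input line; throws if the line is malformed. */
    public static long[] parseIds(String line) {
        String[] tokens = DELIMITER.split(line);
        if (tokens.length < 2) {
            // A line of prose without a delimiter yields a single token, so there is no itemID.
            throw new IllegalArgumentException(
                "expected userID,itemID[,preference] but got: " + line);
        }
        // Long.parseLong throws NumberFormatException on non-numeric text.
        long userID = Long.parseLong(tokens[0].trim());
        long itemID = Long.parseLong(tokens[1].trim());
        return new long[] {userID, itemID};
    }

    /** Returns true if the line looks like a usable preference record. */
    public static boolean isPreferenceLine(String line) {
        try {
            parseIds(line);
            return true;
        } catch (IllegalArgumentException e) {
            // NumberFormatException is a subclass of IllegalArgumentException.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPreferenceLine("1,101,5.0"));
        System.out.println(isPreferenceLine("For a young person who is years and above"));
    }
}
```

Running a check like this over the input before submitting the job is a cheap way to catch a wrong file early, instead of reading it out of a chain of failed local jobs.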

On Fri, Nov 26, 2010 at 8:25 AM, Divya <[email protected]> wrote:
> Hi,
>
> But in java doc of ItemSimilarityJob its written that
> "Dmapred.input.dir=(path): Directory containing one or more text files with the preference data"
> So I assumed that it may take text files also.
>
> Is there any way by which we can compute similarity between documents.
> I explored Mahout but couldn't find anything.
>
> Thanks
> Regards,
> Divya
>
> -----Original Message-----
> From: Sebastian Schelter [mailto:[email protected]]
> Sent: Friday, November 26, 2010 3:54 PM
> To: [email protected]
> Subject: Re: error in itemsimilarity
>
> ItemSimilarityJob can not be used to compute the similarity between text
> documents. It's thought to be used for Collaborative Filtering as
> described here:
>
> https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering
>
> On 26.11.2010 08:50, Divya wrote:
> > Hi,
> >
> > I am getting following exception when I try to run itemsimilarity from CL.
> >
> > My input data is a text file which just has one line of text
> >
> > Can any one please help me in resolving the error.
> >
> > $ bin/mahout itemsimilarity -i D:/MahoutResult/ItemSimilarity/Input_Data -o D:/MahoutResult/ItemSimilarity/Output -s DistributedUncenteredCosineVectorSimilarity.class
> > Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> > HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> > 10/11/26 15:43:50 INFO common.AbstractJob: Command line arguments: {--booleanData=false, --endPhase=2147483647, --input=D:/MahoutResult/ItemSimilarity/Input_Data, --maxCooccurrencesPerItem=100, --maxSimilaritiesPerItem=100, --output=D:/MahoutResult/ItemSimilarity/Output, --similarityClassname=DistributedUncenteredCosineVectorSimilarity.class, --startPhase=0, --tempDir=temp}
> > 10/11/26 15:43:51 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > 10/11/26 15:43:52 INFO input.FileInputFormat: Total input paths to process : 2
> > 10/11/26 15:43:53 INFO mapred.JobClient: Running job: job_local_0001
> > 10/11/26 15:43:53 INFO input.FileInputFormat: Total input paths to process : 2
> > 10/11/26 15:43:53 INFO mapred.MapTask: io.sort.mb = 100
> > 10/11/26 15:43:53 INFO mapred.MapTask: data buffer = 79691776/99614720
> > 10/11/26 15:43:53 INFO mapred.MapTask: record buffer = 262144/327680
> > 10/11/26 15:43:53 WARN mapred.LocalJobRunner: job_local_0001
> > java.lang.ArrayIndexOutOfBoundsException: 1
> >     at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47)
> >     at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > 10/11/26 15:43:54 INFO mapred.JobClient: map 0% reduce 0%
> > 10/11/26 15:43:54 INFO mapred.JobClient: Job complete: job_local_0001
> > 10/11/26 15:43:54 INFO mapred.JobClient: Counters: 0
> > 10/11/26 15:43:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> > 10/11/26 15:43:55 INFO input.FileInputFormat: Total input paths to process : 2
> > 10/11/26 15:43:55 INFO mapred.JobClient: Running job: job_local_0002
> > 10/11/26 15:43:55 INFO input.FileInputFormat: Total input paths to process : 2
> > 10/11/26 15:43:56 INFO mapred.MapTask: io.sort.mb = 100
> > 10/11/26 15:43:56 INFO mapred.MapTask: data buffer = 79691776/99614720
> > 10/11/26 15:43:56 INFO mapred.MapTask: record buffer = 262144/327680
> > 10/11/26 15:43:56 WARN mapred.LocalJobRunner: job_local_0002
> > java.lang.NumberFormatException: For input string: "For a young person who is years and above and below years he may be employed in an industrial undertaking His employer however is required to notify "
> >     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> >     at java.lang.Long.parseLong(Long.java:410)
> >     at java.lang.Long.parseLong(Long.java:468)
> >     at org.apache.mahout.cf.taste.hadoop.similarity.item.CountUsersMapper.map(CountUsersMapper.java:40)
> >     at org.apache.mahout.cf.taste.hadoop.similarity.item.CountUsersMapper.map(CountUsersMapper.java:31)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > 10/11/26 15:43:56 INFO mapred.JobClient: map 0% reduce 0%
> > 10/11/26 15:43:56 INFO mapred.JobClient: Job complete: job_local_0002
> > 10/11/26 15:43:56 INFO mapred.JobClient: Counters: 0
> > 10/11/26 15:43:56 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> > 10/11/26 15:43:57 INFO input.FileInputFormat: Total input paths to process : 2
> > 10/11/26 15:43:57 INFO mapred.JobClient: Running job: job_local_0003
> > 10/11/26 15:43:57 INFO input.FileInputFormat: Total input paths to process : 2
> > 10/11/26 15:43:57 INFO mapred.MapTask: io.sort.mb = 100
> > 10/11/26 15:43:57 INFO mapred.MapTask: data buffer = 79691776/99614720
> > 10/11/26 15:43:57 INFO mapred.MapTask: record buffer = 262144/327680
> > 10/11/26 15:43:58 WARN mapred.LocalJobRunner: job_local_0003
> > java.lang.NumberFormatException: For input string: "For a young person who is years and above and below years he may be employed in an industrial undertaking His employer however is required to notify "
> >     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> >     at java.lang.Long.parseLong(Long.java:410)
> >     at java.lang.Long.parseLong(Long.java:468)
> >     at org.apache.mahout.cf.taste.hadoop.ToEntityPrefsMapper.map(ToEntityPrefsMapper.java:57)
> >     at org.apache.mahout.cf.taste.hadoop.ToEntityPrefsMapper.map(ToEntityPrefsMapper.java:30)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > 10/11/26 15:43:58 INFO mapred.JobClient: map 0% reduce 0%
> > 10/11/26 15:43:58 INFO mapred.JobClient: Job complete: job_local_0003
> > 10/11/26 15:43:58 INFO mapred.JobClient: Counters: 0
> > 10/11/26 15:43:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> > 10/11/26 15:43:59 INFO input.FileInputFormat: Total input paths to process : 0
> > 10/11/26 15:43:59 INFO mapred.LocalJobRunner:
> > 10/11/26 15:43:59 INFO mapred.JobClient: Running job: job_local_0004
> > 10/11/26 15:43:59 INFO input.FileInputFormat: Total input paths to process : 0
> > 10/11/26 15:43:59 WARN mapred.LocalJobRunner: job_local_0004
> > java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> >     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> >     at java.util.ArrayList.get(ArrayList.java:322)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
> > 10/11/26 15:44:00 INFO mapred.JobClient: map 0% reduce 0%
> > 10/11/26 15:44:00 INFO mapred.JobClient: Job complete: job_local_0004
> > 10/11/26 15:44:00 INFO mapred.JobClient: Counters: 0
> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
> >     at org.apache.mahout.cf.taste.hadoop.TasteHadoopUtils.readIntFromFile(TasteHadoopUtils.java:103)
> >     at org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.run(ItemSimilarityJob.java:187)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >     at org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.main(ItemSimilarityJob.java:92)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >
> > Thanks
> > Regards,
> > Divya
