Hi Jake,

Thanks very much for the help.  I looked into the problem a little deeper
and found that org.apache.mahout.utils.vectors.lucene.Driver was writing
out LongWritable keys instead of IntWritable keys, so I just changed the
code in there.  Should this code be using IntWritable or LongWritable
keys?
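For reference, the conversion in question is just a narrowing of the long
Lucene/Solr document id down to the int that IntWritable carries.  A
minimal, self-contained sketch (a hypothetical helper, not Mahout code)
with an explicit range check, so oversized ids fail loudly instead of
silently wrapping:

```java
// Hypothetical helper, not part of Mahout: narrow a long doc id to the
// int range that IntWritable expects, failing fast on overflow.
public class DocIdNarrowing {

    static int toIntDocId(long docId) {
        if (docId < Integer.MIN_VALUE || docId > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("doc id out of int range: " + docId);
        }
        return (int) docId;  // safe: value proven to fit in an int
    }

    public static void main(String[] args) {
        System.out.println(toIntDocId(42L));   // prints 42
        try {
            toIntDocId(1L << 40);              // too big for an int
        } catch (IllegalArgumentException e) {
            System.out.println("overflow detected");
        }
    }
}
```

Since Solr doc ids are sequential and start small, the check should never
fire in practice, but it turns a silent data-corruption bug into an
immediate error.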
I managed to get the similarity matrix to be written to disk but I'm not
at all sure about the results.  My original input was 3 solr documents:

id1: A A B C
id2: B D D
id3: A B B E

After writing them to a sequence file and running your matrix
transposition and multiplication, I get an output called part-00000.  If
I read it using

$ mahout seqdumper --seqFile part-00000

then it outputs:

Input Path: part-00000
Key class: class org.apache.hadoop.io.IntWritable
Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: org.apache.mahout.math.vectorwrita...@288051
Key: 1: Value: org.apache.mahout.math.vectorwrita...@288051
Key: 2: Value: org.apache.mahout.math.vectorwrita...@288051
Count: 3

Is this what is to be expected?

Thanks,
Kris

2010/6/10 Jake Mannix <[email protected]>

> Yeah, you simply can't cast between IntWritable and LongWritable, sadly.
> You need to convert your Long document ids to Integer.  Since you're
> pulling documents from Solr, the docIds should be sequential and start
> small, in which case they're all well under Integer.MAX_VALUE, and so a
> trivial MapReduce (well, Map, no Reduce) job with a Mapper like this
> should work:
>
> public class M extends Mapper<LongWritable, Writable, IntWritable, Writable> {
>   private final IntWritable i = new IntWritable(0);
>
>   public void map(LongWritable key, Writable value, Context c)
>       throws IOException, InterruptedException {
>     i.set((int) key.get());
>     c.write(i, value);
>   }
> }
>
> Run that over your input first, and you should be set.
>
>   -jake
>
> On Thu, Jun 10, 2010 at 7:20 AM, Kris Jack <[email protected]> wrote:
>
> > Got a little further by making some more class changes...
> > //
> > public class GenSimMatrixJob extends AbstractJob {
> >
> >   public GenSimMatrixJob() {
> >   }
> >
> >   @Override
> >   public int run(String[] strings) throws Exception {
> >     addOption("numDocs", "nd", "Number of documents in the input");
> >     addOption("numTerms", "nt", "Number of terms in the input");
> >
> >     Map<String,String> parsedArgs = parseArguments(strings);
> >     if (parsedArgs == null) {
> >       // FIXME
> >       return 0;
> >     }
> >
> >     Configuration originalConf = getConf();
> >     String inputPathString = originalConf.get("mapred.input.dir");
> >     String outputTmpPathString = parsedArgs.get("--tempDir");
> >     int numDocs = Integer.parseInt(parsedArgs.get("--numDocs"));
> >     int numTerms = Integer.parseInt(parsedArgs.get("--numTerms"));
> >
> >     DistributedRowMatrix text = new DistributedRowMatrix(inputPathString,
> >         outputTmpPathString, numDocs, numTerms);
> >     text.configure(new JobConf(getConf()));
> >
> >     DistributedRowMatrix transpose = text.transpose();
> >     DistributedRowMatrix similarity = transpose.times(transpose);
> >
> >     System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> >
> >     return 1;
> >   }
> >
> >   public static void main(String[] args) throws Exception {
> >     ToolRunner.run(new GenSimMatrixJob(), args);
> >   }
> > }
> > //
> >
> > Giving the error...
> >
> > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > 10-Jun-2010 15:16:28 org.apache.hadoop.metrics.jvm.JvmMetrics init
> > INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.FileInputFormat listStatus
> > INFO: Total input paths to process : 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: Running job: job_local_0001
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.FileInputFormat listStatus
> > INFO: Total input paths to process : 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.util.NativeCodeLoader <clinit>
> > WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 10-Jun-2010 15:16:28 org.apache.hadoop.io.compress.CodecPool getDecompressor
> > INFO: Got brand-new decompressor
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.MapTask runOldMapper
> > INFO: numReduceTasks: 1
> > 10-Jun-2010 15:16:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: io.sort.mb = 100
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: data buffer = 79691776/99614720
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > INFO: record buffer = 262144/327680
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.LocalJobRunner$Job run
> > WARNING: job_local_0001
> > java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
> >     at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: map 0% reduce 0%
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: Job complete: job_local_0001
> > 10-Jun-2010 15:16:29 org.apache.hadoop.mapred.Counters log
> > INFO: Counters: 0
> >
> > 2010/6/10 Kris Jack <[email protected]>
> >
> > > In the attempt to create a document-document similarity matrix, I am
> > > getting the following error:
> > >
> > > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.metrics.jvm.JvmMetrics init
> > > INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > > WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.JobClient configureCommandLineOptions
> > > WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> > > 10-Jun-2010 13:25:04 org.apache.hadoop.mapred.FileInputFormat listStatus
> > > INFO: Total input paths to process : 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: Running job: job_local_0001
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.FileInputFormat listStatus
> > > INFO: Total input paths to process : 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.util.NativeCodeLoader <clinit>
> > > WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.io.compress.CodecPool getDecompressor
> > > INFO: Got brand-new decompressor
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask runOldMapper
> > > INFO: numReduceTasks: 1
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: io.sort.mb = 100
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: data buffer = 79691776/99614720
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> > > INFO: record buffer = 262144/327680
> > > 10-Jun-2010 13:25:05 org.apache.hadoop.mapred.LocalJobRunner$Job run
> > > WARNING: job_local_0001
> > > java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
> > >     at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> > >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: map 0% reduce 0%
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > > INFO: Job complete: job_local_0001
> > > 10-Jun-2010 13:25:06 org.apache.hadoop.mapred.Counters log
> > > INFO: Counters: 0
> > > Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Job failed!
> > >     at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:163)
> > >     at org.apache.mahout.math.hadoop.GenSimMatrixLocal.generateMatrix(GenSimMatrixLocal.java:24)
> > >     at org.apache.mahout.math.hadoop.GenSimMatrixLocal.main(GenSimMatrixLocal.java:34)
> > > Caused by: java.io.IOException: Job failed!
> > >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> > >     at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:158)
> > >     ... 2 more
> > >
> > > I created a test solr index with 3 documents and generated a sparse
> > > feature matrix out of it using mahout's
> > > org.apache.mahout.utils.vectors.lucene.Driver.
> > >
> > > I then ran the following code using the sparse feature matrix as input
> > > (mahoutIndexTFIDF.vec).
> > >
> > > public class GenSimMatrixLocal {
> > >
> > >   private void generateMatrix() {
> > >     String inputPath = "/home/kris/data/mahoutIndexTFIDF.vec";
> > >     String tmpPath = "/tmp/matrixMultiplySpace";
> > >     int numDocuments = 3;
> > >     int numTerms = 4;
> > >
> > >     DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
> > >         tmpPath, numDocuments, numTerms);
> > >
> > >     JobConf conf = new JobConf("similarity job");
> > >     text.configure(conf);
> > >
> > >     DistributedRowMatrix transpose = text.transpose();
> > >     DistributedRowMatrix similarity = transpose.times(transpose);
> > >
> > >     System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> > >   }
> > >
> > >   public static void main(String[] args) {
> > >     GenSimMatrixLocal similarity = new GenSimMatrixLocal();
> > >     similarity.generateMatrix();
> > >   }
> > > }
> > >
> > > Anyone see why there is a problem between LongWritable and IntWritable
> > > casting?  Does it need to be configured differently?
> > > Thanks,
> > > Kris
> >
> > --
> > Dr Kris Jack,
> > http://www.mendeley.com/profiles/kris-jack/
>

-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
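A quick way to check whether the 3x3 output above is plausible: on a
document-term matrix A, transposing and then multiplying as in the code
amounts to computing A * A^T, whose (i, j) entry is the dot product of
documents i and j.  A self-contained sketch over the toy corpus (id1: A A
B C, id2: B D D, id3: A B B E), using raw term counts rather than the
TF-IDF weights the actual run used, so the exact numbers will differ but
the shape and symmetry should match:

```java
// Sanity check for the toy corpus: compute A * A^T directly, where A is
// the 3x5 document-term count matrix.  Entry (i, j) is the dot product
// of documents i and j; the diagonal holds each document's self-similarity.
public class ToySimilarity {

    public static int[][] similarity(int[][] a) {
        int docs = a.length, terms = a[0].length;
        int[][] s = new int[docs][docs];
        for (int i = 0; i < docs; i++)
            for (int j = 0; j < docs; j++)
                for (int t = 0; t < terms; t++)
                    s[i][j] += a[i][t] * a[j][t];   // dot product of rows i and j
        return s;
    }

    public static void main(String[] args) {
        // Rows: id1, id2, id3; columns: counts of terms A, B, C, D, E.
        int[][] a = {
            {2, 1, 1, 0, 0},   // id1: A A B C
            {0, 1, 0, 2, 0},   // id2: B D D
            {1, 2, 0, 0, 1},   // id3: A B B E
        };
        for (int[] row : similarity(a))
            System.out.println(java.util.Arrays.toString(row));
        // prints:
        // [6, 1, 4]
        // [1, 5, 2]
        // [4, 2, 6]
    }
}
```

So three rows keyed 0, 1, 2, each holding a 3-element vector, is exactly
the shape to expect; seqdumper just prints VectorWritable's default
toString, which is why the values look opaque.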
