Hi Kris,

I think the best way would be to manually join the names to the result after executing the job.
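For example, something along these lines (a rough sketch in plain Java, outside of Hadoop; the index-to-name dictionary and the parsed job output are assumed inputs here, however you happen to load them — `NameJoiner` is just an illustrative name, not Mahout code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: join NamedVector names back onto RowSimilarityJob output.
// The job keys rows by int index, so keep an index -> name dictionary
// from when the input vectors were written, then translate the results.
public class NameJoiner {

  // rowIndex -> document name (built while creating the job's input)
  private final Map<Integer, String> names;

  public NameJoiner(Map<Integer, String> names) {
    this.names = names;
  }

  // similarities: rowIndex -> (otherRowIndex -> similarity score),
  // i.e. the job's output after parsing, modelled here as a map of maps
  public Map<String, Map<String, Double>> join(
      Map<Integer, Map<Integer, Double>> similarities) {
    Map<String, Map<String, Double>> named = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : similarities.entrySet()) {
      Map<String, Double> namedRow = new HashMap<>();
      for (Map.Entry<Integer, Double> entry : row.getValue().entrySet()) {
        namedRow.put(names.get(entry.getKey()), entry.getValue());
      }
      named.put(names.get(row.getKey()), namedRow);
    }
    return named;
  }
}
```

The only real work is persisting the index-to-name mapping when the input vectors are created, so it can be looked up afterwards.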
--sebastian

Am 02.07.2010 18:22, schrieb Kris Jack:
> Hi Sebastian,
>
> I am currently using your code with NamedVectors in my input. In the
> output, however, the names seem to be missing. Would there be a way to
> include them?
>
> Thanks,
> Kris
>
> 2010/6/29 Sebastian Schelter <[email protected]>
>
>> Hi Kris,
>>
>> I'm glad I could help you and it's really cool that you are testing my
>> patches on real data. I'm looking forward to hearing more!
>>
>> -sebastian
>>
>> Am 29.06.2010 11:25, schrieb Kris Jack:
>>
>>> Hi Sebastian,
>>>
>>> You really are very kind! I have taken your code and run it to print
>>> out the contents of the output file. There are indeed only 37,952
>>> results, so that gives me more confidence in the vector dumper. I'm
>>> not sure why there was a memory problem, though, seeing as it seems
>>> to have output the results correctly. Now I just have to match them
>>> up with my original Lucene ids and see how it is performing. I'll
>>> keep you posted with the results.
>>>
>>> Thanks,
>>> Kris
>>>
>>> 2010/6/28 Sebastian Schelter <[email protected]>
>>>
>>>> Hi Kris,
>>>>
>>>> Unfortunately I'm not familiar with the VectorDumper code (and a
>>>> quick look didn't help either), so I can't help you with the
>>>> OutOfMemoryError.
>>>>
>>>> It is possible that only 37,952 results are found for an input of
>>>> 500,000 vectors; it really depends on the actual data. If you're
>>>> sure that there should be more results, you could provide me with a
>>>> sample input file and I'll try to find out why there aren't more.
>>>>
>>>> I wrote a small class for you that dumps the output file of the job
>>>> to the console (I tested it with the output of my unit tests); maybe
>>>> that can help us find the source of the problem.
>>>>
>>>> -sebastian
>>>>
>>>> public class MatrixReader extends AbstractJob {
>>>>
>>>>   public static void main(String[] args) throws Exception {
>>>>     ToolRunner.run(new MatrixReader(), args);
>>>>   }
>>>>
>>>>   @Override
>>>>   public int run(String[] args) throws Exception {
>>>>
>>>>     addInputOption();
>>>>
>>>>     Map<String,String> parsedArgs = parseArguments(args);
>>>>     if (parsedArgs == null) {
>>>>       return -1;
>>>>     }
>>>>
>>>>     Configuration conf = getConf();
>>>>     FileSystem fs = FileSystem.get(conf);
>>>>
>>>>     Path vectorFile = fs.listStatus(getInputPath(),
>>>>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
>>>>
>>>>     SequenceFile.Reader reader = null;
>>>>     try {
>>>>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
>>>>       IntWritable key = new IntWritable();
>>>>       VectorWritable value = new VectorWritable();
>>>>
>>>>       while (reader.next(key, value)) {
>>>>         System.out.print(String.valueOf(key.get()) + ": ");
>>>>         Iterator<Element> elementsIterator = value.get().iterateNonZero();
>>>>         String separator = "";
>>>>         while (elementsIterator.hasNext()) {
>>>>           Element element = elementsIterator.next();
>>>>           System.out.print(separator + String.valueOf(element.index())
>>>>               + "," + String.valueOf(element.get()));
>>>>           separator = ";";
>>>>         }
>>>>         System.out.print("\n");
>>>>       }
>>>>     } finally {
>>>>       if (reader != null) {
>>>>         reader.close();
>>>>       }
>>>>     }
>>>>     return 0;
>>>>   }
>>>> }
>>>>
>>>> Am 28.06.2010 17:18, schrieb Kris Jack:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am now using the version of
>>>>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that
>>>>> Sebastian has written and has been added to the trunk. Thanks again
>>>>> for that! I can generate an output file that should contain a list
>>>>> of documents with their top 100 most similar documents. I am having
>>>>> problems, however, in converting the output file into a readable
>>>>> format using mahout's vectordump:
>>>>>
>>>>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
>>>>>
>>>>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
>>>>> Input Path: /home/kris/similarRows
>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
>>>>>   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>>>>>   at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
>>>>>   at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>>>>>
>>>>> What is this doing that takes up so much memory? A file is produced
>>>>> with 37,952 readable rows, but I'm expecting more like 500,000
>>>>> results, since I have that number of documents. Should I be using
>>>>> something else to read the output file of the RowSimilarityJob?
>>>>>
>>>>> Thanks,
>>>>> Kris
>>>>>
>>>>> 2010/6/18 Sebastian Schelter <[email protected]>
>>>>>
>>>>>> Hi Kris,
>>>>>>
>>>>>> maybe you want to give the patch from
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
>>>>>> yet tested it with larger data, but I would be happy to get some
>>>>>> feedback on it and maybe it helps you with your use case.
>>>>>>
>>>>>> -sebastian
>>>>>>
>>>>>> Am 18.06.2010 18:46, schrieb Kris Jack:
>>>>>>
>>>>>>> Thanks Ted,
>>>>>>>
>>>>>>> I got that working. Unfortunately, the matrix multiplication job
>>>>>>> is taking far longer than I hoped. With just over 10 million
>>>>>>> documents, 10 mappers and 10 reducers, I can't get it to complete
>>>>>>> the job in under 48 hours. Perhaps you have an idea for speeding
>>>>>>> it up? I have already been quite ruthless with making the vectors
>>>>>>> sparse. I did not include terms that appeared in over 1% of the
>>>>>>> corpus and only kept terms that appeared at least 50 times. Is it
>>>>>>> normal that the matrix multiplication map-reduce task should take
>>>>>>> so long to process with this quantity of data and resources
>>>>>>> available, or do you think that my system is not configured
>>>>>>> properly?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kris
>>>>>>>
>>>>>>> 2010/6/15 Ted Dunning <[email protected]>
>>>>>>>
>>>>>>>> Thresholds are generally dangerous. It is usually preferable to
>>>>>>>> specify the sparseness you want (1%, 0.2%, whatever), sort the
>>>>>>>> results in descending score order using Hadoop's built-in
>>>>>>>> capabilities and just drop the rest.
>>>>>>>>
>>>>>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I was wondering if there was an interesting way to do this with
>>>>>>>>> the current mahout code, such as requesting that the Vector
>>>>>>>>> accumulator returns only elements that have values greater than
>>>>>>>>> a given threshold, sorting the vector by value rather than key,
>>>>>>>>> or something else?
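The top-N approach Ted describes in the quoted reply above can be sketched in plain Java (a minimal illustration only, not Mahout or Hadoop code; `TopNPerRow` and the `{index, score}` pair encoding are made up for the example — the idea is a bounded min-heap that keeps the N highest-scoring entries per row instead of applying a score threshold):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: keep only the top-N entries per row by score.
// A min-heap of size N makes each candidate cost O(log N),
// independent of the total row length.
public class TopNPerRow {

  // entries are {index, score} pairs; returns the n largest by score,
  // highest score first
  public static List<double[]> topN(List<double[]> entries, int n) {
    PriorityQueue<double[]> heap =
        new PriorityQueue<>(n, (a, b) -> Double.compare(a[1], b[1]));
    for (double[] entry : entries) {
      if (heap.size() < n) {
        heap.offer(entry);
      } else if (entry[1] > heap.peek()[1]) {
        heap.poll();      // evict the current smallest score
        heap.offer(entry);
      }
    }
    List<double[]> result = new ArrayList<>(heap);
    result.sort((a, b) -> Double.compare(b[1], a[1]));
    return result;
  }
}
```

This keeps the result size bounded per row regardless of the score distribution, which is why it is safer than a fixed threshold.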
