Hi Pat,

On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
> Interesting.
>
> I have another requirement, which is to do something like real-time,
> vector-based queries. Imagine taking a doc vector, reweighting some terms,
> then doing a query with it, perhaps in a truncated form. There are several
> ways to do this, but only Solr would offer real-time results AFAIK. It
> looks like I could use your approach below to do this. A quick look at
> eDisMax, however, suggests some problems. The use of pf2 and pf3 would jam
> the query vector into synthesized bi- and trigrams, for instance.

The simplistic approach I used was to extract the top 50 terms (with TF*IDF weights) from the target document, then use those terms (with weights as boosts) to do a regular Lucene OR query and request the top 20 hits.

The index I'm searching against has Solr documents with a multi-value field that contains the top 50 terms, generated using the same approach as with the target document. It also contains stored weights for each of those terms. I didn't use payload boosting, but could have to improve the quality of this search; it seemed to be working well enough, and speed was pretty important.

Solr returns a sorted list of hits, and then I do a regular vector similarity calculation between the target and each of these top 20 hits, and select the best one (assuming it passes a similarity threshold).

> I'd be interested in hearing more about how you use it. Is there a better
> venue than the mahout list?

If you'd like more details, that's probably better for an off-list discussion…doesn't feel very Mahout-ish in nature :) Though a discussion of the major problem (how to extract "good" terms from the text) would be very interesting, as I wound up crafting what felt like a kludgy pseudo-NLP solution.
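For concreteness, the re-ranking step above can be sketched like this (a Python illustration with made-up names and a made-up threshold; the real version pulls the stored term weights out of the Solr response):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(target, hits, threshold=0.3):
    """Re-rank the top Solr hits by vector similarity to the target doc.

    target: dict of the target doc's top terms -> TF*IDF weights
    hits:   list of (doc_id, term_weight_dict) for the top N Solr hits
    Returns the best (doc_id, score) at or above the threshold, else None.
    """
    scored = [(doc_id, cosine(target, terms)) for doc_id, terms in hits]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if scored and scored[0][1] >= threshold:
        return scored[0]
    return None
```

The query side is then just the same top terms joined into a boosted OR query (something like `hadoop^2.0 OR mahout^1.3 OR ...`), per the description above.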
-- Ken

> On 7/13/12 9:41 PM, Ken Krugler wrote:
>> Hi Pat,
>>
>> On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
>>
>>> I also do clustering, so that's an obvious optimization; I just haven't
>>> gotten to it yet (doing similarity only on docs clustered together). I'm
>>> also trying to decide how to downsample. However, the results from
>>> similarity are quite good, so understanding how to scale is #1.
>>>
>>> Clustering gives docs closest to a centroid. RowSimilarity finds docs
>>> similar to each doc.
>>>
>>> What I really need is to calculate the k most similar docs to a short
>>> list, known ahead of time. I don't know of an algorithm to do this (other
>>> than brute force). It would take a relatively small set of docs and find
>>> similar docs in a much, much larger set. RowSimilarity finds all pairwise
>>> similarities; strictly speaking, I need only a tiny number of those.
>>>
>>> I think Lucene has a weighted-vector-based search that I need to
>>> investigate further.
>> As one point of reference, I've used Solr (Lucene) to do this, by taking
>> the set of features (small, heavily reduced) from the target doc, using
>> them (with weights) via edismax to find some top N candidate documents in
>> the Lucene index, which I'd built using the same approach (small set of
>> features), and then calculating pairwise similarity to rank the results.
>>
>> -- Ken
>>
>>> On 7/13/12 9:32 AM, Sebastian Schelter wrote:
>>>> Pat,
>>>>
>>>> RowSimilarityJob compares all pairs of rows, which is by definition a
>>>> quadratic and therefore non-scalable problem. The comparison is,
>>>> however, done in a way that only rows that have at least one non-zero
>>>> value in a common dimension are compared.
>>>>
>>>> Therefore, if you have certain sparse types of input, such as ratings,
>>>> you only have to look at a relatively small number of pairs and the
>>>> comparison scales.
>>>>
>>>> RowSimilarityJob is mainly used for the collaborative filtering stuff
>>>> in Mahout. We have a special job to prepare the data
>>>> (PreparePreferenceMatrixJob) that will take care of sampling down
>>>> entries in the rating matrix that might cause too many cooccurrences.
>>>>
>>>> If you directly use RowSimilarityJob, you have to ensure that your
>>>> input data is of a shape suitable for the job. It seems to me that this
>>>> is not the case: you created 76GB of intermediate output (cooccurring
>>>> terms) from 35k documents, so it's clear that it takes Hadoop a long
>>>> time to sort that in the shuffle phase.
>>>>
>>>> My advice would be that you either take a deeper look at your data and
>>>> try to downsample highly frequent terms more, or that you take a look
>>>> at other techniques, such as clustering or locality-sensitive hashing,
>>>> to find similar documents.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 13.07.2012 18:03, Pat Ferrel wrote:
>>>>> I increased the timeout to 100 minutes and added another machine (does
>>>>> the new machine matter in this case?). The job completed successfully.
>>>>>
>>>>> You say the algorithm is non-scalable; did you mean it's not
>>>>> parallelizable? I assume I'll need to keep increasing this limit?
>>>>>
>>>>> I'm sure you know better than I that it is not really good for the
>>>>> efficiency of a cluster to increase the timeout so far, since it means
>>>>> jobs can take much longer in the case of transient task failures.
>>>>>
>>>>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>>>>> OK, thanks. I haven't checked for sparsity. However, I have many
>>>>>> successful runs of rowsimilarity with up to 150,000 docs and 250,000
>>>>>> terms, as I said below. This run has a much smaller matrix. I
>>>>>> understand that sparsity is a different question, but since the data
>>>>>> in all cases is a crawl of the same sites, I'd expect the same
>>>>>> sparsity in all the data sets, whether they succeeded or timed out.
>>>>>>
>>>>>> My issue has nothing to do with the elapsed time, although I'll have
>>>>>> to consider it for larger data sets (thanks for the heads-up). Is it
>>>>>> impossible to check in with the task tracker, avoiding a timeout? Or
>>>>>> is there some other issue?
>>>>>>
>>>>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>>>>> It's important to note that the performance of RowSimilarityJob
>>>>>>> heavily depends on the sparsity of the input data, because in
>>>>>>> general comparing all pairs of things is a quadratic (non-scalable)
>>>>>>> problem.
>>>>>>>
>>>>>>> 2012/7/12 Sebastian Schelter <[email protected]>:
>>>>>>>> Sorry, I overread that it's more than one machine. Could you
>>>>>>>> provide the values for the counters from RowSimilarityJob (ROWS,
>>>>>>>> COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>> Thanks, actually there are two machines. I am testing before
>>>>>>>>> spending on AWS. It's failing the test in this case.
>>>>>>>>>
>>>>>>>>> BTW, I ran the same setup with 150,000 docs and 250,000 terms with
>>>>>>>>> a much lower timeout (30000000) and all worked fine. I was using
>>>>>>>>> 0.6 at the time, and I'm not sure if 0.8 has ever completed a
>>>>>>>>> rowsimilarity of any size. Small runs work fine on my laptop.
>>>>>>>>>
>>>>>>>>> I smell some kind of problem other than simple performance. In any
>>>>>>>>> case, in a perfect world isn't the code supposed to check in often
>>>>>>>>> enough that the cluster config doesn't need to be tweaked for a
>>>>>>>>> specific job?
>>>>>>>>>
>>>>>>>>> It may be some problem of mine, of course. I see no obvious Hadoop
>>>>>>>>> or Mahout errors, but there are many places to look.
>>>>>>>>>
>>>>>>>>> With a 100-minute timeout I am currently at the pause between map
>>>>>>>>> and reduce. If it fails, would you like any specific logs?
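Sebastian's point about the quadratic cost is easy to quantify: a term that occurs in d documents generates d*(d-1)/2 cooccurring document pairs, which is presumably why a handful of high-frequency terms can dominate that COOCCURRENCES counter. A back-of-the-envelope sketch (illustrative numbers only, not from the data in this thread):

```python
def pairs_per_term(doc_freq):
    """Candidate doc pairs generated by one term appearing in doc_freq docs."""
    return doc_freq * (doc_freq - 1) // 2

# A rare term is cheap; a common one is not:
rare = pairs_per_term(10)        # 45 pairs
common = pairs_per_term(10_000)  # 49,995,000 pairs -- roughly 50 million
```

One term present in a third of a 30k-doc corpus already contributes tens of millions of pairs by itself, which is consistent with 35k docs producing 76GB of intermediate output.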
>>>>>>>>>
>>>>>>>>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>>>>>>>> To be honest, I don't think it makes a lot of sense to test a
>>>>>>>>>> Hadoop job on a single machine. It's pretty obvious that you will
>>>>>>>>>> get terrible performance.
>>>>>>>>>>
>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>> BTW, the timeout is 1800, but the task in total runs over 9
>>>>>>>>>>> hours before each failure. This causes the job to take (after
>>>>>>>>>>> three tries) 27 hrs to completely fail. Oh, bother...
>>>>>>>>>>>
>>>>>>>>>>> The timeout seems to be during the last map, when the mappers
>>>>>>>>>>> reach 100% but are still running. Maybe some kind of cleanup is
>>>>>>>>>>> happening? The first reducer is still "pending"; the reducer
>>>>>>>>>>> never gets a chance to start.
>>>>>>>>>>>
>>>>>>>>>>> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
>>>>>>>>>>> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
>>>>>>>>>>> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
>>>>>>>>>>> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
>>>>>>>>>>> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
>>>>>>>>>>> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
>>>>>>>>>>> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
>>>>>>>>>>> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
>>>>>>>>>>> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
>>>>>>>>>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>>>>>>>>>
>>>>>>>>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs,
>>>>>>>>>>> 76,000 terms.
>>>>>>>>>>>
>>>>>>>>>>> Any other info you need, please ask.
>>>>>>>>>>>
>>>>>>>>>>> I'm about to try cranking the timeout up to a couple of hours,
>>>>>>>>>>> but I suspect there is something else going on here.
>>>>>>>>>>>
>>>>>>>>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>>>>>>>> I have a custom Lucene stemming analyzer that filters out stop
>>>>>>>>>>>> words, and I use the following seq2sparse. The -x 40 is the
>>>>>>>>>>>> only other thing that affects tossing frequent terms, and as I
>>>>>>>>>>>> understand things, it tosses any term that appears in over 40%
>>>>>>>>>>>> of the docs.
>>>>>>>>>>>>
>>>>>>>>>>>> mahout seq2sparse \
>>>>>>>>>>>>   -i b2/seqfiles/ \
>>>>>>>>>>>>   -o b2/vectors/ \
>>>>>>>>>>>>   -ow \
>>>>>>>>>>>>   -chunk 2000 \
>>>>>>>>>>>>   -x 40 \
>>>>>>>>>>>>   -seq \
>>>>>>>>>>>>   -n 2 \
>>>>>>>>>>>>   -nv \
>>>>>>>>>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Have you removed highly frequent terms before launching the
>>>>>>>>>>>>> rowsimilarity job?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>>>>>>>> I've been trying to get a rowsimilarity job to complete. It
>>>>>>>>>>>>>> continues to time out on a
>>>>>>>>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've
>>>>>>>>>>>>>> upped the timeout to 30 minutes now. There are no errors in
>>>>>>>>>>>>>> the logs that I can see, and no other task I've tried is
>>>>>>>>>>>>>> acting like this. Is this expected? Shouldn't the task check
>>>>>>>>>>>>>> in more often?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's doing 34,000 docs with 40 similar docs each on 8 cores,
>>>>>>>>>>>>>> so it is a bit slow anyway; still, I shouldn't have to turn
>>>>>>>>>>>>>> the timeout up so high, should I?
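For what it's worth, the -x 40 behavior described above (toss any term appearing in more than 40% of the docs) amounts to document-frequency pruning. A standalone sketch of the idea, not Mahout's actual implementation:

```python
from collections import Counter

def prune_frequent_terms(docs, max_df_percent=40):
    """Drop any term appearing in more than max_df_percent of the docs.

    docs: list of sets of terms (one set per document).
    Returns the docs with over-frequent terms removed.
    """
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for doc in docs for term in doc)
    cutoff = len(docs) * max_df_percent / 100.0
    keep = {term for term, count in df.items() if count <= cutoff}
    return [doc & keep for doc in docs]
```

A term surviving this cut can still be frequent enough to dominate the cooccurrence pair counts, which may be why -x 40 alone isn't always sufficient downsampling.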
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
