Lucene's MoreLikeThis feature does cosine distance (I think) directly
against term vectors.
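
Strictly speaking, I believe MLT pulls the top TF-IDF terms out of the
stored term vector and issues them as a boosted query, so the scoring is
Lucene's usual TF-IDF (cosine-like) similarity. A rough sketch against a
Lucene 3.x index ("body", directory and docId are placeholders, inside a
method that throws IOException):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;

    // Build a "more like this" query from one doc's stored term vector.
    IndexReader reader = IndexReader.open(directory);
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"body"});
    mlt.setMinTermFreq(1);  // keep terms occurring at least once in the doc
    mlt.setMinDocFreq(2);   // drop terms unique to the source doc
    Query query = mlt.like(docId);  // docId: internal Lucene doc number
    TopDocs similar = new IndexSearcher(reader).search(query, 20);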

On Sat, Jul 14, 2012 at 11:16 AM, Ted Dunning <[email protected]> wrote:
> Solr would do this well.  The upcoming knn package would do it differently
> and for different purposes, but also would do it well.
>
> On Sat, Jul 14, 2012 at 8:17 AM, Pat Ferrel <[email protected]> wrote:
>
>> Interesting.
>>
>> I have another requirement, which is to do something like real-time
>> vector-based queries. Imagine taking a doc vector, reweighting some terms,
>> then doing a query with it, perhaps in a truncated form. There are several
>> ways to do this, but only Solr would offer real-time results afaik. It
>> looks like I could use your approach below to do this. A quick look at
>> eDisMax, however, suggests some problems. The use of pf2 and pf3 would jam
>> the query vector into synthesized bigrams and trigrams, for instance.
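>>
>> Roughly what I have in mind for the reweight-and-truncate step (a sketch
>> against the Mahout 0.7 Vector API; docVector, the boosts map, and the
>> cutoff of 50 are made up):
>>
>>     import java.util.*;
>>     import org.apache.mahout.math.Vector;
>>
>>     // Reweight selected terms, then keep only the heaviest k entries
>>     // so the resulting query stays cheap enough for real time.
>>     List<double[]> weighted = new ArrayList<double[]>();
>>     Iterator<Vector.Element> it = docVector.iterateNonZero();
>>     while (it.hasNext()) {
>>       Vector.Element e = it.next();
>>       double w = e.get();
>>       Double boost = boosts.get(e.index());  // Map<Integer, Double>
>>       if (boost != null) w *= boost;
>>       weighted.add(new double[] { e.index(), w });
>>     }
>>     Collections.sort(weighted, new Comparator<double[]>() {
>>       public int compare(double[] a, double[] b) {
>>         return Double.compare(Math.abs(b[1]), Math.abs(a[1]));
>>       }
>>     });
>>     // Each surviving entry: {term index in the dictionary, new weight}.
>>     List<double[]> queryTerms =
>>         weighted.subList(0, Math.min(50, weighted.size()));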
>>
>> I'd be interested in hearing more about how you use it. Is there a better
>> venue than the Mahout list?
>>
>> On 7/13/12 9:41 PM, Ken Krugler wrote:
>>
>>> Hi Pat,
>>>
>>> On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
>>>
>>>> I also do clustering, so that's an obvious optimization I just haven't
>>>> gotten to yet (computing similarity only on docs clustered together). I'm
>>>> also trying to decide how to downsample. However, the results from
>>>> similarity are quite good, so understanding how to scale is #1.
>>>>
>>>> Clustering gives docs closest to a centroid. RowSimilarity finds docs
>>>> similar to each doc.
>>>>
>>>> What I really need is to calculate the k most similar docs to a short
>>>> list, known ahead of time. I don't know of an algorithm to do this (other
>>>> than brute force). It would take a relatively small set of docs and find
>>>> similar docs in a much, much larger set. RowSimilarity finds all pair-wise
>>>> similarities; strictly speaking, I need only a tiny number of those.
>>>>
>>>> I think Lucene has a weighted vector-based search that I need to
>>>> investigate further.
>>>>
>>> As one point of reference, I've used Solr (Lucene) to do this: take the
>>> set of features (small, heavily reduced) from the target doc, use them
>>> (with weights) via edismax to find some top-N candidate documents in a
>>> Lucene index built with the same approach (a small feature set per doc),
>>> and then calculate pair-wise similarity to rank the results.
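>>>
>>> In SolrJ terms, the query step might look roughly like this (a sketch,
>>> inside a method that throws Exception; the field name "features" and the
>>> terms/weights are illustrative, and pf2/pf3 are left unset so no phrase
>>> fields kick in):
>>>
>>>     import org.apache.solr.client.solrj.SolrQuery;
>>>     import org.apache.solr.client.solrj.SolrServer;
>>>     import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>>>     import org.apache.solr.client.solrj.response.QueryResponse;
>>>
>>>     // Each reduced feature of the target doc becomes one boosted term;
>>>     // edismax understands the standard "term^boost" syntax.
>>>     SolrServer solr =
>>>         new CommonsHttpSolrServer("http://localhost:8983/solr");
>>>     SolrQuery query = new SolrQuery("hadoop^3.1 mapreduce^2.4 shuffle^1.2");
>>>     query.set("defType", "edismax");
>>>     query.set("qf", "features");  // field holding the reduced feature set
>>>     query.setRows(200);           // top-N candidates to re-rank pair-wise
>>>     QueryResponse rsp = solr.query(query);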
>>>
>>> -- Ken
>>>
>>>  On 7/13/12 9:32 AM, Sebastian Schelter wrote:
>>>>
>>>>> Pat,
>>>>>
>>>>> RowSimilarityJob compares all pairs of rows, which is by definition a
>>>>> quadratic and therefore non-scalable problem. The comparison is,
>>>>> however, done in such a way that only rows that have at least one
>>>>> non-zero value in a common dimension are compared.
>>>>>
>>>>> Therefore, if you have certain sparse types of input, such as ratings,
>>>>> you only have to look at a relatively small number of pairs and the
>>>>> comparison scales.
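>>>>>
>>>>> To make the blow-up concrete: a term that occurs in n rows generates
>>>>> n*(n-1)/2 cooccurring pairs, so a single term present in 10,000
>>>>> documents contributes roughly 50 million pairs by itself, while a term
>>>>> present in 10 documents contributes just 45.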
>>>>>
>>>>> RowSimilarityJob is mainly used for the collaborative filtering stuff
>>>>> in Mahout. We have a special job to prepare the data
>>>>> (PreparePreferenceMatrixJob) that takes care of sampling down entries
>>>>> in the rating matrix that might cause too many cooccurrences.
>>>>>
>>>>> If you directly use RowSimilarityJob, you have to ensure that your
>>>>> input data is of a shape suitable for the job. It seems to me that this
>>>>> is not the case: you created 76GB of intermediate output (cooccurring
>>>>> terms) from 35k documents, so it's clear that it takes Hadoop a long
>>>>> time to sort that in the shuffle phase.
>>>>>
>>>>> My advice would be to either take a deeper look at your data and try to
>>>>> downsample highly frequent terms more aggressively, or to look at other
>>>>> techniques, such as clustering or locality-sensitive hashing, to find
>>>>> similar documents.
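>>>>>
>>>>> As a flavor of the LSH option, here is a minimal random-hyperplane
>>>>> sketch against the Mahout Vector API (the bit count, dimensionality,
>>>>> and class name are made up):
>>>>>
>>>>>     import java.util.Iterator;
>>>>>     import java.util.Random;
>>>>>     import org.apache.mahout.math.Vector;
>>>>>
>>>>>     public class CosineLsh {
>>>>>       private final double[][] planes;  // one hyperplane per bit
>>>>>
>>>>>       public CosineLsh(int bits, int dim, long seed) {
>>>>>         Random rnd = new Random(seed);
>>>>>         planes = new double[bits][dim];
>>>>>         for (double[] p : planes)
>>>>>           for (int i = 0; i < dim; i++)
>>>>>             p[i] = rnd.nextGaussian();
>>>>>       }
>>>>>
>>>>>       // Docs with high cosine similarity tend to share signatures,
>>>>>       // so pair-wise comparison is only needed within a bucket.
>>>>>       public int signature(Vector v) {
>>>>>         int sig = 0;
>>>>>         for (int b = 0; b < planes.length; b++) {
>>>>>           double dot = 0;
>>>>>           Iterator<Vector.Element> it = v.iterateNonZero();
>>>>>           while (it.hasNext()) {
>>>>>             Vector.Element e = it.next();
>>>>>             dot += e.get() * planes[b][e.index()];
>>>>>           }
>>>>>           if (dot >= 0) sig |= 1 << b;
>>>>>         }
>>>>>         return sig;
>>>>>       }
>>>>>     }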
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>>
>>>>> On 13.07.2012 18:03, Pat Ferrel wrote:
>>>>>
>>>>>> I increased the timeout to 100 minutes and added another machine (does
>>>>>> the new machine matter in this case?). The job completed successfully.
>>>>>>
>>>>>> You say the algorithm is non-scalable--did you mean it's not
>>>>>> parallelizable? I assume I'll need to keep increasing this limit?
>>>>>>
>>>>>> I'm sure you know better than I do that it is not really good for the
>>>>>> efficiency of a cluster to increase the timeout so far, since it means
>>>>>> jobs can take much longer in the case of transient task failures.
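>>>>>>
>>>>>> For reference, the knob I'm turning is mapred.task.timeout
>>>>>> (milliseconds in Hadoop 1.x; 0 disables it). Since the Mahout drivers
>>>>>> run through ToolRunner, I believe it can also be set per job:
>>>>>>
>>>>>>     mahout rowsimilarity -Dmapred.task.timeout=6000000 ...
>>>>>>
>>>>>> or cluster-wide in mapred-site.xml:
>>>>>>
>>>>>>     <property>
>>>>>>       <name>mapred.task.timeout</name>
>>>>>>       <value>6000000</value>
>>>>>>     </property>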
>>>>>>
>>>>>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>>>>>
>>>>>>> OK, thanks. I haven't checked for sparsity. However, I have many
>>>>>>> successful runs of rowsimilarity with up to 150,000 docs and 250,000
>>>>>>> terms, as I said below. This run has a much smaller matrix. I
>>>>>>> understand that sparsity is a different question, but since the data
>>>>>>> in all cases is a crawl of the same sites, I'd expect the same
>>>>>>> sparsity in all the data sets, whether they succeeded or timed out.
>>>>>>>
>>>>>>> My issue has nothing to do with the elapsed time, although I'll have
>>>>>>> to consider it in larger data sets (thanks for the heads up). Is it
>>>>>>> impossible for the task to check in with the task tracker, avoiding a
>>>>>>> timeout? Or is there some other issue?
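>>>>>>>
>>>>>>> (My understanding is that a long-running task can keep itself alive
>>>>>>> by calling progress() on its context between records, roughly like
>>>>>>> the sketch below; the class and types are made up for illustration.)
>>>>>>>
>>>>>>>     import java.io.IOException;
>>>>>>>     import org.apache.hadoop.io.Text;
>>>>>>>     import org.apache.hadoop.mapreduce.Mapper;
>>>>>>>
>>>>>>>     public class SlowMapper extends Mapper<Text, Text, Text, Text> {
>>>>>>>       @Override
>>>>>>>       protected void map(Text key, Text value, Context ctx)
>>>>>>>           throws IOException, InterruptedException {
>>>>>>>         // ... expensive per-record work here ...
>>>>>>>         ctx.progress();  // heartbeat: resets the task-timeout clock
>>>>>>>       }
>>>>>>>     }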
>>>>>>>
>>>>>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>>>>>
>>>>>>>> It's important to note that the performance of RowSimilarityJob
>>>>>>>> heavily depends on the sparsity of the input data, because in general
>>>>>>>> comparing all pairs of things is a quadratic (non-scalable) problem.
>>>>>>>>
>>>>>>>> 2012/7/12 Sebastian Schelter <[email protected]>:
>>>>>>>>
>>>>>>>>> Sorry, I missed that it's more than one machine. Could you provide
>>>>>>>>> the values of the counters from RowSimilarityJob (ROWS,
>>>>>>>>> COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Thanks, actually there are two machines. I am testing before
>>>>>>>>>> spending on AWS. It's failing the test in this case.
>>>>>>>>>>
>>>>>>>>>> BTW, I ran the same setup with 150,000 docs and 250,000 terms with
>>>>>>>>>> a much lower timeout (30000000) and all worked fine. I was using
>>>>>>>>>> 0.6 at the time and am not sure if 0.8 has ever completed a
>>>>>>>>>> rowsimilarity of any size. Small runs work fine on my laptop.
>>>>>>>>>>
>>>>>>>>>> I smell some kind of problem other than simple performance. In any
>>>>>>>>>> case, in a perfect world, isn't the code supposed to check in often
>>>>>>>>>> enough that the cluster config doesn't need to be tweaked for a
>>>>>>>>>> specific job?
>>>>>>>>>>
>>>>>>>>>> It may be some problem of mine, of course. I see no obvious Hadoop
>>>>>>>>>> or Mahout errors, but there are many places to look.
>>>>>>>>>>
>>>>>>>>>> With a 100-minute timeout I am currently at the pause between map
>>>>>>>>>> and reduce. If it fails, would you like any specific logs?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>>>>>>>>
>>>>>>>>>>> To be honest, I don't think it makes a lot of sense to test a
>>>>>>>>>>> Hadoop
>>>>>>>>>>> job on a single machine. It's pretty obvious that you will get
>>>>>>>>>>> terrible performance.
>>>>>>>>>>>
>>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> BTW, the timeout is 1800, but the task runs over 9 hours in total
>>>>>>>>>>>> before each failure. This causes the job to take 27 hrs (after
>>>>>>>>>>>> three tries) to completely fail. Oh, bother...
>>>>>>>>>>>>
>>>>>>>>>>>> The timeout seems to happen during the last map, when the mappers
>>>>>>>>>>>> have reached 100% but are still running. Maybe some kind of
>>>>>>>>>>>> cleanup is happening? The first reducer is still "pending"; it
>>>>>>>>>>>> never gets a chance to start.
>>>>>>>>>>>>
>>>>>>>>>>>> 12/07/11 11:09:45 INFO mapred.JobClient:  map 92% reduce 0%
>>>>>>>>>>>> 12/07/11 11:11:06 INFO mapred.JobClient:  map 93% reduce 0%
>>>>>>>>>>>> 12/07/11 11:12:51 INFO mapred.JobClient:  map 94% reduce 0%
>>>>>>>>>>>> 12/07/11 11:15:22 INFO mapred.JobClient:  map 95% reduce 0%
>>>>>>>>>>>> 12/07/11 11:18:43 INFO mapred.JobClient:  map 96% reduce 0%
>>>>>>>>>>>> 12/07/11 11:24:32 INFO mapred.JobClient:  map 97% reduce 0%
>>>>>>>>>>>> 12/07/11 11:27:40 INFO mapred.JobClient:  map 98% reduce 0%
>>>>>>>>>>>> 12/07/11 11:30:53 INFO mapred.JobClient:  map 99% reduce 0%
>>>>>>>>>>>> 12/07/11 11:36:35 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>>>>>>>> ---after a very long wait (9hrs or so) insert fail here--->
>>>>>>>>>>>>
>>>>>>>>>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs,
>>>>>>>>>>>> 76,000 terms.
>>>>>>>>>>>>
>>>>>>>>>>>> Any other info you need please ask.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm about to try cranking the timeout up to a couple of hours,
>>>>>>>>>>>> but I suspect there is something else going on here.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I have a custom Lucene stemming analyzer that filters out stop
>>>>>>>>>>>>> words, and I use the following seq2sparse. The -x 40 is the only
>>>>>>>>>>>>> other thing that affects tossing frequent terms, and as I
>>>>>>>>>>>>> understand things, it tosses any term that appears in over 40%
>>>>>>>>>>>>> of the docs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> mahout seq2sparse \
>>>>>>>>>>>>>        -i b2/seqfiles/ \
>>>>>>>>>>>>>        -o b2/vectors/ \
>>>>>>>>>>>>>        -ow \
>>>>>>>>>>>>>        -chunk 2000 \
>>>>>>>>>>>>>        -x 40 \
>>>>>>>>>>>>>        -seq \
>>>>>>>>>>>>>        -n 2 \
>>>>>>>>>>>>>        -nv \
>>>>>>>>>>>>>        -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> have you removed highly frequent terms before launching the
>>>>>>>>>>>>>> rowsimilarity job?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've been trying to get a rowsimilarity job to complete. It
>>>>>>>>>>>>>>> continues to time out on a
>>>>>>>>>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've
>>>>>>>>>>>>>>> upped the timeout to 30 minutes now. There are no errors in
>>>>>>>>>>>>>>> the logs that I can see, and no other task I've tried is
>>>>>>>>>>>>>>> acting like this. Is this expected? Shouldn't the task check
>>>>>>>>>>>>>>> in more often?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores, so
>>>>>>>>>>>>>>> it is a bit slow anyway; still, I shouldn't have to turn the
>>>>>>>>>>>>>>> timeout up so high, should I?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>
>>>>>
>>> --------------------------
>>> Ken Krugler
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Mahout & Solr



-- 
Lance Norskog
[email protected]
