I would call it kinda-cosine distance. There are some intricate normalization factors.
On Sat, Jul 14, 2012 at 5:22 PM, Lance Norskog <[email protected]> wrote:
> Lucene's MoreLikeThis feature does cosine distance (I think) directly
> against term vectors.
>
> On Sat, Jul 14, 2012 at 11:16 AM, Ted Dunning <[email protected]> wrote:
>> Solr would do this well. The upcoming knn package would do it differently
>> and for different purposes, but also would do it well.
>>
>> On Sat, Jul 14, 2012 at 8:17 AM, Pat Ferrel <[email protected]> wrote:
>>> Interesting.
>>>
>>> I have another requirement, which is to do something like real-time
>>> vector-based queries. Imagine taking a doc vector, reweighting some
>>> terms, then doing a query with it, perhaps in a truncated form. There
>>> are several ways to do this, but only Solr would offer real-time
>>> results AFAIK. It looks like I could use your approach below to do
>>> this. A quick look at eDisMax, however, suggests some problems. The use
>>> of pf2 and pf3 would jam the query vector into synthesized bi- and
>>> trigrams, for instance.
>>>
>>> I'd be interested in hearing more about how you use it. Is there a
>>> better venue than the Mahout list?
>>>
>>> On 7/13/12 9:41 PM, Ken Krugler wrote:
>>>> Hi Pat,
>>>>
>>>> On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
>>>>> I also do clustering, so that's an obvious optimization I just
>>>>> haven't gotten to yet (doing similarity only on docs clustered
>>>>> together). I'm also trying to decide how to downsample. However, the
>>>>> results from similarity are quite good, so understanding how to
>>>>> scale is #1.
>>>>>
>>>>> Clustering gives docs closest to a centroid. RowSimilarity finds
>>>>> docs similar to each doc.
>>>>>
>>>>> What I really need is to calculate the k most similar docs to a
>>>>> short list, known ahead of time. I don't know of an algorithm to do
>>>>> this (other than brute force).
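[Editor's note: the brute-force computation Pat describes -- k most similar docs for a short, known list of query docs against a much larger set -- can be sketched as below. This is a minimal in-memory illustration, not a Mahout API; the dict-of-dicts sparse-vector representation is an assumption made for the sketch.]

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_similar(query_docs, corpus, k):
    """For each query doc id, return ids of the k most similar corpus docs.

    Brute force: every query doc is compared against every corpus doc,
    but only for the handful of query docs, not all pairs in the corpus.
    """
    result = {}
    for qid, qvec in query_docs.items():
        scored = ((cosine(qvec, dvec), did)
                  for did, dvec in corpus.items() if did != qid)
        result[qid] = [did for _, did in heapq.nlargest(k, scored)]
    return result
```

For a short query list this is linear in the corpus size per query doc, which is exactly why it sidesteps the all-pairs cost of RowSimilarityJob.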
>>>>> It would take a relatively small set of docs and find similar docs
>>>>> in a much, much larger set. RowSimilarity finds all pairwise
>>>>> similarities. Strictly speaking, I need only a tiny number of those.
>>>>>
>>>>> I think Lucene has a weighted-vector-based search that I need to
>>>>> investigate further.
>>>>
>>>> As one point of reference, I've used Solr (Lucene) to do this, by
>>>> taking the set of features (small, heavily reduced) from the target
>>>> doc, using them (with weights) via edismax to find some top N
>>>> candidate documents in the Lucene index, which I'd built using the
>>>> same approach (small set of features), and then calculating pairwise
>>>> similarity to rank the results.
>>>>
>>>> -- Ken
>>>>
>>>>> On 7/13/12 9:32 AM, Sebastian Schelter wrote:
>>>>>> Pat,
>>>>>>
>>>>>> RowSimilarityJob compares all pairs of rows, which is by definition
>>>>>> a quadratic and therefore non-scalable problem. The comparison is,
>>>>>> however, done in a way that only rows that have at least one
>>>>>> non-zero value in a common dimension are compared.
>>>>>>
>>>>>> Therefore, if you have certain sparse types of input, such as
>>>>>> ratings for example, you only have to look at a relatively small
>>>>>> number of pairs and the comparison scales.
>>>>>>
>>>>>> RowSimilarityJob is mainly used for the collaborative filtering
>>>>>> stuff in Mahout. We have a special job to prepare the data
>>>>>> (PreparePreferenceMatrixJob) that will take care of sampling down
>>>>>> entries in the rating matrix that might cause too many
>>>>>> cooccurrences.
>>>>>>
>>>>>> If you directly use RowSimilarityJob, you have to ensure that your
>>>>>> input data is of a shape suitable for the job. It seems to me that
>>>>>> this is not the case: you created 76GB of intermediate output
>>>>>> (cooccurring terms) from 35k documents, so it's clear that it takes
>>>>>> Hadoop a long time to sort that in the shuffle phase.
>>>>>>
>>>>>> My advice would be that you either take a deeper look at your data
>>>>>> and try to downsample highly frequent terms more, or that you take
>>>>>> a look at other techniques such as clustering or locality-sensitive
>>>>>> hashing to find similar documents.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> On 13.07.2012 18:03, Pat Ferrel wrote:
>>>>>>> I increased the timeout to 100 minutes and added another machine
>>>>>>> (does the new machine matter in this case?). The job completed
>>>>>>> successfully.
>>>>>>>
>>>>>>> You say the algorithm is non-scalable -- did you mean it's not
>>>>>>> parallelizable? I assume I'll need to keep increasing this limit?
>>>>>>>
>>>>>>> I'm sure you know better than I that it is not really good for the
>>>>>>> efficiency of a cluster to increase the timeout so far, since it
>>>>>>> means jobs can take much longer in the case of transient task
>>>>>>> failures.
>>>>>>>
>>>>>>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>>>>>>> OK, thanks. I haven't checked for sparsity. However, I have many
>>>>>>>> successful runs of rowsimilarity with up to 150,000 docs and
>>>>>>>> 250,000 terms, as I said below. This run has a much smaller
>>>>>>>> matrix. I understand that sparsity is a different question, but
>>>>>>>> anyway, since the data in all cases is a crawl of the same sites,
>>>>>>>> I'd expect the same sparsity in all the data sets, whether they
>>>>>>>> succeeded or timed out.
>>>>>>>>
>>>>>>>> My issue has nothing to do with the elapsed time, although I'll
>>>>>>>> have to consider it in larger data sets (thanks for the heads
>>>>>>>> up). Is it impossible to check in with the task tracker, avoiding
>>>>>>>> a timeout? Or is there some other issue?
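[Editor's note: Sebastian's downsampling advice follows from a simple count: a term that appears in df documents contributes on the order of df*(df-1)/2 cooccurring pairs, so a handful of very frequent terms can dominate the intermediate output. A rough sketch of that arithmetic and a df-based pruning pass (illustrative only, not Mahout's actual implementation; this expresses the cutoff as a fraction, while seq2sparse's -x takes a percentage):]

```python
def doc_frequencies(docs):
    """docs: {doc_id: set of terms} -> {term: number of docs containing it}."""
    df = {}
    for terms in docs.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1
    return df

def cooccurrence_pairs(docs):
    """Total pairwise comparisons implied by the corpus:
    sum over terms of df*(df-1)/2."""
    return sum(n * (n - 1) // 2 for n in doc_frequencies(docs).values())

def prune_frequent_terms(docs, max_df_fraction):
    """Drop any term appearing in more than max_df_fraction of the docs,
    the same idea as seq2sparse's -x option."""
    df = doc_frequencies(docs)
    limit = max_df_fraction * len(docs)
    return {d: {t for t in terms if df[t] <= limit}
            for d, terms in docs.items()}
```

Because the pair count is quadratic in df, pruning even a few high-df terms shrinks the shuffle-phase data far more than it shrinks the vectors themselves.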
>>>>>>>>
>>>>>>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>>>>>>> It's important to note that the performance of RowSimilarityJob
>>>>>>>>> heavily depends on the sparsity of the input data, because in
>>>>>>>>> general, comparing all pairs of things is a quadratic
>>>>>>>>> (non-scalable) problem.
>>>>>>>>>
>>>>>>>>> 2012/7/12 Sebastian Schelter <[email protected]>:
>>>>>>>>>> Sorry, I missed that it's more than one machine. Could you
>>>>>>>>>> provide the values for the counters from RowSimilarityJob
>>>>>>>>>> (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Sebastian
>>>>>>>>>>
>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>> Thanks, actually there are two machines. I am testing before
>>>>>>>>>>> spending on AWS. It's failing the test in this case.
>>>>>>>>>>>
>>>>>>>>>>> BTW, I ran the same setup with 150,000 docs and 250,000 terms
>>>>>>>>>>> with a much lower timeout (30000000); all worked fine. I was
>>>>>>>>>>> using 0.6 at the time, and I'm not sure if 0.8 has ever
>>>>>>>>>>> completed a rowsimilarity of any size. Small runs work fine on
>>>>>>>>>>> my laptop.
>>>>>>>>>>>
>>>>>>>>>>> I smell some kind of problem other than simple performance. In
>>>>>>>>>>> any case, in a perfect world, isn't the code supposed to check
>>>>>>>>>>> in often enough that the cluster config doesn't need to be
>>>>>>>>>>> tweaked for a specific job?
>>>>>>>>>>>
>>>>>>>>>>> It may be some problem of mine, of course. I see no obvious
>>>>>>>>>>> Hadoop or Mahout errors, but there are many places to look.
>>>>>>>>>>>
>>>>>>>>>>> With a 100-minute timeout I am currently at the pause between
>>>>>>>>>>> map and reduce. If it fails, would you like any specific logs?
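[Editor's note: much of the thread revolves around raising the task timeout. Since Mahout's Hadoop drivers run through ToolRunner, the timeout can be raised per job with a generic -D option rather than by editing mapred-site.xml cluster-wide. The invocation below is illustrative: the input/output paths are hypothetical, and option spellings should be checked against your Mahout version.]

```shell
# Raise the per-task timeout (in ms) for this job only.
# 6000000 ms = 100 minutes; the Hadoop 1.x default is 600000 (10 minutes).
mahout rowsimilarity \
  -Dmapred.task.timeout=6000000 \
  -i b2/matrix \
  -o b2/similarity \
  --similarityClassname SIMILARITY_COSINE \
  --maxSimilaritiesPerRow 40
```

This avoids the cluster-wide side effect Pat mentions, where a high global timeout makes every job slow to recover from transient task failures.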
>>>>>>>>>>>
>>>>>>>>>>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>>>>>>>>>> To be honest, I don't think it makes a lot of sense to test a
>>>>>>>>>>>> Hadoop job on a single machine. It's pretty obvious that you
>>>>>>>>>>>> will get terrible performance.
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>>>> BTW, the timeout is 1800 but the task in total runs over 9
>>>>>>>>>>>>> hours before each failure. This causes the job to take
>>>>>>>>>>>>> (after three tries) 27 hrs to completely fail. Oh, bother...
>>>>>>>>>>>>>
>>>>>>>>>>>>> The timeout seems to be during the last map, so when the
>>>>>>>>>>>>> mappers reach 100% but are still running. Maybe some kind of
>>>>>>>>>>>>> cleanup is happening? The first reducer is still "pending".
>>>>>>>>>>>>> The reducer never gets a chance to start.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
>>>>>>>>>>>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>>>>>>>>>>>
>>>>>>>>>>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000
>>>>>>>>>>>>> docs, 76,000 terms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any other info you need, please ask.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm about to try cranking the timeout up to a couple of
>>>>>>>>>>>>> hours, but I suspect there is something else going on here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>>>>>>>>>> I have a custom Lucene stemming analyzer that filters out
>>>>>>>>>>>>>> stop words, and I use the following seq2sparse. The -x 40
>>>>>>>>>>>>>> is the only other thing that affects tossing frequent
>>>>>>>>>>>>>> terms and, as I understand things, tosses any term that
>>>>>>>>>>>>>> appears in over 40% of the docs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mahout seq2sparse \
>>>>>>>>>>>>>>   -i b2/seqfiles/ \
>>>>>>>>>>>>>>   -o b2/vectors/ \
>>>>>>>>>>>>>>   -ow \
>>>>>>>>>>>>>>   -chunk 2000 \
>>>>>>>>>>>>>>   -x 40 \
>>>>>>>>>>>>>>   -seq \
>>>>>>>>>>>>>>   -n 2 \
>>>>>>>>>>>>>>   -nv \
>>>>>>>>>>>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> have you removed highly frequent terms before launching
>>>>>>>>>>>>>>> the rowsimilarity job?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>>>>>>>>>> I've been trying to get a rowsimilarity job to complete.
>>>>>>>>>>>>>>>> It continues to time out on a
>>>>>>>>>>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so
>>>>>>>>>>>>>>>> I've upped the timeout to 30 minutes now. There are no
>>>>>>>>>>>>>>>> errors in the logs that I can see, and no other task
>>>>>>>>>>>>>>>> I've tried is acting like this. Is this expected?
>>>>>>>>>>>>>>>> Shouldn't the task check in more often?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's doing 34,000 docs with 40 similar docs each on 8
>>>>>>>>>>>>>>>> cores, so it is a bit slow anyway. Still, I shouldn't
>>>>>>>>>>>>>>>> have to turn the timeout up so high, should I?
>>>>
>>>> --------------------------
>>>> Ken Krugler
>>>> http://www.scaleunlimited.com
>>>> custom big data solutions & training
>>>> Hadoop, Cascading, Mahout & Solr

--
Lance Norskog
[email protected]
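[Editor's note: the two-stage pattern Ken describes -- a cheap weighted edismax-style candidate fetch over a heavily reduced feature set, followed by exact pairwise ranking -- can be sketched as below. The `search_candidates` callable is a hypothetical stand-in for the Solr query, not a real client API, and the feature-reduction step is a simple top-n-by-weight assumption.]

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_features(vec, n):
    """Heavily reduce the doc vector: keep only its n strongest terms."""
    return dict(sorted(vec.items(), key=lambda kv: -kv[1])[:n])

def similar_docs(target_vec, corpus, search_candidates, n_features=20, k=10):
    """Stage 1: fetch candidate ids via the reduced, weighted query
    (stand-in for an edismax search). Stage 2: exact cosine rerank of
    just those candidates against the full target vector."""
    query = top_features(target_vec, n_features)
    candidate_ids = search_candidates(query)
    ranked = sorted(candidate_ids,
                    key=lambda d: cosine(target_vec, corpus[d]),
                    reverse=True)
    return ranked[:k]
```

The design point is that only stage 2 touches full vectors, and only for the N candidates the index returned, so the expensive pairwise math never scales with the corpus.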
