I would call it kinda-cosine distance. There are some intricate normalization factors.
On Sat, Jul 14, 2012 at 5:22 PM, Lance Norskog <[email protected]> wrote:
> Lucene's MoreLikeThis feature does cosine distance (I think) directly
> against term vectors.
>
> On Sat, Jul 14, 2012 at 11:16 AM, Ted Dunning <[email protected]> wrote:
>> Solr would do this well. The upcoming knn package would do it differently
>> and for different purposes, but also would do it well.
>>
>> On Sat, Jul 14, 2012 at 8:17 AM, Pat Ferrel <[email protected]> wrote:
>>> Interesting.
>>>
>>> I have another requirement, which is to do something like real-time
>>> vector-based queries. Imagine taking a doc vector, reweighting some
>>> terms, then doing a query with it, perhaps in a truncated form. There
>>> are several ways to do this, but only Solr would offer real-time
>>> results AFAIK. It looks like I could use your approach below to do
>>> this. A quick look at eDisMax, however, suggests some problems. The use
>>> of pf2 and pf3 would jam the query vector into synthesized bi- and
>>> trigrams, for instance.
>>>
>>> I'd be interested in hearing more about how you use it. Is there a
>>> better venue than the Mahout list?
>>>
>>> On 7/13/12 9:41 PM, Ken Krugler wrote:
>>>> Hi Pat,
>>>>
>>>> On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
>>>>> I also do clustering, so that's an obvious optimization I just
>>>>> haven't gotten to yet (doing similarity only on docs clustered
>>>>> together). I'm also trying to decide how to downsample. However, the
>>>>> results from similarity are quite good, so understanding how to
>>>>> scale is #1.
>>>>>
>>>>> Clustering gives docs closest to a centroid. RowSimilarity finds
>>>>> docs similar to each doc.
>>>>>
>>>>> What I really need is to calculate the k most similar docs to a
>>>>> short list, known ahead of time. I don't know of an algorithm to do
>>>>> this (other than brute force).
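[Editor's note: the brute-force computation Pat describes -- k most similar docs for a short, known list of query docs against a much larger set -- can be sketched as below. This is a minimal in-memory illustration, not a Mahout API; the dict-of-dicts sparse-vector representation is an assumption made for the sketch.]

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_similar(query_docs, corpus, k):
    """For each query doc id, return ids of the k most similar corpus docs.

    Brute force: every query doc is compared against every corpus doc,
    but only for the handful of query docs, not all pairs in the corpus.
    """
    result = {}
    for qid, qvec in query_docs.items():
        scored = ((cosine(qvec, dvec), did)
                  for did, dvec in corpus.items() if did != qid)
        result[qid] = [did for _, did in heapq.nlargest(k, scored)]
    return result
```

For a short query list this is linear in the corpus size per query doc, which is exactly why it sidesteps the all-pairs cost of RowSimilarityJob.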
>>>>> It would take a relatively small set of docs and find similar docs
>>>>> in a much, much larger set. RowSimilarity finds all pairwise
>>>>> similarities. Strictly speaking, I need only a tiny number of those.
>>>>>
>>>>> I think Lucene has a weighted-vector-based search that I need to
>>>>> investigate further.
>>>>
>>>> As one point of reference, I've used Solr (Lucene) to do this, by
>>>> taking the set of features (small, heavily reduced) from the target
>>>> doc, using them (with weights) via edismax to find some top N
>>>> candidate documents in the Lucene index, which I'd built using the
>>>> same approach (small set of features), and then calculating pairwise
>>>> similarity to rank the results.
>>>>
>>>> -- Ken
>>>>
>>>>> On 7/13/12 9:32 AM, Sebastian Schelter wrote:
>>>>>> Pat,
>>>>>>
>>>>>> RowSimilarityJob compares all pairs of rows, which is by definition
>>>>>> a quadratic and therefore non-scalable problem. The comparison is,
>>>>>> however, done in a way that only rows that have at least one
>>>>>> non-zero value in a common dimension are compared.
>>>>>>
>>>>>> Therefore, if you have certain sparse types of input, such as
>>>>>> ratings for example, you only have to look at a relatively small
>>>>>> number of pairs and the comparison scales.
>>>>>>
>>>>>> RowSimilarityJob is mainly used for the collaborative filtering
>>>>>> stuff in Mahout. We have a special job to prepare the data
>>>>>> (PreparePreferenceMatrixJob) that will take care of sampling down
>>>>>> entries in the rating matrix that might cause too many
>>>>>> cooccurrences.
>>>>>>
>>>>>> If you directly use RowSimilarityJob, you have to ensure that your
>>>>>> input data is of a shape suitable for the job. It seems to me that
>>>>>> this is not the case: you created 76GB of intermediate output
>>>>>> (cooccurring terms) from 35k documents, so it's clear that it takes
>>>>>> Hadoop a long time to sort that in the shuffle phase.
>>>>>>
>>>>>> My advice would be that you either take a deeper look at your data
>>>>>> and try to downsample highly frequent terms more, or that you take
>>>>>> a look at other techniques such as clustering or locality-sensitive
>>>>>> hashing to find similar documents.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> On 13.07.2012 18:03, Pat Ferrel wrote:
>>>>>>> I increased the timeout to 100 minutes and added another machine
>>>>>>> (does the new machine matter in this case?). The job completed
>>>>>>> successfully.
>>>>>>>
>>>>>>> You say the algorithm is non-scalable -- did you mean it's not
>>>>>>> parallelizable? I assume I'll need to keep increasing this limit?
>>>>>>>
>>>>>>> I'm sure you know better than I that it is not really good for the
>>>>>>> efficiency of a cluster to increase the timeout so far, since it
>>>>>>> means jobs can take much longer in the case of transient task
>>>>>>> failures.
>>>>>>>
>>>>>>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>>>>>>> OK, thanks. I haven't checked for sparsity. However, I have many
>>>>>>>> successful runs of rowsimilarity with up to 150,000 docs and
>>>>>>>> 250,000 terms, as I said below. This run has a much smaller
>>>>>>>> matrix. I understand that sparsity is a different question, but
>>>>>>>> anyway, since the data in all cases is a crawl of the same sites,
>>>>>>>> I'd expect the same sparsity in all the data sets, whether they
>>>>>>>> succeeded or timed out.
>>>>>>>>
>>>>>>>> My issue has nothing to do with the elapsed time, although I'll
>>>>>>>> have to consider it in larger data sets (thanks for the heads
>>>>>>>> up). Is it impossible to check in with the task tracker, avoiding
>>>>>>>> a timeout? Or is there some other issue?
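[Editor's note: Sebastian's downsampling advice follows from a simple count: a term that appears in df documents contributes on the order of df*(df-1)/2 cooccurring pairs, so a handful of very frequent terms can dominate the intermediate output. A rough sketch of that arithmetic and a df-based pruning pass (illustrative only, not Mahout's actual implementation; this expresses the cutoff as a fraction, while seq2sparse's -x takes a percentage):]

```python
def doc_frequencies(docs):
    """docs: {doc_id: set of terms} -> {term: number of docs containing it}."""
    df = {}
    for terms in docs.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1
    return df

def cooccurrence_pairs(docs):
    """Total pairwise comparisons implied by the corpus:
    sum over terms of df*(df-1)/2."""
    return sum(n * (n - 1) // 2 for n in doc_frequencies(docs).values())

def prune_frequent_terms(docs, max_df_fraction):
    """Drop any term appearing in more than max_df_fraction of the docs,
    the same idea as seq2sparse's -x option."""
    df = doc_frequencies(docs)
    limit = max_df_fraction * len(docs)
    return {d: {t for t in terms if df[t] <= limit}
            for d, terms in docs.items()}
```

Because the pair count is quadratic in df, pruning even a few high-df terms shrinks the shuffle-phase data far more than it shrinks the vectors themselves.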
>>>>>>>>
>>>>>>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>>>>>>> It's important to note that the performance of RowSimilarityJob
>>>>>>>>> heavily depends on the sparsity of the input data, because in
>>>>>>>>> general, comparing all pairs of things is a quadratic
>>>>>>>>> (non-scalable) problem.
>>>>>>>>>
>>>>>>>>> 2012/7/12 Sebastian Schelter <[email protected]>:
>>>>>>>>>> Sorry, I missed that it's more than one machine. Could you
>>>>>>>>>> provide the values for the counters from RowSimilarityJob
>>>>>>>>>> (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Sebastian
>>>>>>>>>>
>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>> Thanks, actually there are two machines. I am testing before
>>>>>>>>>>> spending on AWS. It's failing the test in this case.
>>>>>>>>>>>
>>>>>>>>>>> BTW, I ran the same setup with 150,000 docs and 250,000 terms
>>>>>>>>>>> with a much lower timeout (30000000); all worked fine. I was
>>>>>>>>>>> using 0.6 at the time, and I'm not sure if 0.8 has ever
>>>>>>>>>>> completed a rowsimilarity of any size. Small runs work fine on
>>>>>>>>>>> my laptop.
>>>>>>>>>>>
>>>>>>>>>>> I smell some kind of problem other than simple performance. In
>>>>>>>>>>> any case, in a perfect world, isn't the code supposed to check
>>>>>>>>>>> in often enough that the cluster config doesn't need to be
>>>>>>>>>>> tweaked for a specific job?
>>>>>>>>>>>
>>>>>>>>>>> It may be some problem of mine, of course. I see no obvious
>>>>>>>>>>> Hadoop or Mahout errors, but there are many places to look.
>>>>>>>>>>>
>>>>>>>>>>> With a 100-minute timeout I am currently at the pause between
>>>>>>>>>>> map and reduce. If it fails, would you like any specific logs?
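[Editor's note: much of the thread revolves around raising the task timeout. Since Mahout's Hadoop drivers run through ToolRunner, the timeout can be raised per job with a generic -D option rather than by editing mapred-site.xml cluster-wide. The invocation below is illustrative: the input/output paths are hypothetical, and option spellings should be checked against your Mahout version.]

```shell
# Raise the per-task timeout (in ms) for this job only.
# 6000000 ms = 100 minutes; the Hadoop 1.x default is 600000 (10 minutes).
mahout rowsimilarity \
  -Dmapred.task.timeout=6000000 \
  -i b2/matrix \
  -o b2/similarity \
  --similarityClassname SIMILARITY_COSINE \
  --maxSimilaritiesPerRow 40
```

This avoids the cluster-wide side effect Pat mentions, where a high global timeout makes every job slow to recover from transient task failures.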
>>>>>>>>>>>
>>>>>>>>>>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>>>>>>>>>> To be honest, I don't think it makes a lot of sense to test a
>>>>>>>>>>>> Hadoop job on a single machine. It's pretty obvious that you
>>>>>>>>>>>> will get terrible performance.
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>>>> BTW, the timeout is 1800 but the task in total runs over 9
>>>>>>>>>>>>> hours before each failure. This causes the job to take
>>>>>>>>>>>>> (after three tries) 27 hrs to completely fail. Oh, bother...
>>>>>>>>>>>>>
>>>>>>>>>>>>> The timeout seems to be during the last map, so when the
>>>>>>>>>>>>> mappers reach 100% but are still running. Maybe some kind of
>>>>>>>>>>>>> cleanup is happening? The first reducer is still "pending".
>>>>>>>>>>>>> The reducer never gets a chance to start.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
>>>>>>>>>>>>> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
>>>>>>>>>>>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>>>>>>>>>>>
>>>>>>>>>>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000
>>>>>>>>>>>>> docs, 76,000 terms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any other info you need, please ask.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm about to try cranking the timeout up to a couple of
>>>>>>>>>>>>> hours, but I suspect there is something else going on here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>>>>>>>>>> I have a custom Lucene stemming analyzer that filters out
>>>>>>>>>>>>>> stop words, and I use the following seq2sparse. The -x 40
>>>>>>>>>>>>>> is the only other thing that affects tossing frequent
>>>>>>>>>>>>>> terms and, as I understand things, tosses any term that
>>>>>>>>>>>>>> appears in over 40% of the docs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mahout seq2sparse \
>>>>>>>>>>>>>>   -i b2/seqfiles/ \
>>>>>>>>>>>>>>   -o b2/vectors/ \
>>>>>>>>>>>>>>   -ow \
>>>>>>>>>>>>>>   -chunk 2000 \
>>>>>>>>>>>>>>   -x 40 \
>>>>>>>>>>>>>>   -seq \
>>>>>>>>>>>>>>   -n 2 \
>>>>>>>>>>>>>>   -nv \
>>>>>>>>>>>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> have you removed highly frequent terms before launching
>>>>>>>>>>>>>>> the rowsimilarity job?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>>>>>>>>>> I've been trying to get a rowsimilarity job to complete.
>>>>>>>>>>>>>>>> It continues to time out on a
>>>>>>>>>>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so
>>>>>>>>>>>>>>>> I've upped the timeout to 30 minutes now. There are no
>>>>>>>>>>>>>>>> errors in the logs that I can see, and no other task
>>>>>>>>>>>>>>>> I've tried is acting like this. Is this expected?
>>>>>>>>>>>>>>>> Shouldn't the task check in more often?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's doing 34,000 docs with 40 similar docs each on 8
>>>>>>>>>>>>>>>> cores, so it is a bit slow anyway. Still, I shouldn't
>>>>>>>>>>>>>>>> have to turn the timeout up so high, should I?
>>>>
>>>> --------------------------
>>>> Ken Krugler
>>>> http://www.scaleunlimited.com
>>>> custom big data solutions & training
>>>> Hadoop, Cascading, Mahout & Solr

--
Lance Norskog
[email protected]
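[Editor's note: the two-stage pattern Ken describes -- a cheap weighted edismax-style candidate fetch over a heavily reduced feature set, followed by exact pairwise ranking -- can be sketched as below. The `search_candidates` callable is a hypothetical stand-in for the Solr query, not a real client API, and the feature-reduction step is a simple top-n-by-weight assumption.]

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_features(vec, n):
    """Heavily reduce the doc vector: keep only its n strongest terms."""
    return dict(sorted(vec.items(), key=lambda kv: -kv[1])[:n])

def similar_docs(target_vec, corpus, search_candidates, n_features=20, k=10):
    """Stage 1: fetch candidate ids via the reduced, weighted query
    (stand-in for an edismax search). Stage 2: exact cosine rerank of
    just those candidates against the full target vector."""
    query = top_features(target_vec, n_features)
    candidate_ids = search_candidates(query)
    ranked = sorted(candidate_ids,
                    key=lambda d: cosine(target_vec, corpus[d]),
                    reverse=True)
    return ranked[:k]
```

The design point is that only stage 2 touches full vectors, and only for the N candidates the index returned, so the expensive pairwise math never scales with the corpus.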
