Lucene's MoreLikeThis feature does cosine distance (I think) directly against term vectors.
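[For context: "cosine distance against term vectors" means scoring by the angle between weighted term vectors. A minimal Python sketch of that computation — this illustrates the math only, not Lucene's actual MoreLikeThis implementation, and the documents and weights are made up:]

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf-style term vectors for two documents.
doc_a = {"hadoop": 2.0, "mahout": 1.0, "solr": 1.0}
doc_b = {"hadoop": 1.0, "solr": 3.0}
print(cosine(doc_a, doc_b))
```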
On Sat, Jul 14, 2012 at 11:16 AM, Ted Dunning <[email protected]> wrote:
> Solr would do this well. The upcoming knn package would do it differently
> and for different purposes, but also would do it well.
>
> On Sat, Jul 14, 2012 at 8:17 AM, Pat Ferrel <[email protected]> wrote:
>
>> Interesting.
>>
>> I have another requirement, which is to do something like real-time
>> vector-based queries. Imagine taking a doc vector, reweighting some
>> terms, then doing a query with it, perhaps in a truncated form. There
>> are several ways to do this, but only Solr would offer real-time
>> results AFAIK. It looks like I could use your approach below to do
>> this. A quick look at eDisMax, however, suggests some problems. The use
>> of pf2 and pf3 would jam the query vector into synthesized bi- and
>> trigrams, for instance.
>>
>> I'd be interested in hearing more about how you use it. Is there a
>> better venue than the mahout list?
>>
>> On 7/13/12 9:41 PM, Ken Krugler wrote:
>>
>>> Hi Pat,
>>>
>>> On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
>>>
>>>> I also do clustering, so that's an obvious optimization I just
>>>> haven't gotten to yet (doing similarity only on docs clustered
>>>> together). I'm also trying to decide how to downsample. However, the
>>>> results from similarity are quite good, so understanding how to
>>>> scale is #1.
>>>>
>>>> Clustering gives docs closest to a centroid. RowSimilarityJob finds
>>>> docs similar to each doc.
>>>>
>>>> What I really need is to calculate the k most similar docs to a
>>>> short list known ahead of time. I don't know of an algorithm to do
>>>> this (other than brute force). It would take a relatively small set
>>>> of docs and find similar docs in a much, much larger set.
>>>> RowSimilarityJob finds all pairwise similarities; strictly speaking,
>>>> I need only a tiny number of those.
>>>>
>>>> I think Lucene has a weighted-vector-based search that I need to
>>>> investigate further.
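[Pat's requirement above — the k most similar docs to a small, known query set — is linear in the size of the big set rather than all-pairs quadratic when done by brute force. A rough Python sketch under made-up data; this is an illustration of the shape of the problem, not any Mahout or Lucene API:]

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_similar(query_docs, corpus, k):
    """For each doc in the small query set, scan the large corpus once and
    keep the k best matches. Cost is O(|query_docs| * |corpus|) -- linear
    in the big set, unlike RowSimilarityJob's all-pairs comparison."""
    results = {}
    for qid, qvec in query_docs.items():
        scored = ((cosine(qvec, dvec), did) for did, dvec in corpus.items())
        results[qid] = heapq.nlargest(k, scored)
    return results

# Hypothetical tiny corpus.
corpus = {
    "d1": {"hadoop": 1.0, "mahout": 2.0},
    "d2": {"solr": 3.0, "lucene": 1.0},
    "d3": {"hadoop": 2.0, "solr": 1.0},
}
print(top_k_similar({"q": {"hadoop": 1.0}}, corpus, 2))
```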
>>>>
>>> As one point of reference, I've used Solr (Lucene) to do this: taking
>>> the set of features (small, heavily reduced) from the target doc,
>>> using them (with weights) via edismax to find some top-N candidate
>>> documents in the Lucene index (which I'd built using the same
>>> small-feature-set approach), and then calculating pairwise similarity
>>> to rank the results.
>>>
>>> -- Ken
>>>
>>>> On 7/13/12 9:32 AM, Sebastian Schelter wrote:
>>>>
>>>>> Pat,
>>>>>
>>>>> RowSimilarityJob compares all pairs of rows, which is by definition
>>>>> a quadratic and therefore non-scalable problem. The comparison is,
>>>>> however, done in a way that only rows with at least one non-zero
>>>>> value in a common dimension are compared.
>>>>>
>>>>> Therefore, if you have certain sparse types of input, such as
>>>>> ratings, you only have to look at a relatively small number of
>>>>> pairs, and the comparison scales.
>>>>>
>>>>> RowSimilarityJob is mainly used for the collaborative filtering
>>>>> code in Mahout. We have a special job to prepare the data
>>>>> (PreparePreferenceMatrixJob) that takes care of downsampling
>>>>> entries in the rating matrix that might cause too many
>>>>> cooccurrences.
>>>>>
>>>>> If you use RowSimilarityJob directly, you have to ensure that your
>>>>> input data is in a shape suitable for the job. It seems to me that
>>>>> this is not the case: you created 76GB of intermediate output
>>>>> (cooccurring terms) from 35k documents, so it's clear that it takes
>>>>> Hadoop a long time to sort that in the shuffle phase.
>>>>>
>>>>> My advice would be to either take a deeper look at your data and
>>>>> downsample highly frequent terms more aggressively, or look at
>>>>> other techniques such as clustering or locality-sensitive hashing
>>>>> to find similar documents.
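[Sebastian's downsampling advice can be made concrete: a term that appears in df documents forces df·(df−1)/2 document pairs into the cooccurrence shuffle, so a handful of very frequent terms dominates the intermediate output. A quick sketch with made-up document frequencies (the numbers are illustrative only):]

```python
def cooccurring_pairs(df):
    """Document pairs a single term with document frequency df contributes
    to the shuffle: df choose 2."""
    return df * (df - 1) // 2

# Hypothetical document frequencies in a ~35,000-doc corpus.
doc_freqs = {"rare_term": 20, "medium_term": 500, "stopword_like": 14000}
for term, df in doc_freqs.items():
    # One stopword-like term alone yields tens of millions of pairs.
    print(term, cooccurring_pairs(df))
```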
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 13.07.2012 18:03, Pat Ferrel wrote:
>>>>>
>>>>>> I increased the timeout to 100 minutes and added another machine
>>>>>> (does the new machine matter in this case?). The job completed
>>>>>> successfully.
>>>>>>
>>>>>> You say the algorithm is non-scalable -- did you mean it's not
>>>>>> parallelizable? I assume I'll need to keep increasing this limit?
>>>>>>
>>>>>> I'm sure you know better than I that it is not really good for the
>>>>>> efficiency of a cluster to increase the timeout so far, since it
>>>>>> means jobs can take much longer in the case of transient task
>>>>>> failures.
>>>>>>
>>>>>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>>>>>
>>>>>>> OK, thanks. I haven't checked for sparsity. However, I have many
>>>>>>> successful runs of rowsimilarity with up to 150,000 docs and
>>>>>>> 250,000 terms, as I said below. This run has a much smaller
>>>>>>> matrix. I understand that sparsity is a different question, but
>>>>>>> since the data in all cases is a crawl of the same sites, I'd
>>>>>>> expect the same sparsity in all the data sets, whether they
>>>>>>> succeeded or timed out.
>>>>>>>
>>>>>>> My issue has nothing to do with the elapsed time, although I'll
>>>>>>> have to consider it on larger data sets (thanks for the
>>>>>>> heads-up). Is it impossible to check in with the task tracker,
>>>>>>> avoiding a timeout? Or is there some other issue?
>>>>>>>
>>>>>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>>>>>
>>>>>>>> It's important to note that the performance of RowSimilarityJob
>>>>>>>> heavily depends on the sparsity of the input data, because in
>>>>>>>> general comparing all pairs of things is a quadratic
>>>>>>>> (non-scalable) problem.
>>>>>>>>
>>>>>>>> 2012/7/12 Sebastian Schelter <[email protected]>:
>>>>>>>>
>>>>>>>>> Sorry, I missed that it's more than one machine. Could you
>>>>>>>>> provide the values of the counters from RowSimilarityJob (ROWS,
>>>>>>>>> COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Thanks; actually there are two machines. I am testing before
>>>>>>>>>> spending on AWS. It's failing the test in this case.
>>>>>>>>>>
>>>>>>>>>> BTW, I ran the same setup with 150,000 docs and 250,000 terms
>>>>>>>>>> with a much lower timeout (30000000), and it all worked fine.
>>>>>>>>>> I was using 0.6 at the time and am not sure whether 0.8 has
>>>>>>>>>> ever completed a rowsimilarity of any size. Small runs work
>>>>>>>>>> fine on my laptop.
>>>>>>>>>>
>>>>>>>>>> I smell some kind of problem other than simple performance. In
>>>>>>>>>> any case, in a perfect world isn't the code supposed to check
>>>>>>>>>> in often enough that the cluster config doesn't need to be
>>>>>>>>>> tweaked for a specific job?
>>>>>>>>>>
>>>>>>>>>> It may be some problem of mine, of course. I see no obvious
>>>>>>>>>> Hadoop or Mahout errors, but there are many places to look.
>>>>>>>>>>
>>>>>>>>>> With a 100-minute timeout I am currently at the pause between
>>>>>>>>>> map and reduce. If it fails, would you like any specific logs?
>>>>>>>>>>
>>>>>>>>>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>>>>>>>>
>>>>>>>>>>> To be honest, I don't think it makes a lot of sense to test a
>>>>>>>>>>> Hadoop job on a single machine. It's pretty obvious that you
>>>>>>>>>>> will get terrible performance.
>>>>>>>>>>>
>>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> BTW, the timeout is 1800, but the task in total runs over 9
>>>>>>>>>>>> hours before each failure. This causes the job to take 27
>>>>>>>>>>>> hrs (after three tries) to completely fail. Oh, bother...
>>>>>>>>>>>>
>>>>>>>>>>>> The timeout seems to be during the last map, when the
>>>>>>>>>>>> mappers have reached 100% but are still running.
>>>>>>>>>>>> Maybe some kind of cleanup is happening?
>>>>>>>>>>>> The first reducer is still "pending". The reducer never gets
>>>>>>>>>>>> a chance to start.
>>>>>>>>>>>>
>>>>>>>>>>>> 12/07/11 11:09:45 INFO mapred.JobClient:  map 92% reduce 0%
>>>>>>>>>>>> 12/07/11 11:11:06 INFO mapred.JobClient:  map 93% reduce 0%
>>>>>>>>>>>> 12/07/11 11:12:51 INFO mapred.JobClient:  map 94% reduce 0%
>>>>>>>>>>>> 12/07/11 11:15:22 INFO mapred.JobClient:  map 95% reduce 0%
>>>>>>>>>>>> 12/07/11 11:18:43 INFO mapred.JobClient:  map 96% reduce 0%
>>>>>>>>>>>> 12/07/11 11:24:32 INFO mapred.JobClient:  map 97% reduce 0%
>>>>>>>>>>>> 12/07/11 11:27:40 INFO mapred.JobClient:  map 98% reduce 0%
>>>>>>>>>>>> 12/07/11 11:30:53 INFO mapred.JobClient:  map 99% reduce 0%
>>>>>>>>>>>> 12/07/11 11:36:35 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>>>>>>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>>>>>>>>>>
>>>>>>>>>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000
>>>>>>>>>>>> docs, 76,000 terms.
>>>>>>>>>>>>
>>>>>>>>>>>> Any other info you need, please ask.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm about to try cranking the timeout up to a couple of
>>>>>>>>>>>> hours, but I suspect there is something else going on here.
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I have a custom Lucene stemming analyzer that filters out
>>>>>>>>>>>>> stop words, and I use the following seq2sparse. The -x 40
>>>>>>>>>>>>> is the only other thing that affects tossing frequent terms
>>>>>>>>>>>>> and, as I understand things, tosses any term that appears
>>>>>>>>>>>>> in over 40% of the docs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> mahout seq2sparse \
>>>>>>>>>>>>>   -i b2/seqfiles/ \
>>>>>>>>>>>>>   -o b2/vectors/ \
>>>>>>>>>>>>>   -ow \
>>>>>>>>>>>>>   -chunk 2000 \
>>>>>>>>>>>>>   -x 40 \
>>>>>>>>>>>>>   -seq \
>>>>>>>>>>>>>   -n 2 \
>>>>>>>>>>>>>   -nv \
>>>>>>>>>>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Have you removed highly frequent terms before launching
>>>>>>>>>>>>>> the rowsimilarity job?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've been trying to get a rowsimilarity job to complete.
>>>>>>>>>>>>>>> It continues to time out on a
>>>>>>>>>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so
>>>>>>>>>>>>>>> I've upped the timeout to 30 minutes now. There are no
>>>>>>>>>>>>>>> errors in the logs that I can see, and no other task I've
>>>>>>>>>>>>>>> tried is acting like this. Is this expected? Shouldn't
>>>>>>>>>>>>>>> the task check in more often?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores,
>>>>>>>>>>>>>>> so it is a bit slow anyway. Still, I shouldn't have to
>>>>>>>>>>>>>>> turn up the timeout so high, should I?
>>>
>>> --------------------------
>>> Ken Krugler
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Mahout & Solr
>>

-- 
Lance Norskog
[email protected]
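[Editor's note on the seq2sparse command quoted above: -x 40 prunes any term that appears in more than 40% of the documents. A minimal Python sketch of that pruning rule — the corpus here is invented and this is not Mahout's implementation, just the idea:]

```python
def prune_frequent_terms(docs, max_df_percent):
    """Drop any term whose document frequency exceeds max_df_percent
    of the corpus size (analogous to seq2sparse's -x option)."""
    n = len(docs)
    df = {}
    for terms in docs:
        for t in set(terms):  # count each term once per document
            df[t] = df.get(t, 0) + 1
    cutoff = n * max_df_percent / 100.0
    keep = {t for t, count in df.items() if count <= cutoff}
    return [[t for t in terms if t in keep] for terms in docs]

# "the" appears in 3/3 docs and "job" in 2/3; both exceed 40% and are dropped.
docs = [["the", "hadoop", "job"], ["the", "mahout", "job"], ["the", "solr"]]
print(prune_frequent_terms(docs, 40))
```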
