Hi Pat,

On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
> Interesting.
>
> I have another requirement, which is to do something like real-time,
> vector-based queries. Imagine taking a doc vector, reweighting some terms,
> then doing a query with it, perhaps in a truncated form. There are several
> ways to do this, but only Solr would offer real-time results AFAIK. It
> looks like I could use your approach below to do this. A quick look at
> eDisMax, however, suggests some problems. The use of pf2 and pf3 would jam
> the query vector into synthesized bi- and trigrams, for instance.

The simplistic approach I used was to extract the top 50 terms (with TF*IDF weights) from the target document, then use those terms (with weights as boosts) to do a regular Lucene OR query and request the top 20 hits.

The index I'm searching against has Solr documents with a multi-value field that contains the top 50 terms, generated using the same approach as with the target document. It also contains stored weights for each of those terms. I didn't use payload boosting, but could have to improve the quality of this search; it seemed to be working well enough, and speed was pretty important.

Solr returns a sorted list of hits, and then I do a regular vector similarity calculation between the target and each of these top 20 hits, and select the best one (assuming it passes a similarity threshold).

> I'd be interested in hearing more about how you use it. Is there a better
> venue than the mahout list?

If you'd like more details, that's probably better for an off-list discussion…doesn't feel very Mahout-ish in nature :) Though a discussion of the major problem (how to extract "good" terms from the text) would be very interesting, as I wound up crafting what felt like a kludgy pseudo-NLP solution.
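For concreteness, the re-ranking step above can be sketched like this (a Python illustration with made-up names and a made-up threshold; the real version pulls the stored term weights out of the Solr response):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(target, hits, threshold=0.3):
    """Re-rank the top Solr hits by vector similarity to the target doc.

    target: dict of the target doc's top terms -> TF*IDF weights
    hits:   list of (doc_id, term_weight_dict) for the top N Solr hits
    Returns the best (doc_id, score) at or above the threshold, else None.
    """
    scored = [(doc_id, cosine(target, terms)) for doc_id, terms in hits]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if scored and scored[0][1] >= threshold:
        return scored[0]
    return None
```

The query side is then just the same top terms joined into a boosted OR query (something like `hadoop^2.0 OR mahout^1.3 OR ...`), per the description above.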
-- Ken

> On 7/13/12 9:41 PM, Ken Krugler wrote:
>> Hi Pat,
>>
>> On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
>>
>>> I also do clustering, so that's an obvious optimization; I just haven't
>>> gotten to it yet (doing similarity only on docs clustered together). I'm
>>> also trying to decide how to downsample. However, the results from
>>> similarity are quite good, so understanding how to scale is #1.
>>>
>>> Clustering gives docs closest to a centroid. RowSimilarity finds docs
>>> similar to each doc.
>>>
>>> What I really need is to calculate the k most similar docs to a short
>>> list, known ahead of time. I don't know of an algorithm to do this (other
>>> than brute force). It would take a relatively small set of docs and find
>>> similar docs in a much, much larger set. RowSimilarity finds all pairwise
>>> similarities; strictly speaking, I need only a tiny number of those.
>>>
>>> I think Lucene has a weighted-vector-based search that I need to
>>> investigate further.
>> As one point of reference, I've used Solr (Lucene) to do this, by taking
>> the set of features (small, heavily reduced) from the target doc, using
>> them (with weights) via edismax to find some top N candidate documents in
>> the Lucene index, which I'd built using the same approach (small set of
>> features), and then calculating pairwise similarity to rank the results.
>>
>> -- Ken
>>
>>> On 7/13/12 9:32 AM, Sebastian Schelter wrote:
>>>> Pat,
>>>>
>>>> RowSimilarityJob compares all pairs of rows, which is by definition a
>>>> quadratic and therefore non-scalable problem. The comparison is,
>>>> however, done in a way that only rows that have at least one non-zero
>>>> value in a common dimension are compared.
>>>>
>>>> Therefore, if you have certain sparse types of input, such as ratings,
>>>> you only have to look at a relatively small number of pairs and the
>>>> comparison scales.
>>>>
>>>> RowSimilarityJob is mainly used for the collaborative filtering stuff
>>>> in Mahout. We have a special job to prepare the data
>>>> (PreparePreferenceMatrixJob) that will take care of sampling down
>>>> entries in the rating matrix that might cause too many cooccurrences.
>>>>
>>>> If you directly use RowSimilarityJob, you have to ensure that your
>>>> input data is of a shape suitable for the job. It seems to me that this
>>>> is not the case: you created 76GB of intermediate output (cooccurring
>>>> terms) from 35k documents, so it's clear that it takes Hadoop a long
>>>> time to sort that in the shuffle phase.
>>>>
>>>> My advice would be that you either take a deeper look at your data and
>>>> try to downsample highly frequent terms more, or that you take a look
>>>> at other techniques, such as clustering or locality-sensitive hashing,
>>>> to find similar documents.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 13.07.2012 18:03, Pat Ferrel wrote:
>>>>> I increased the timeout to 100 minutes and added another machine (does
>>>>> the new machine matter in this case?). The job completed successfully.
>>>>>
>>>>> You say the algorithm is non-scalable; did you mean it's not
>>>>> parallelizable? I assume I'll need to keep increasing this limit?
>>>>>
>>>>> I'm sure you know better than I that it is not really good for the
>>>>> efficiency of a cluster to increase the timeout so far, since it means
>>>>> jobs can take much longer in the case of transient task failures.
>>>>>
>>>>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>>>>> OK, thanks. I haven't checked for sparsity. However, I have many
>>>>>> successful runs of rowsimilarity with up to 150,000 docs and 250,000
>>>>>> terms, as I said below. This run has a much smaller matrix. I
>>>>>> understand that sparsity is a different question, but since the data
>>>>>> in all cases is a crawl of the same sites, I'd expect the same
>>>>>> sparsity in all the data sets, whether they succeeded or timed out.
>>>>>>
>>>>>> My issue has nothing to do with the elapsed time, although I'll have
>>>>>> to consider it for larger data sets (thanks for the heads-up). Is it
>>>>>> impossible to check in with the task tracker, avoiding a timeout? Or
>>>>>> is there some other issue?
>>>>>>
>>>>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>>>>> It's important to note that the performance of RowSimilarityJob
>>>>>>> heavily depends on the sparsity of the input data, because in
>>>>>>> general comparing all pairs of things is a quadratic (non-scalable)
>>>>>>> problem.
>>>>>>>
>>>>>>> 2012/7/12 Sebastian Schelter <[email protected]>:
>>>>>>>> Sorry, I overread that it's more than one machine. Could you
>>>>>>>> provide the values for the counters from RowSimilarityJob (ROWS,
>>>>>>>> COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>> Thanks, actually there are two machines. I am testing before
>>>>>>>>> spending on AWS. It's failing the test in this case.
>>>>>>>>>
>>>>>>>>> BTW, I ran the same setup with 150,000 docs and 250,000 terms with
>>>>>>>>> a much lower timeout (30000000) and all worked fine. I was using
>>>>>>>>> 0.6 at the time, and I'm not sure if 0.8 has ever completed a
>>>>>>>>> rowsimilarity of any size. Small runs work fine on my laptop.
>>>>>>>>>
>>>>>>>>> I smell some kind of problem other than simple performance. In any
>>>>>>>>> case, in a perfect world isn't the code supposed to check in often
>>>>>>>>> enough that the cluster config doesn't need to be tweaked for a
>>>>>>>>> specific job?
>>>>>>>>>
>>>>>>>>> It may be some problem of mine, of course. I see no obvious Hadoop
>>>>>>>>> or Mahout errors, but there are many places to look.
>>>>>>>>>
>>>>>>>>> With a 100-minute timeout I am currently at the pause between map
>>>>>>>>> and reduce. If it fails, would you like any specific logs?
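Sebastian's point about the quadratic cost is easy to quantify: a term that occurs in d documents generates d*(d-1)/2 cooccurring document pairs, which is presumably why a handful of high-frequency terms can dominate that COOCCURRENCES counter. A back-of-the-envelope sketch (illustrative numbers only, not from the data in this thread):

```python
def pairs_per_term(doc_freq):
    """Candidate doc pairs generated by one term appearing in doc_freq docs."""
    return doc_freq * (doc_freq - 1) // 2

# A rare term is cheap; a common one is not:
rare = pairs_per_term(10)        # 45 pairs
common = pairs_per_term(10_000)  # 49,995,000 pairs -- roughly 50 million
```

One term present in a third of a 30k-doc corpus already contributes tens of millions of pairs by itself, which is consistent with 35k docs producing 76GB of intermediate output.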
>>>>>>>>>
>>>>>>>>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>>>>>>>> To be honest, I don't think it makes a lot of sense to test a
>>>>>>>>>> Hadoop job on a single machine. It's pretty obvious that you will
>>>>>>>>>> get terrible performance.
>>>>>>>>>>
>>>>>>>>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>>>>>>>> BTW, the timeout is 1800, but the task in total runs over 9
>>>>>>>>>>> hours before each failure. This causes the job to take (after
>>>>>>>>>>> three tries) 27 hrs to completely fail. Oh, bother...
>>>>>>>>>>>
>>>>>>>>>>> The timeout seems to be during the last map, when the mappers
>>>>>>>>>>> reach 100% but are still running. Maybe some kind of cleanup is
>>>>>>>>>>> happening? The first reducer is still "pending"; the reducer
>>>>>>>>>>> never gets a chance to start.
>>>>>>>>>>>
>>>>>>>>>>> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
>>>>>>>>>>> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
>>>>>>>>>>> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
>>>>>>>>>>> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
>>>>>>>>>>> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
>>>>>>>>>>> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
>>>>>>>>>>> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
>>>>>>>>>>> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
>>>>>>>>>>> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
>>>>>>>>>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>>>>>>>>>
>>>>>>>>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs,
>>>>>>>>>>> 76,000 terms.
>>>>>>>>>>>
>>>>>>>>>>> Any other info you need, please ask.
>>>>>>>>>>>
>>>>>>>>>>> I'm about to try cranking the timeout up to a couple of hours,
>>>>>>>>>>> but I suspect there is something else going on here.
>>>>>>>>>>>
>>>>>>>>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>>>>>>>> I have a custom Lucene stemming analyzer that filters out stop
>>>>>>>>>>>> words, and I use the following seq2sparse. The -x 40 is the
>>>>>>>>>>>> only other thing that affects tossing frequent terms, and as I
>>>>>>>>>>>> understand things, it tosses any term that appears in over 40%
>>>>>>>>>>>> of the docs.
>>>>>>>>>>>>
>>>>>>>>>>>> mahout seq2sparse \
>>>>>>>>>>>>   -i b2/seqfiles/ \
>>>>>>>>>>>>   -o b2/vectors/ \
>>>>>>>>>>>>   -ow \
>>>>>>>>>>>>   -chunk 2000 \
>>>>>>>>>>>>   -x 40 \
>>>>>>>>>>>>   -seq \
>>>>>>>>>>>>   -n 2 \
>>>>>>>>>>>>   -nv \
>>>>>>>>>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Have you removed highly frequent terms before launching the
>>>>>>>>>>>>> rowsimilarity job?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>>>>>>>> I've been trying to get a rowsimilarity job to complete. It
>>>>>>>>>>>>>> continues to time out on a
>>>>>>>>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've
>>>>>>>>>>>>>> upped the timeout to 30 minutes now. There are no errors in
>>>>>>>>>>>>>> the logs that I can see, and no other task I've tried is
>>>>>>>>>>>>>> acting like this. Is this expected? Shouldn't the task check
>>>>>>>>>>>>>> in more often?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's doing 34,000 docs with 40 similar docs each on 8 cores,
>>>>>>>>>>>>>> so it is a bit slow anyway; still, I shouldn't have to turn
>>>>>>>>>>>>>> the timeout up so high, should I?
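For what it's worth, the -x 40 behavior described above (toss any term appearing in more than 40% of the docs) amounts to document-frequency pruning. A standalone sketch of the idea, not Mahout's actual implementation:

```python
from collections import Counter

def prune_frequent_terms(docs, max_df_percent=40):
    """Drop any term appearing in more than max_df_percent of the docs.

    docs: list of sets of terms (one set per document).
    Returns the docs with over-frequent terms removed.
    """
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for doc in docs for term in doc)
    cutoff = len(docs) * max_df_percent / 100.0
    keep = {term for term, count in df.items() if count <= cutoff}
    return [doc & keep for doc in docs]
```

A term surviving this cut can still be frequent enough to dominate the cooccurrence pair counts, which may be why -x 40 alone isn't always sufficient downsampling.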
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
