We have had a "cross recommender" in production for about three months now, with the difference that we build the Lucene indices directly from MapReduce, and we do the same thing for 30+ customers, most of them with different input data structures (field names, values).
We had something similar before (Lucene, multiple cross relations), but also used the similarity score (LLR) with a custom similarity and payloads, and switched to pure "Tedism" after some helpful comments here. Therefore I read this thread with a lot of interest. What I can add from my experience:

1. I find it much easier to talk about this not in matrix multiplication language but with contingency tables (a and b, a and not b, not a and b, not a and not b), and I also find the classical Mahout similarity jobs hard to use. This is probably because of my basic matrix math skills, but also because using matrices forces everything onto ids, while the extracted items are often text (search term, country, page section). Thinking of this as related terms automatically gives a "document" view of the item to be recommended (the Lucene doc), where description, name and everything else is also just a field. (A rough sketch of the contingency table / LLR computation is below.)

2. When doing a simple table it is just cooccurrences, marginals and totals. Since the dimension of the marginals is often not too big (items, browsers, terms), we currently accumulate the counts in memory; maybe the RowSimilarityJob works the same way. This can be changed to a different implementation such as an on-disk hash table or even a count-min sketch if the number of items gets too large. The main point is that the counting of marginals can be done on the fly while emitting all cooccurrences (see the counting sketch below).

3. Earlier in the thread there was a tip about approximating similarity scores by repeating terms. Payloads are a better way to do this, and with Lucene 4's doc values capability there shouldn't be any Mahout similarity that is not expressible as a Lucene similarity. Maybe it would be helpful to provide a Lucene delivery system for the "classic" Mahout recommender package as well. It adds so many possibilities for filtering and takes away a lot of pain points like caching etc.

4. A big question is the frequency of rebuilding. While the relations can often stay untouched for a day, the item data may change much more often (item churn, new items). It is therefore beneficial to separate the two and to be able to rebuild the final index without calculating all the similarities again (for very critical things this often means querying some external source to build up a Lucene filter that restricts the index). A sketch of such a rebuild is below.
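
To make 1. a bit more concrete, here is a minimal, self-contained sketch of the score computed straight from the four cells of the contingency table (as far as I remember this is essentially what Mahout's LogLikelihood helper does; the class name here is just for illustration):

// Log-likelihood ratio (G^2) for a 2x2 contingency table:
// k11 = a and b, k12 = a and not b, k21 = not a and b, k22 = not a and not b.
public final class Llr {

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy of a set of counts.
  private static double entropy(long... counts) {
    long sum = 0;
    double result = 0.0;
    for (long x : counts) {
      result += xLogX(x);
      sum += x;
    }
    return xLogX(sum) - result;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy < matrixEntropy) {
      return 0.0; // guard against floating point round-off
    }
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }
}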
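
And for 2., a stripped-down version of the in-memory counting: while the cooccurrence pairs are collected, the marginals and the total are accumulated on the fly, so every pair can later be turned into a contingency table directly. All names are made up for illustration, the maps could be replaced by an on-disk hash table or a count-min sketch, and it reuses the Llr class from the sketch above:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// One pass over "baskets" (the distinct items seen together for one user
// or session): cooccurrences, marginals and the total are counted together.
public final class CooccurrenceCounter {

  private final Map<String, Long> marginals = new HashMap<String, Long>();
  private final Map<String, Long> cooccurrences = new HashMap<String, Long>(); // key "a\tb"
  private long totalBaskets = 0;

  private static void increment(Map<String, Long> counts, String key) {
    Long current = counts.get(key);
    counts.put(key, current == null ? 1L : current + 1L);
  }

  private static long count(Map<String, Long> counts, String key) {
    Long current = counts.get(key);
    return current == null ? 0L : current;
  }

  public void addBasket(Set<String> items) {
    totalBaskets++;
    for (String a : items) {
      increment(marginals, a);
      for (String b : items) {
        if (!a.equals(b)) {
          increment(cooccurrences, a + "\t" + b);
        }
      }
    }
  }

  // Turn one pair back into the four contingency table cells and score it.
  public double score(String a, String b) {
    long k11 = count(cooccurrences, a + "\t" + b); // a and b
    long k12 = count(marginals, a) - k11;          // a and not b
    long k21 = count(marginals, b) - k11;          // not a and b
    long k22 = totalBaskets - k11 - k12 - k21;     // not a and not b
    return Llr.logLikelihoodRatio(k11, k12, k21, k22);
  }
}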
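
For 4., the separation basically means keeping the computed links per item around and rebuilding the final index by joining them with the current item metadata only. A very rough sketch against Lucene 4 (the field names, the maps and the full-rebuild strategy are assumptions for illustration, not how it has to be done):

import java.io.File;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Rebuilds the delivery index from precomputed cross-links plus the
// current item metadata, without recomputing any similarities.
public final class IndexRebuilder {

  public void rebuild(File indexDir,
                      Map<String, String> crossLinksById,            // item id -> "itemX itemY ..."
                      Map<String, Map<String, String>> metadataById) // item id -> fresh metadata fields
      throws Exception {
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config);
    try {
      writer.deleteAll(); // full rebuild; per-id updates would work as well
      for (Map.Entry<String, String> entry : crossLinksById.entrySet()) {
        String id = entry.getKey();
        Document doc = new Document();
        doc.add(new StringField("id", id, Store.YES));
        doc.add(new TextField("b_b_links", entry.getValue(), Store.YES));
        // the fresh item data (name, description, ...) is just more fields
        Map<String, String> metadata = metadataById.get(id);
        if (metadata != null) {
          for (Map.Entry<String, String> field : metadata.entrySet()) {
            doc.add(new TextField(field.getKey(), field.getValue(), Store.YES));
          }
        }
        writer.addDocument(doc);
      }
      writer.commit();
    } finally {
      writer.close();
    }
  }
}

The external-source filter mentioned in 4. would then only be applied at query time, so it does not require any reindexing at all.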
Besides that, I am very happy to see the ongoing effort on this topic and hope that I can contribute something someday.

Cheers,
Johannes


On Mon, Aug 5, 2013 at 10:27 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel <[email protected]> wrote:
>
> > Yeah thought of that one too but it still requires each be ordered by Key,
> > in which case simultaneous iteration works in one pass I think.
> >
>
> Multipass does not require ordering by key. Solr documents can be updated
> in any order.
>
> > If the DRMs are always sorted by Key you can iterate through each at the
> > same time, writing only when you have both fields or know there is a field
> > missing from one DRM. If you get the same key you write a combined doc, if
> > you have different ones, write out one sided until it catches up to the
> > other.
> >
>
> Yes. Merge will work when files are ordered and split consistently. I
> don't think we should be making that assumption.
>
> > Every DRM I've examined seems to be ordered by key and I assume that is
> > not an artifact of seqdumper. I'm using SequenceFileDirIterator so the part
> > file splits aren't a problem.
> >
>
> But with the co- and cross- occurrence stuff, file splits could be a
> problem.
>
> > A m/r join is pretty simple too but I'll go with non-m/r unless there is a
> > problem above.
> >
>
> The simplest join is to use Solr updates. This would require a minimal
> amount of programming, but less than writing a merge program.
>
> > BTW the schema for the Solr csv is:
> > id,b_b_links,b_a_links
> > item1,"itemX itemY","itemZ"
> >
> > am I missing some "normal metadata"?
> >
>
> An item description is nice.
>
