On Feb 10, 2013, at 2:39pm, Johannes Schulte wrote:

> Hi,
>
> I am currently implementing a system of the same kind: LLR-sparsified "term" cooccurrence vectors in Lucene (since not a day goes by where I don't see Ted praising this).
>
> There are not only views and purchases, but also search terms, facets and a lot more textual information to be included in the cooccurrence matrix (as "input"). That's why I went with the feature-hashing framework in Mahout. This gives small (disk/memory) user profiles and allows reusing the vectors for click prediction and/or clustering.
>
> The main difference is that there are only two fields in Lucene, each with a lot of terms (numbers) representing the features. Two fields because I think predicting views (besides purchases) might in some cases be better than predicting nothing.
>
> I don't think it should make a big difference in scoring, because in the vector space model used by most engines it's just, well, a vector space, and I don't know if the field norms make sense after stripping values from the term vectors with the LLR threshold.
>
> @Ted
>> It is handy to simply use the binary values of the sparsified versions of
>> these and let the search engine handle the weighting of different
>> components at query time.
>
> Do you really want to omit the cooccurrence counts, which would become the term frequencies? How would the engine then weight different inputs against each other?
>
> And, if anyone knows a
>
> 1. smarter way to index the cooccurrence counts in Lucene than a token stream that emits a word k times for a cooccurrence count of k

I haven't been following this discussion, but in general using payloads is a way of providing additional information about a term that can be used for scoring.
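For illustration only, a rough sketch of the payload route, assuming the Lucene 4.x TokenStream / PayloadAttribute API. The class name and the 4-byte count encoding are made up for this example; a payload-aware query or Similarity would read the count back at scoring time.

    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // One-shot stream: emits each co-occurring item id once and attaches its
    // count as a payload, instead of repeating the token k times.
    final class CooccurrenceTokenStream extends TokenStream {

      private final Iterator<Map.Entry<String, Integer>> entries;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

      CooccurrenceTokenStream(Map<String, Integer> cooccurrenceCounts) {
        this.entries = cooccurrenceCounts.entrySet().iterator();
      }

      @Override
      public boolean incrementToken() {
        if (!entries.hasNext()) {
          return false;
        }
        clearAttributes();
        Map.Entry<String, Integer> e = entries.next();
        termAtt.append(e.getKey());
        int count = e.getValue();
        // encode the co-occurrence count as a 4-byte payload on the term
        byte[] bytes = {
            (byte) (count >>> 24), (byte) (count >>> 16),
            (byte) (count >>> 8), (byte) count
        };
        payloadAtt.setPayload(new BytesRef(bytes));
        return true;
      }
    }

Such a stream can then be fed to a token-stream field (e.g. the Lucene TextField constructor that takes a TokenStream), so the counts travel with the postings rather than inflating term frequency.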
> 2. way to avoid treating the (hashed) vector column indices as terms but reusing them? It's a bit weird hashing to an int and then having the Lucene term dictionary treat them as strings, mapping to another int.

Is there a performance/size issue here?

Also I'm assuming you're using the solr.TrieIntField field type (not the string-ified value).

-- Ken

> On Sun, Feb 10, 2013 at 6:36 PM, Ted Dunning <[email protected]> wrote:
>
>> Actually, treating the different interactions separately can lead to very good recommendations. The only issue is that the interactions are no longer dyadic.
>>
>> If you think about it, having two different kinds of interactions is like adjoining the interaction matrices for the two kinds of interaction. Suppose that you have user x views in matrix A and user x purchases in matrix B. The complete interaction matrix of user x (views + purchases) is [A | B].
>>
>> When you compute cooccurrence in this matrix, you get
>>
>>                        [ A' ]             [ A'A  A'B ]
>>   [A | B]' [A | B]  =  [    ]  [A | B] =  [          ]
>>                        [ B' ]             [ B'A  B'B ]
>>
>> This matrix is (view + purchase) x (view + purchase). But we don't care about predicting views, so we only really need a matrix that is purchase x (view + purchase). This is just the bottom part of the matrix above, or [ B'A | B'B ]. When you produce purchase recommendations r_p by multiplying by a mixed view-and-purchase history vector h, which has a view part h_v and a purchase part h_p, you get
>>
>>   r_p = [ B'A  B'B ] h = B'A h_v + B'B h_p
>>
>> That is a prediction of purchases based on past views and past purchases.
>>
>> Note that this general form applies both to decomposition techniques such as SVD, ALS and LLL and to sparsification techniques such as LLR sparsification. All that changes is the mechanics of how you do the multiplications. Weighting of components works the same as well.
>>
>> What is very different here is that we have a component of cross recommendation. That is the B'A in the formula above. This is very different from a normal recommendation: it cannot be computed with the simple self-join that we normally have in the Mahout cooccurrence computation, and it is also very different from the decompositions that we normally do. It isn't hard to adapt the Mahout computations, however.
>>
>> When implementing a recommender using a search engine as the base, these same techniques also work extremely well in my experience. For each item that you would like to recommend, you have one field that holds components of B'A and one field that holds components of B'B. It is handy to simply use the binary values of the sparsified versions of these and let the search engine handle the weighting of the different components at query time. Having these components separated into different fields in the search index seems to help quite a lot, which makes a fair bit of sense.
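To make the shapes in that formula concrete, here is a toy, self-contained sketch in plain Java. The 3-user / 2-item matrices and the history vectors are made up for illustration, and it uses raw counts with no LLR sparsification or weighting.

    /** Toy illustration of r_p = B'A h_v + B'B h_p with dense arrays. */
    public class CrossRecommendationSketch {

      // multiply an (m x n) matrix by an n-vector
      static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++) {
          for (int j = 0; j < v.length; j++) {
            out[i] += m[i][j] * v[j];
          }
        }
        return out;
      }

      // X'Y: sum over users of X[user][i] * Y[user][j]
      static double[][] transposeTimes(double[][] x, double[][] y) {
        int rows = x[0].length;
        int cols = y[0].length;
        double[][] out = new double[rows][cols];
        for (int k = 0; k < x.length; k++) {
          for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
              out[i][j] += x[k][i] * y[k][j];
            }
          }
        }
        return out;
      }

      public static void main(String[] args) {
        // A: user x viewed-item, B: user x purchased-item (3 users, 2 items each)
        double[][] A = {{1, 0}, {1, 1}, {0, 1}};
        double[][] B = {{1, 0}, {0, 1}, {0, 1}};

        double[][] crossCooccurrence = transposeTimes(B, A);  // B'A: purchase x view
        double[][] selfCooccurrence  = transposeTimes(B, B);  // B'B: purchase x purchase

        // a new user's history: one viewed item, no purchases yet
        double[] h_v = {1, 0};
        double[] h_p = {0, 0};

        double[] fromViews = multiply(crossCooccurrence, h_v);
        double[] fromPurchases = multiply(selfCooccurrence, h_p);
        double[] r_p = new double[fromViews.length];
        for (int i = 0; i < r_p.length; i++) {
          r_p[i] = fromViews[i] + fromPurchases[i];    // r_p = B'A h_v + B'B h_p
        }
        System.out.println(java.util.Arrays.toString(r_p));
      }
    }

In a search-engine implementation, the rows of B'A and B'B would instead become the two indicator fields per item, and the query would carry h_v and h_p.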
>> On Sun, Feb 10, 2013 at 9:55 AM, Sean Owen <[email protected]> wrote:
>>>
>>> I think you'd have to hack the code to not exclude previously-seen items, or at least not those of the type you wish to consider. Yes, you would also have to hack it to add rather than replace existing values. Or, for test purposes, just do the adding yourself before inputting the data.
>>>
>>> My hunch is that it will hurt non-trivially to treat different interaction types as different items. You probably want to predict that someone who viewed a product over and over is likely to buy it, but this would only weakly tend to occur if the bought-item is not the same thing as the viewed-item. You'd learn they go together, but not as strongly as ought to be obvious from the fact that they're the same. Still, interesting thought.
>>>
>>> There ought to be some 'signal' in this data; it's just a question of how much vs. noise. A purchase means much more than a page view, of course; it's not as subject to noise. Finding a means to use that info is probably going to help.
>>>
>>> On Sat, Feb 9, 2013 at 7:50 PM, Pat Ferrel <[email protected]> wrote:
>>>
>>>> I'd like to experiment with using several types of implicit preference values with recommenders. I have purchases as an implicit pref of high strength. I'd like to see if add-to-cart, view-product-details, impressions-seen, etc. can increase offline precision in holdout tests. These less-than-obvious implicit prefs will get a much lower value than purchase, and I'll experiment with different mixes. The problem is that some of these prefs will indicate that the user, for whom I'm getting recs, has expressed a preference.
>>>>
>>>> Using these implicit prefs seems reasonable for finding similarity of taste between users, but it presents several problems. 1) How to encode the prefs: each impression-seen should increase the strength of a user's preference for an item, but the recommender framework replaces the preference value for items preferred more than once, doesn't it? 2) AFAIK the current recommender framework will return recs only for items that the user in question has expressed no preference for. If I use something like view-product-details or impressions-seen, I will be removing anything the user has seen from the recs, which is not what I want in this experiment.
>>>>
>>>> Has anyone tried something like this? I'm not convinced that these other implicit preferences will add anything to the recommender; just trying to find out.
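As an aside on Sean's "just do the adding yourself before inputting the data" suggestion, a minimal sketch of summing weighted implicit events per user/item before building a Mahout Taste DataModel; the event weights, user IDs and item IDs are invented purely for illustration, and pre-aggregating this way also sidesteps the replace-instead-of-add behaviour Pat mentions.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
    import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.PreferenceArray;

    /** Sum weighted implicit events per (user, item) before building the DataModel. */
    public class WeightedEventAggregation {

      public static void main(String[] args) {
        // hypothetical weights per event type -- something to tune in holdout tests
        double purchase = 4.0, addToCart = 2.0, view = 0.5;

        // userId -> (itemId -> accumulated preference value)
        Map<Long, Map<Long, Double>> sums = new HashMap<Long, Map<Long, Double>>();
        add(sums, 1L, 101L, view);
        add(sums, 1L, 101L, view);       // repeated views accumulate instead of replacing
        add(sums, 1L, 101L, purchase);
        add(sums, 2L, 101L, addToCart);

        FastByIDMap<PreferenceArray> prefs = new FastByIDMap<PreferenceArray>();
        for (Map.Entry<Long, Map<Long, Double>> user : sums.entrySet()) {
          GenericUserPreferenceArray array =
              new GenericUserPreferenceArray(user.getValue().size());
          int i = 0;
          for (Map.Entry<Long, Double> item : user.getValue().entrySet()) {
            array.setUserID(i, user.getKey());
            array.setItemID(i, item.getKey());
            array.setValue(i, item.getValue().floatValue());
            i++;
          }
          prefs.put(user.getKey(), array);
        }

        DataModel model = new GenericDataModel(prefs);   // feed this to a recommender
        System.out.println(model);
      }

      static void add(Map<Long, Map<Long, Double>> sums, long user, long item, double w) {
        Map<Long, Double> byItem = sums.get(user);
        if (byItem == null) {
          byItem = new HashMap<Long, Double>();
          sums.put(user, byItem);
        }
        Double current = byItem.get(item);
        byItem.put(item, current == null ? w : current + w);
      }
    }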
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr