Just to make sure I understood correctly, Ted, could you please correct me if I'm wrong? :)
1. Using a search engine, I will treat items as documents, where each document vector consists of other items (analogous to the "words" of a document) with co-occurrence (LLR) weights instead of a search engine's tf-idf weights. So for each item I will have a sparse vector representing its relation to other items, with a non-zero entry wherever an indicator makes the item-to-item co-occurrence significant. (I think I will only use positive feedback, since I am counting co-occurrences.)

2. To present recommendations, the system formulates a "query" from a history of items: the session history for task-based recommendation, or a long-term history. The search engine then finds the top-N items based on the cosine similarities between the item (document) vectors and the history (query) vector.

3. For example, in a restaurant recommender, if we knew a restaurant was famous for its sushi, I would index this in another field, "famous_for". Now if, as a user, I asked for sushi restaurants I would enjoy, the system would add this to the query along with my history, and the famous sushi restaurant would rank higher in the results, even if, according to the computation in 2 alone, I would be equally likely to enjoy a steakhouse.

4. Since this is a search engine, and a search engine can boost a particular field, the system could let "famous_for" outweigh the collaborative activity, or the opposite, depending on the use case (or, for example, on the number of items in the history). So I can define a weighting (voting, or mixture-of-experts) scheme to "blend" different recommenders.

Is all of that correct?

Gokhan

On Mon, Jul 22, 2013 at 9:07 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> On 07/22/2013 12:20 PM, Pat Ferrel wrote:
>
>> My understanding of the Solr proposal puts B's row similarity matrix in a
>> vector per item. That means each row is turned into "terms" = external
>> IDs -- not sure how the weights of each term are encoded.
>
> This is the key question for me. The best idea I've had is to use termFreq
> as a proxy for weight. It's only an integer, so there are scaling issues
> to consider, but you can apply a per-field weight to manage that. Also,
> Lucene (and Solr) doesn't provide an obvious way to load term frequencies
> directly: probably the simplest thing to do is just to repeat the
> cross-term N times and let the text analysis take care of counting them.
> Inefficient, but probably the quickest way to get going. Alternatively,
> there are some lower-level Lucene indexing APIs (DocFieldConsumer et al.),
> which I haven't really plumbed entirely, but they would allow for more
> direct loading of fields.
>
> Then one probably wants to override the scoring in some way (unless TFIDF
> is the way to go somehow??)
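
[Editor's note: the scheme in points 1 and 2, combined with Michael's termFreq-as-weight trick of repeating a cross-term N times, can be sketched in a few lines of plain Python. This is a toy illustration, not the actual Solr/Mahout implementation: the session data and the `recommend` function are made up, and raw co-occurrence counts stand in for the LLR weights a real system would compute.]

```python
from collections import Counter
from math import sqrt

# Toy positive-feedback data: each session is a set of items a user engaged with.
# (Assumption: a real system would get weights from Mahout's LLR-based
# row similarity, not raw counts.)
sessions = [
    {"sushi_ko", "tempura_house", "ramen_bar"},
    {"sushi_ko", "ramen_bar"},
    {"steak_pit", "burger_barn"},
    {"sushi_ko", "tempura_house"},
]

items = sorted(set().union(*sessions))

# Point 1: build a sparse co-occurrence "vector" per item.
cooc = {a: Counter() for a in items}
for s in sessions:
    for a in s:
        for b in s:
            if a != b:
                cooc[a][b] += 1

# Michael's trick: "index" each item as a document whose terms are the
# co-occurring item IDs, repeated weight-many times so that term frequency
# encodes the weight.
index = {a: [b for b, w in cooc[a].items() for _ in range(w)]
         for a in items}

def cosine(query_terms, doc_terms):
    """Cosine similarity between two bags of terms (term-frequency vectors)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    nq = sqrt(sum(v * v for v in q.values()))
    nd = sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def recommend(history, top_n=2):
    """Point 2: the user's item history is the query; rank unseen items."""
    scores = {a: cosine(history, index[a]) for a in items if a not in history}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend(["sushi_ko"]))
```

Field boosting (points 3 and 4) would then be layered on top in Solr itself, e.g. by querying the co-occurrence field and a "famous_for" field with different boosts.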