Pat and Ted: I am late to the party, but this is very interesting! I am not sure I understand all the steps, though. Do you still create a cooccurrence matrix and compute LLR scores during this process, or do you only multiply the matrices by the history vector: B'B * h and B'A * h?
Cheers,
Frank

On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel wrote:

> I finally got some time to work on this and have a first cut at output to
> Solr working on the github repo. It only works on 2-action input but I'll
> have that cleaned up soon so it will work with one action. Solr indexing
> has not been tested yet and the field names and/or types may need tweaking.
>
> It takes the result of the previous drop:
> 1) DRMs for B (user history of B items action1) and A (user history of A
> items action2)
> 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
>
> There are two final outputs created using mapreduce but requiring 2
> in-memory hashmaps. I think this will work on a cluster (the hashmaps are
> instantiated on each node) but haven't tried yet. It orders items in #2
> fields by strength of "link", which is the similarity value used in [B'B]
> or [B'A]. It would be nice to order #1 by recency but there is no provision
> for passing through timestamps at present so they are ordered by the
> strength of preference. This is probably not useful and so can be ignored.
> Ordering by recency might be useful for truncating queries by recency while
> leaving the training data containing 100% of available history.
>
> 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,history_b,history_a
> user1,iphone ipad,iphone ipad galaxy
> ...
>
> 2) It joins #2 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,b_b_links,b_a_links
> u1,iphone ipad,iphone ipad galaxy
> …
>
> It may work on a cluster, I haven't tried yet. As soon as someone has some
> large-ish sample log files I'll give them a try. Check the sample input
> files in the resources dir for format.
>
> https://github.com/pferrel/solr-recommender
>
>
> On Aug 13, 2013, at 10:17 AM, Pat Ferrel wrote:
>
> When I started looking at this I was a bit skeptical.
> As a search engine Solr may be peerless, but as yet another NoSQL db?
>
> However, getting further into this I see one very large benefit. It has one
> feature that sets it completely apart from the typical NoSQL db. The type
> of queries you do returns fuzzy results--in the very best sense of that
> word. The most interesting queries are based on similarity to some
> exemplar. Results are returned in order of similarity strength, not ordered
> by a sort field.
>
> Wherever similarity-based queries are important I'll look at Solr first.
> SolrJ looks like an interesting way to get Solr queries on POJOs. It's
> probably at least an alternative to using docs and CSVs to import the data
> from Mahout.
>
>
> On Aug 12, 2013, at 2:32 PM, Ted Dunning wrote:
>
> Yes. That would be interesting.
>
>
> On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan wrote:
>
> > A little digression: Might a Matrix implementation backed by a Solr index
> > and using SolrJ for querying help at all for the Solr recommendation
> > approach?
> >
> > It supports multiple fields of String, Text, or boolean flags.
> >
> > Best
> > Gokhan
> >
> >
> > On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel wrote:
> >
> >> Also a question about user history.
> >>
> >> I was planning to write these into separate directories so Solr could
> >> fetch them from different sources but it occurs to me that it would be
> >> better to join A and B by user ID and output a doc per user ID with
> >> three fields: id, A item history, and B item history. Other fields
> >> could be added for user metadata.
> >>
> >> Sound correct? This is what I'll do unless someone stops me.
> >>
> >> On Aug 7, 2013, at 11:25 AM, Pat Ferrel wrote:
> >>
> >> Once you have a sample or example of what you think the
> >> "log file" version will look like, can you post it?
> >> It would be great to
> >> have example lines for two actions with or without the same item IDs.
> >> I'll make sure we can digest it.
> >>
> >> I thought more about the ingest part and I don't think the
> >> one-item-space is actually a problem. It just means one item
> >> dictionary. A and B will have the right content; all I have to do is
> >> make sure the right ranks are input to the MM, Transpose, and RSJ.
> >> This in turn is only one extra count of the # of items in A's item
> >> space. This should be a very easy change if my thinking is correct.
> >>
> >>
> >> On Aug 7, 2013, at 8:09 AM, Ted Dunning wrote:
> >>
> >> On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel wrote:
> >>
> >>> 4) To add more metadata to the Solr output will be left to the consumer
> >>> for now. If there is a good data set to use we can illustrate how to
> >>> do it in the project. Ted may have some data for this from musicbrainz.
> >>
> >>
> >> I am working on this issue now.
> >>
> >> The current state is that I can bring in a bunch of track names and
> >> links to artist names and so on. This would provide the basic set of
> >> items (artists, genres, tracks and tags).
> >>
> >> There is a hitch in bringing in the data needed to generate the logs
> >> since that part of MB is not Apache compatible. I am working on that
> >> issue.
> >>
> >> Technically, the data is in a massively normalized relational form right
> >> now, but it isn't terribly hard to denormalize into a form that we need.
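
For readers following the thread, the per-user join Pat describes (one CSV doc per user ID with id, history_b, and history_a fields) can be sketched in a few lines. This is only an illustration in plain Python, not the mapreduce code in the solr-recommender repo. The function names join_histories and to_csv and the dict-based inputs are assumptions made up for the example; only the field names and the sample row come from the thread.

```python
import csv
import io


def join_histories(history_b, history_a):
    """Join two per-user item histories on user ID.

    Both arguments are hypothetical dicts mapping a user ID to a list of
    item IDs. A user missing from one side gets an empty field for it.
    Returns (id, history_b, history_a) tuples sorted by user ID, with each
    history written as a space-separated string of item IDs.
    """
    rows = []
    for user_id in sorted(set(history_b) | set(history_a)):
        rows.append((
            user_id,
            " ".join(history_b.get(user_id, [])),
            " ".join(history_a.get(user_id, [])),
        ))
    return rows


def to_csv(rows, header=("id", "history_b", "history_a")):
    """Render joined rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Sample row taken from the thread: action1 (B) and action2 (A) histories.
    b = {"user1": ["iphone", "ipad"]}
    a = {"user1": ["iphone", "ipad", "galaxy"]}
    print(to_csv(join_histories(b, a)), end="")
```

The same join, keyed on item ID and given the header id,b_b_links,b_a_links, would produce the second set of docs. Ordering items within a field by link strength is left out here because the sketch carries no similarity values.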
