A note about today's hangout regarding the cross-recommender. In general, it may be a good idea to think about the current and proposed system as two pipelines:
1) A pipeline that takes preference data, turns it into two preference matrices in Mahout DRM form, and creates [B'B] and [B'A], ideally using the LLR RowSimilarityJob and CrossRowSimilarityJob. This generates two DRMs with Mahout Keys and VectorWritable(s) that use internal numerical Mahout IDs. There is one ID space for B and one for A. In the GitHub repo this pipeline also creates recommendations in Mahout form using an item-based RecommenderJob and XRecommenderJob. This last step is not needed when using Solr but may be useful for comparison. These jobs are all mapreduce and closely match the Mahout code and model of calculation.

2) A pipeline that processes IDs and other metadata contained in the logs. The user IDs are in string form, as are the item IDs, but the items for action A may be completely different from those for B. The cross-recommender ties the two together with a generalized notion of significant cooccurrence by executing the #1 pipeline and using the results. These log file IDs are what get written out to Solr; which IDs go into each row is encoded in the two Mahout-generated DRMs. The pipeline may need to bring along other metadata mined from the logs, like item descriptions, tags, categories, etc.

Note: this last bit is not built in at present but would make Solr queries even better. Also, at present A and B are assumed to have the same item IDs. This works for purchase+view actions and others, but not for some cross-actions that would be useful, like music track listen + tagged category listen -> track recommendation, or music tagged category listen + track listen -> category recommendation.

The current action items are:
1) #1 is running and works but eventually needs to be reintegrated with new Mahout trunk code--my action item, with Sebastian's help.
2) #2 needs to write the merged DRMs to Solr as one doc per row and 3 fields per doc (id, B'B, B'A)--I'm working on this now.
3) To generalize further we need to account for different ID spaces in #2, and I'll take that as an action item.
4) Adding more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use, we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz.
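
To make action items 2 and 3 a bit more concrete, here is a minimal Python sketch of the merge step. It is only an illustration and not the project's mapreduce code. It assumes each DRM has already been read into a plain dict of row ID -> {column ID: strength}, and that two dictionaries map internal Mahout IDs back to the external string IDs from the logs (one for B items, one for A items, so the two ID spaces can differ). The function names and the field names `b_b_links` and `b_a_links` are hypothetical.

```python
import json


def row_to_field(row_vector, id_map):
    """Turn one sparse DRM row into a space-delimited string of external IDs.

    Columns are ordered by descending strength; ties fall back to the
    internal column ID so the output is deterministic.
    """
    ranked = sorted(row_vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(id_map[col] for col, _ in ranked)


def merge_drms_to_docs(b_b, b_a, b_item_ids, a_item_ids):
    """Build one document per row with three fields: id, B'B and B'A.

    Rows of both matrices are B items. Columns of [B'B] are B items and
    columns of [B'A] are A items, so each field uses its own ID map
    (assumption: this is how the different ID spaces would be handled).
    """
    docs = []
    for row in sorted(set(b_b) | set(b_a)):
        docs.append({
            "id": b_item_ids[row],
            "b_b_links": row_to_field(b_b.get(row, {}), b_item_ids),
            "b_a_links": row_to_field(b_a.get(row, {}), a_item_ids),
        })
    return docs


if __name__ == "__main__":
    # Made-up toy data: B items are tracks, A items are tagged categories.
    b_item_ids = {0: "track-a", 1: "track-b", 2: "track-c"}
    a_item_ids = {0: "tag-rock", 1: "tag-jazz"}
    b_b = {0: {1: 4.2, 2: 1.1}, 1: {0: 4.2}}
    b_a = {0: {0: 3.0}, 2: {1: 2.5, 0: 0.7}}
    for doc in merge_drms_to_docs(b_b, b_a, b_item_ids, a_item_ids):
        print(json.dumps(doc))
```

The resulting dicts could then be serialized and posted to Solr by whatever indexing path the consumer prefers. Extra metadata fields (action item 4) would just be more keys on each doc.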
