A note about today's hangout regarding the cross-recommender.

In general it may be a good way to think about the current and proposed system 
as two pipelines:

1) a pipeline that takes preference data, turns it into two preference matrices 
in Mahout DRM form, and creates [B'B] and [B'A], ideally using LLR Row and 
CrossRowSimilarityJobs. This generates two DRMs with Mahout Keys and 
VectorWritable(s) with internal numerical Mahout IDs. There is one ID space for 
B and one for A. In the GitHub repo these also create recommendations in Mahout 
form as an item-based RecommenderJob and XRecommenderJob. This last step is 
not needed when using Solr but may be useful for comparison. These jobs are all 
MapReduce and closely match the Mahout code and model of calculation.
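
As a rough illustration of what pipeline #1 computes, here is a minimal 
in-memory sketch in plain Python. It is not the Mahout code: the toy matrices 
are made up, rows are users and columns are items keyed by internal integer 
IDs, and it produces raw cooccurrence counts only. The LLR filtering done by 
the real jobs is left out.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    """Multiply two list-of-lists matrices."""
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]


# Toy data (hypothetical): 3 users x 3 items, 1 = the user took the action.
B = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]
A = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
]

Bt = transpose(B)
BtB = matmul(Bt, B)  # item-item cooccurrence within action B
BtA = matmul(Bt, A)  # cross-cooccurrence between B items and A items

print(BtB)
print(BtA)
```

Row i of BtB lists how often B item i cooccurs with every other B item, and 
row i of BtA does the same against the A items, which is why the two results 
can later be merged row by row.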

2) a pipeline that processes IDs and other metadata contained in the logs. The 
user IDs and item IDs are both in string form, but the items for action A may 
be completely different from those for B. The cross-recommender ties the two 
together with a generalized notion of significant cooccurrence by executing 
pipeline #1 and using the results. These log file IDs are what gets written 
out to Solr; which IDs get written is encoded in the two Mahout-generated 
DRMs. The pipeline may need to bring along other metadata mined from the logs 
like item descriptions, tags, categories, etc. Note: this last bit is not 
built in at present but would make Solr queries even better. Also, at present A 
and B are assumed to have the same item IDs. This works for purchase+view 
actions and others, but not for some cross-actions that would be useful, like 
music track listen + tagged category listen -> track recommendation or music 
tagged category listen + track listen -> category recommendation.
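
To make the ID handling concrete, here is a small sketch of how string IDs 
from the logs could be mapped to internal integer IDs while keeping separate 
item ID spaces for A and B. Everything in it is an assumption for 
illustration: the IdIndex class, the log tuple layout, and the action names 
are hypothetical and are not taken from the project.

```python
class IdIndex:
    """Bidirectional map between external string IDs and dense integer IDs."""

    def __init__(self):
        self._to_int = {}
        self._to_ext = []

    def add(self, ext_id):
        """Return the integer ID for ext_id, assigning a new one if needed."""
        if ext_id not in self._to_int:
            self._to_int[ext_id] = len(self._to_ext)
            self._to_ext.append(ext_id)
        return self._to_int[ext_id]

    def external(self, int_id):
        """Translate an integer ID back to the original string ID."""
        return self._to_ext[int_id]


# Hypothetical log records: (user ID, action, item ID).
log = [
    ("u1", "listen-track", "track-9"),
    ("u1", "listen-category", "jazz"),
    ("u2", "listen-track", "track-3"),
    ("u2", "listen-category", "jazz"),
]

users = IdIndex()    # shared by both actions
items_b = IdIndex()  # item ID space for action B (tracks)
items_a = IdIndex()  # item ID space for action A (categories)
pairs_b, pairs_a = [], []

for user, action, item in log:
    u = users.add(user)
    if action == "listen-track":
        pairs_b.append((u, items_b.add(item)))
    else:
        pairs_a.append((u, items_a.add(item)))

print(pairs_b, pairs_a)
print(items_a.external(0))
```

The (user, item) integer pairs are what would feed pipeline #1, and the 
external() lookups are what turn its output rows back into log file IDs.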

The current action items are:
1) #1 is running and works but eventually needs to be reintegrated with the new 
Mahout trunk code--my action item, with Sebastian's help.
2) #2 needs to write the merged DRMs to Solr as one doc per row and 3 fields 
per doc (id, B'B, B'A)--I'm working on this now.
3) To generalize further we need to account for different ID spaces in #2 and 
I'll take that as an action item.
4) Adding more metadata to the Solr output will be left to the consumer for 
now. If there is a good data set to use we can illustrate how to do it in the 
project. Ted may have some data for this from musicbrainz.
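
For action item 2, the shape of the output could look something like the 
sketch below: one document per row with three fields. This is only an 
assumption about the layout. The field names b_b_links and b_a_links, the 
space-delimited encoding, and the sample item IDs are hypothetical, and it 
assumes A and B share one item ID space as they do at present. It builds 
plain dicts and prints JSON; it does not call Solr.

```python
import json


def build_docs(item_ids, btb_rows, bta_rows):
    """Merge rows of B'B and B'A into one doc per item.

    item_ids: external string ID for each internal row index.
    btb_rows, bta_rows: dict of row index -> list of column indices kept
    for that row (for example, after LLR filtering).
    """
    docs = []
    for row, ext_id in enumerate(item_ids):
        docs.append({
            "id": ext_id,
            "b_b_links": " ".join(item_ids[j] for j in btb_rows.get(row, [])),
            "b_a_links": " ".join(item_ids[j] for j in bta_rows.get(row, [])),
        })
    return docs


# Made-up sample data.
item_ids = ["ipad", "iphone", "nexus"]
btb_rows = {0: [1], 1: [0, 2], 2: [1]}
bta_rows = {0: [1, 2], 1: [0]}

docs = build_docs(item_ids, btb_rows, bta_rows)
print(json.dumps(docs, indent=2))
```

A row with no surviving entries simply gets an empty field, so every item 
still has a document that extra metadata fields could be attached to later.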
