At the moment, the downsampling is done by PreparePreferenceMatrixJob as part of the collaborative filtering code. We just want to move it down into RowSimilarityJob so that RowSimilarityJob can be used standalone.
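To make the idea concrete, here is a rough sketch of per-row downsampling in the sense Ted describes below (dropping elements at random from rows that have more data than we want). This is not the code in PreparePreferenceMatrixJob or RowSimilarityJob: the class name RowDownsampler, the maxPerRow parameter, the row-as-list representation and the choice of reservoir sampling are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: caps the number of entries per row.
final class RowDownsampler {

    // Returns at most maxPerRow entries of the row, chosen uniformly at random
    // via reservoir sampling. Rows already within the cap are copied unchanged.
    static List<Integer> downsample(List<Integer> row, int maxPerRow, Random random) {
        if (row.size() <= maxPerRow) {
            return new ArrayList<>(row);
        }
        // Start with the first maxPerRow entries as the reservoir.
        List<Integer> kept = new ArrayList<>(row.subList(0, maxPerRow));
        for (int i = maxPerRow; i < row.size(); i++) {
            // Entry i replaces a random reservoir slot with probability maxPerRow / (i + 1).
            int j = random.nextInt(i + 1);
            if (j < maxPerRow) {
                kept.set(j, row.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A row with ten item ids, capped at three entries.
        List<Integer> row = Arrays.asList(10, 11, 12, 13, 14, 15, 16, 17, 18, 19);
        System.out.println(downsample(row, 3, new Random()));
    }
}
```

A single pass with a fixed-size reservoir seemed like a reasonable fit here because a row can be streamed once without knowing its length up front, but any uniform sampling scheme would serve the same purpose.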
I think that the CrossRecommender should be the next thing on our agenda, after we have the deployment infrastructure. I especially like that it's capable of including different kinds of interactions, as opposed to most other (academically motivated) recommenders that focus on a single interaction type like a rating.

--sebastian

On 22.07.2013 02:14, Ted Dunning wrote:
> The row similarity downsampling is just a matter of dropping elements at
> random from rows that have more data than we want.
>
> If the join that puts the row together can handle two kinds of input, then
> RowSimilarity can be easily modified to be CrossRowSimilarity. Likewise,
> if we have two DRM's with the same row id's in the same order, we can do a
> map-side merge. Such a merge can be very efficient on a system like MapR
> where you can control files to live on the same nodes.
>
> On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel <[email protected]> wrote:
>
>> RowSimilarity downsampling? Are you referring to a mod of the matrix
>> multiply to do cross similarity with LLR for the cross recommendations? So
>> similarity of rows of B with rows of A?
>>
>> Sounds like you are proposing not only putting a recommender in Solr but
>> also a cross-recommender? This is why getting a real data set is
>> problematic?
>>
>> On Jul 21, 2013, at 3:40 PM, Ted Dunning <[email protected]> wrote:
>>
>> Pat,
>>
>> Yes. The first part probably just is the RowSimilarity job, especially
>> after Sebastian puts in the down-sampling.
>>
>> The new part is exactly as you say, storing the DRM into Solr indexes.
>>
>> There is no reason to not use a real data set. There is a strong reason to
>> use a synthetic dataset, however, in that it can be trivially scaled up and
>> down both in items and users. Also, the synthetic dataset doesn't require
>> that the real data be found and downloaded.
>>
>> On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel <[email protected]> wrote:
>>
>>> Read the paper, and the preso.
>>>
>>> As to the 'offline to Solr' part. It sounds like you are suggesting an
>>> item item similarity matrix be stored and indexed in Solr. One would have
>>> to create the action matrix from user profile data (preference history), do
>>> a rowsimilarity job on it (using LLR similarity) and move the result to
>>> Solr. The first part of this is nearly identical to the current recommender
>>> job workflow and could pretty easily be created from it I think. The new
>>> part is taking the DistributedRowMatrix and storing it in a particular way
>>> in Solr, right?
>>>
>>> BTW Is there some reason not to use an existing real data set?
>>>
>>> On Jul 19, 2013, at 3:45 PM, Ted Dunning <[email protected]> wrote:
>>>
>>> OK. I think the crux here is the off-line to Solr part so let's see who
>>> else pops up.
>>>
>>> Having a solr maven could be very helpful.
