Sean's old Myrrix slides contain an overview of the fold-in math: http://www.slideshare.net/srowen/big-practical-recommendations-with-alternating-least-squares/14?src=clipshare
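
For what it's worth, the in-place update Sean describes further down the thread (pick a new target for the (user,item) estimate, then move one vector while holding the other fixed) can be sketched in a few lines of plain Python. This is only a rough illustration under my own assumptions: the function names, the 0.1 step rate and the "smallest change" solve are all hypothetical simplifications, and not necessarily what Oryx or the slides do.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def target_estimate(current, strength, rate=0.1):
    """Hypothetical rule: nudge the estimate toward 1 for a positive
    interaction and toward 0 for a negative one."""
    goal = 1.0 if strength > 0 else 0.0
    step = min(1.0, rate * abs(strength))
    return current + step * (goal - current)


def fold_in(fixed, moving, target):
    """Return moving + d, where d is the smallest (L2-norm) change such
    that dot(fixed, moving + d) == target. `fixed` is held constant."""
    norm_sq = dot(fixed, fixed)
    if norm_sq == 0.0:
        return list(moving)  # a zero vector cannot move the estimate
    scale = (target - dot(fixed, moving)) / norm_sq
    return [m + scale * f for m, f in zip(moving, fixed)]


# Made-up 3-feature vectors for one user and one item.
user = [0.2, 0.5, -0.1]
item = [0.4, 0.1, 0.3]

current = dot(user, item)
target = target_estimate(current, strength=1.0)  # one positive interaction

new_item = fold_in(user, item, target)  # item vector moves, user vector fixed
new_user = fold_in(item, user, target)  # vice versa
```

Each updated vector reproduces the target estimate for this one (user,item) pair. Every other estimate involving that vector shifts as well, which is the 'approximate' bit.
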
I never quite got around to actually incorporating it into my own ALS-based systems, because in the end I just re-computed models every day and found other ways to incorporate real-time elements using Elasticsearch.

On Fri, 11 Mar 2016 at 01:12 Chris Fregly <[email protected]> wrote:

> @Colin- you're asking the $1 million dollar question that a lot of people
> are trying to do. This was literally the #1 most-asked question in every
> city on my recent world-wide meetup tour.
>
> I've been pointing people to my old Databricks co-worker's
> streaming-matrix-factorization project:
> https://github.com/brkyvz/streaming-matrix-factorization
> He got tired of everyone asking about this - and cranked it out over a
> weekend. Love that guy, Burak! :)
>
> I've attempted (unsuccessfully, so far) to deploy exactly what you're
> trying to do here:
> https://github.com/fluxcapacitor/pipeline/blob/master/myapps/streaming/src/main/scala/com/advancedspark/streaming/rating/ml/TrainMFIncremental.scala
>
> We're a couple pull requests away from making this happen. You can see my
> comments and open github issues for the remaining bits.
>
> And this will be my focus in the next week or so as I prepare for an
> upcoming conference. Keep an eye on this repo if you'd like.
>
> @Sean: thanks for the link. I knew Oryx was doing this somehow - and I
> kept meaning to see how you were doing it. I'll likely incorporate some of
> your stuff into my final solution.
>
> On Thu, Mar 10, 2016 at 3:35 PM, Sean Owen <[email protected]> wrote:
>
>> While it isn't crazy, I am not sure how valid it is to build a model
>> off of only a chunk of recent data and then merge it into another
>> model in any direct way. They're not really sharing a basis, so you
>> can't just average them.
>>
>> My experience with this aspect suggests you should try to update the
>> existing model in place on the fly. In short, you figure out how much
>> the new input ought to change your estimate of the (user,item)
>> association. Positive interactions should increase it a bit, etc. Then
>> you work out how the item vector would change if the user vector were
>> fixed in order to accomplish that change, with a bit of linear
>> algebra. Vice versa for user vector. Of course, those changes affect
>> the rest of the matrix too but that's the 'approximate' bit.
>>
>> I so happen to have an implementation of this in the context of a
>> Spark ALS model, though raw source code may be hard to read. If it's
>> of interest we can discuss offline (or online here to the extent it's
>> relevant to Spark users)
>>
>> https://github.com/OryxProject/oryx/blob/91004a03413eef0fdfd6e75a61b68248d11db0e5/app/oryx-app/src/main/java/com/cloudera/oryx/app/speed/als/ALSSpeedModelManager.java#L192
>>
>> On Thu, Mar 10, 2016 at 8:01 PM, Colin Woodbury <[email protected]> wrote:
>> > Hi there, I'm wondering if it's possible (or feasible) to combine the
>> > feature matrices of two MatrixFactorizationModels that share a user and
>> > product set.
>> >
>> > Specifically, one model would be the "on-going" model, and the other is
>> > one trained only on the most recent aggregation of some event data. My
>> > overall goal is to try to approximate "online" training, as ALS doesn't
>> > support streaming, and it also isn't possible to "seed" the ALS training
>> > process with an already trained model.
>> >
>> > Since the two Models would share a user/product ID space, can their
>> > feature matrices be merged? For instance via:
>> >
>> > 1. Adding feature vectors together for user/product vectors that appear
>> >    in both models
>> > 2. Averaging said vectors instead
>> > 3. Some other linear algebra operation
>> >
>> > Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS
>> > itself. Is what I'm asking possible?
>> >
>> > Thank you,
>> > Colin
>
> --
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
