Yahoo is building what they call a 2-stage hierarchical model. I am not disputing that they use EM etc. to solve the individual stages; I understand that. Nor am I disputing that they are primarily motivated by solving the cold start problem; I understand that as well.
But what they build follows similar reasoning, if not the same, as here: http://en.wikipedia.org/wiki/Hierarchical_Bayes_model Is it not? It is possible I am mixing things up here and this hierarchy is not directly Bayesian, but the motivation is similar?

I am just saying that we can generalize the problem to hierarchies that don't have to be 2-stage. That's all. I am also saying that a practical problem I have at hand has more than 2 stages. I don't know what the best way to solve it would be. But it seems to me that hierarchical learning analogous to these could be extended to a more general case with multiple hierarchies on the side info, or even on user/item content profiles.

For example, say a user and an item sometimes interact, and you always know the time of day when it happens (just a sheer example). But sometimes (far from always) you also happen to know the weather, and/or the geo where it happened. Can't we make use of that information by adding another stage to the hierarchy?

On Wed, Feb 2, 2011 at 8:54 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I am basically retracing generalization of the Bayesian inference
> problem given in Yahoo paper. I am too lazy to go back for a quote.
>
> The SVD problem was discussed at meetups, basically the criticism
> here is that for RxC matrix whenever there's a missing measurement,
> one can't specify 'no measurement' but rather have to leave it at some
> neutral value (0? average?) which is essentially nothing but a noise
> since it's not a sample. As one guy from Stanford demonstrated on
> Netflix data, the whole system collapses very quickly after certain
> threshold of sample sparsity is reached.
>
> On Wed, Feb 2, 2011 at 7:20 PM, Ted Dunning <[email protected]> wrote:
>> Dmitriy,
>> I am not clear what you are saying entirely, but as far as I can understand
>> your points, I think I disagree. Of course, if I don't catch your drift, I
>> might be wrong and we might be in agreement.
>>
>> On Wed, Feb 2, 2011 at 2:43 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> both Elkan's work and Yahoo's paper are based on the notion (which is
>>> confirmed by SGD experience) that if we try to substitute missing data with
>>> neutral values, the whole learning falls apart. Sort of.
>>
>> I don't see why you say that. Elkan and Yahoo want to avoid the cold start
>> process by using user and item offsets and by using latent factors to smooth
>> the recommendation process.
>>
>>>
>>> I.e. if we always know some context A (in this case, static labels and
>>> dyadic ids) and only sometimes some context B, then assuming neutral values
>>> for context B if we are missing this data is invalid because we are actually
>>> substituting unknown data with made-up data.
>>
>> This is so abstract that I don't know what you are referring to really. Yes,
>> static characteristics will be used if they are available and latent factors
>> will be used if they are available.
>>
>>>
>>> Which is why SGD produces higher errors than necessary on sparsified label
>>> data. this is also the reason why SVD recommenders produce higher errors
>>> over sparse sample data as well (i think that's the consensus).
>>
>> I don't think I am part of that consensus.
>> SGD produces very low errors when used with sparse data. But it can also
>> use non-sparse features just as well. What do you mean by "higher errors than
>> necessary"? That lower error rates are possible with latent factor
>> techniques?
>>
>>>
>>> However, thinking in offline-ish mode, if we learn based on samples with A
>>> data, then freeze the learner and learn based on error between frozen
>>> learner for A and only the input that has context B, for learner B, then we
>>> are not making the mistake per above. At no point our learner takes any
>>> 'made-up' data.
>>
>> Are you talking about the alternating learning process in Menon and Elkan?
>>
>>>
>>> This whole notion is based on Bayesian inference process: what can you say
>>> if you only know A; and what correction would you make if you also knew B.
>>
>> ?!??
>> The process is roughly analogous to an EM algorithm, but not very.
>>
>>>
>>> Both papers do a corner case out of this: we have two types of data, A and
>>> B, and we learn A then freeze learner A, then learn B where available.
>>>
>>> But general case doesn't have to be A and B. Actually that's our case (our
>>> CEO calls it 'trunk-branch-leaf' case): We always know some context A, and
>>> sometimes B, and also sometimes we know all of A, B and some additional
>>> context C.
>>>
>>> so there's a case to be made to generalize the inference architecture:
>>> specify hierarchy and then learn A/B/C, SGD+loglinear, or whatever else.
>>
>> I think that these analogies are very strained.
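
To make the staged idea from the thread concrete (learn on context A, freeze that learner, then fit a correction for B using only the samples where B is actually observed, and so on for C), here is a minimal sketch. It is an illustration only, not the method from the Yahoo paper or from Menon and Elkan: each stage is a plain per-value mean of the residual, swapped in for SGD/log-linear learners to keep it short, and the field names `hour` and `weather` and the toy samples are made up for the example.

```python
from collections import defaultdict


def predict(stages, ctx):
    """Sum the offsets of every stage whose context field is present in ctx."""
    total = 0.0
    for key, offsets in stages:
        value = ctx.get(key)
        if value is not None:
            # A value never seen in training adds no correction.
            total += offsets.get(value, 0.0)
    return total


def fit_stage(samples, key, frozen_stages):
    """Fit mean residuals per value of `key`, against the frozen earlier stages.

    Samples where `key` is missing are skipped, so the learner never sees
    a made-up "neutral" value in place of missing data.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ctx, target in samples:
        value = ctx.get(key)
        if value is None:
            continue
        sums[value] += target - predict(frozen_stages, ctx)
        counts[value] += 1
    return {value: sums[value] / counts[value] for value in sums}


def fit_hierarchy(samples, keys):
    """Fit one stage per key, in order; earlier stages stay frozen."""
    stages = []
    for key in keys:
        offsets = fit_stage(samples, key, stages)
        stages.append((key, offsets))
    return stages


# Hypothetical toy data: the hour bucket is always known, weather only sometimes.
samples = [
    ({"hour": "am"}, 1.0),
    ({"hour": "am"}, 3.0),
    ({"hour": "pm", "weather": "rain"}, 5.0),
    ({"hour": "pm", "weather": "sun"}, 1.0),
]

stages = fit_hierarchy(samples, ["hour", "weather"])
print(predict(stages, {"hour": "pm"}))
print(predict(stages, {"hour": "pm", "weather": "rain"}))
```

Adding a third context (geo, say) would just mean appending another key to the list passed to `fit_hierarchy`; whether additive residual stages like this are a good enough model for a real problem is a separate question.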
