Yes, I was referring to Andrea Montanari. My apologies for the "guy from Stanford" reference. I wasn't aware of the paper, but I attended his talk about this work, and it was quite informative.
On Wed, Feb 2, 2011 at 11:49 PM, Federico Castanedo <[email protected]> wrote:

> Hi all,
>
> Dmitriy, I guess you are talking about this paper of Andrea Montanari,
> am I correct?
>
> Matrix Completion from Noisy Entries. http://arxiv.org/abs/0906.2027v1
>
> 2011/2/3 Dmitriy Lyubimov <[email protected]>
>
> > I am basically retracing a generalization of the Bayesian inference
> > problem given in the Yahoo paper. I am too lazy to go back for a quote.
> >
> > The SVD problem was discussed at meetups. Basically, the criticism
> > here is that for an RxC matrix, whenever there's a missing measurement,
> > one can't specify 'no measurement' but rather has to leave it at some
> > neutral value (0? average?), which is essentially nothing but noise
> > since it's not a sample. As one guy from Stanford demonstrated on
> > Netflix data, the whole system collapses very quickly after a certain
> > threshold of sample sparsity is reached.
> >
> > On Wed, Feb 2, 2011 at 7:20 PM, Ted Dunning <[email protected]> wrote:
> > > Dmitriy,
> > > I am not clear what you are saying entirely, but as far as I can
> > > understand your points, I think I disagree. Of course, if I don't
> > > catch your drift, I might be wrong and we might be in agreement.
> > >
> > > On Wed, Feb 2, 2011 at 2:43 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > >>
> > >> both Elkan's work and Yahoo's paper are based on the notion (which is
> > >> confirmed by SGD experience) that if we try to substitute missing data
> > >> with neutral values, the whole learning falls apart. Sort of.
> > >
> > > I don't see why you say that. Elkan and Yahoo want to avoid the cold
> > > start process by using user and item offsets and by using latent
> > > factors to smooth the recommendation process.
> > >
> > >>
> > >> I.e. if we always know some context A (in this case, static labels and
> > >> dyadic ids) and only sometimes some context B, then assuming neutral
> > >> values for context B if we are missing this data is invalid because we
> > >> are actually substituting unknown data with made-up data.
> > >
> > > This is so abstract that I don't know what you are referring to,
> > > really. Yes, static characteristics will be used if they are available
> > > and latent factors will be used if they are available.
> > >
> > >>
> > >> Which is why SGD produces higher errors than necessary on sparsified
> > >> label data. This is also the reason why SVD recommenders produce higher
> > >> errors over sparse sample data as well (I think that's the consensus).
> > >
> > > I don't think I am part of that consensus.
> > > SGD produces very low errors when used with sparse data. But it can
> > > also use non-sparse features just as well. What do you mean by "higher
> > > errors than necessary"? That lower error rates are possible with
> > > latent factor techniques?
> > >
> > >>
> > >> However, thinking in offline-ish mode, if we learn based on samples
> > >> with A data, then freeze the learner and learn based on the error
> > >> between the frozen learner for A and only the input that has context B,
> > >> for learner B, then we are not making the mistake per above. At no
> > >> point does our learner take any 'made-up' data.
> > >
> > > Are you talking about the alternating learning process in Menon and
> > > Elkan?
> > >
> > >>
> > >> This whole notion is based on the Bayesian inference process: what can
> > >> you say if you only know A; and what correction would you make if you
> > >> also knew B.
> > >
> > > ?!??
> > > The process is roughly analogous to an EM algorithm, but not very.
> > >
> > >>
> > >> Both papers do a corner case out of this: we have two types of data, A
> > >> and B, and we learn A, then freeze learner A, then learn B where
> > >> available.
> > >>
> > >> But the general case doesn't have to be A and B. Actually that's our
> > >> case (our CEO calls it the 'trunk-branch-leaf' case): We always know
> > >> some context A, and sometimes B, and also sometimes we know all of A, B
> > >> and some additional context C.
> > >>
> > >> so there's a case to be made to generalize the inference architecture:
> > >> specify the hierarchy and then learn A/B/C, SGD+loglinear, or whatever
> > >> else.
> > >
> > > I think that these analogies are very strained.
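
For readers following the thread, here is a minimal sketch of the staged idea described in the quoted message: learn on context A from every sample, freeze that learner, then learn a correction from context B using only the rows where B was actually observed. This is an illustration under assumptions, not the method of either paper: it swaps in plain least squares (via NumPy) for SGD, and the names `fit_staged`, `predict_staged`, `has_b`, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_staged(A, B, has_b, y):
    """Stage 1: fit on context A (all rows). Stage 2: fit B on residuals."""
    # Learner A sees every sample, because context A is always known.
    w_a, *_ = np.linalg.lstsq(A, y, rcond=None)
    # Freeze learner A and compute what it fails to explain.
    resid = y - A @ w_a
    # Learner B trains only on rows where B was observed, so no
    # made-up "neutral" values ever enter the fit.
    w_b, *_ = np.linalg.lstsq(B[has_b], resid[has_b], rcond=None)
    return w_a, w_b

def predict_staged(A, B, has_b, w_a, w_b):
    """Predict from A alone, then add the B correction where B is known."""
    pred = A @ w_a
    pred[has_b] += B[has_b] @ w_b
    return pred

# Hypothetical synthetic data: B is observed for roughly 30% of samples.
rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=(n, 3))
B = rng.normal(size=(n, 2))
has_b = rng.random(n) < 0.3
y = A @ np.array([1.0, -2.0, 0.5]) + has_b * (B @ np.array([3.0, -1.0]))
B[~has_b] = np.nan  # missing stays missing; it is never imputed

w_a, w_b = fit_staged(A, B, has_b, y)
pred_a_only = A @ w_a
pred_staged = predict_staged(A, B, has_b, w_a, w_b)
mse_a_only = float(np.mean((y - pred_a_only) ** 2))
mse_staged = float(np.mean((y - pred_staged) ** 2))
print(f"A only: {mse_a_only:.4f}  staged A then B: {mse_staged:.4f}")
```

On the training data the staged error cannot exceed the A-only error, since a zero correction is always an admissible fit for stage 2. Extending the same pattern to a third context C (the trunk-branch-leaf case) would mean freezing both earlier learners and fitting C on the remaining residuals where C is observed.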
