Hi all,

Dmitriy, I guess you are talking about this paper by Andrea Montanari, am I correct?
Matrix Completion from Noisy Entries.
http://arxiv.org/abs/0906.2027v1

2011/2/3 Dmitriy Lyubimov <[email protected]>

> I am basically retracing a generalization of the Bayesian inference
> problem given in the Yahoo paper. I am too lazy to go back for a quote.
>
> The SVD problem was discussed at meetups. Basically, the criticism
> here is that for an RxC matrix, whenever there's a missing measurement,
> one can't specify 'no measurement' but rather has to leave it at some
> neutral value (0? average?), which is essentially nothing but noise
> since it's not a sample. As one guy from Stanford demonstrated on
> Netflix data, the whole system collapses very quickly after a certain
> threshold of sample sparsity is reached.
>
> On Wed, Feb 2, 2011 at 7:20 PM, Ted Dunning <[email protected]> wrote:
> > Dmitriy,
> > I am not clear what you are saying entirely, but as far as I can
> > understand your points, I think I disagree. Of course, if I don't
> > catch your drift, I might be wrong and we might be in agreement.
> >
> > On Wed, Feb 2, 2011 at 2:43 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>
> >> Both Elkan's work and Yahoo's paper are based on the notion (which is
> >> confirmed by SGD experience) that if we try to substitute missing data
> >> with neutral values, the whole learning falls apart. Sort of.
> >
> > I don't see why you say that. Elkan and Yahoo want to avoid the cold
> > start process by using user and item offsets and by using latent
> > factors to smooth the recommendation process.
> >
> >>
> >> I.e. if we always know some context A (in this case, static labels and
> >> dyadic ids) and only sometimes some context B, then assuming neutral
> >> values for context B if we are missing this data is invalid because we
> >> are actually substituting unknown data with made-up data.
> >
> > This is so abstract that I don't know what you are referring to, really.
> > Yes, static characteristics will be used if they are available and
> > latent factors will be used if they are available.
> >
> >>
> >> Which is why SGD produces higher errors than necessary on sparsified
> >> label data. This is also the reason why SVD recommenders produce
> >> higher errors over sparse sample data as well (I think that's the
> >> consensus).
> >
> > I don't think I am part of that consensus.
> > SGD produces very low errors when used with sparse data. But it can
> > also use non-sparse features just as well. What do you mean by "higher
> > errors than necessary"? That lower error rates are possible with
> > latent factor techniques?
> >
> >>
> >> However, thinking in offline-ish mode, if we learn based on samples
> >> with A data, then freeze the learner and learn based on the error
> >> between the frozen learner for A and only the input that has context
> >> B, for learner B, then we are not making the mistake per above. At no
> >> point does our learner take any 'made-up' data.
> >
> > Are you talking about the alternating learning process in Menon and
> > Elkan?
> >
> >>
> >> This whole notion is based on the Bayesian inference process: what can
> >> you say if you only know A; and what correction would you make if you
> >> also knew B.
> >
> > ?!??
> > The process is roughly analogous to an EM algorithm, but not very.
> >
> >>
> >> Both papers do a corner case out of this: we have two types of data, A
> >> and B, and we learn A, then freeze learner A, then learn B where
> >> available.
> >>
> >> But the general case doesn't have to be A and B. Actually that's our
> >> case (our CEO calls it the 'trunk-branch-leaf' case): we always know
> >> some context A, and sometimes B, and also sometimes we know all of A,
> >> B and some additional context C.
> >>
> >> So there's a case to be made to generalize the inference architecture:
> >> specify the hierarchy and then learn A/B/C, SGD+loglinear, or whatever
> >> else.
> >
> > I think that these analogies are very strained.
> > > > >
