I am basically retracing the generalization of the Bayesian inference problem given in the Yahoo paper. I am too lazy to go back for a quote.
The SVD problem was discussed at meetups. The criticism there is that for an R x C matrix, whenever a measurement is missing one can't specify 'no measurement' but instead has to leave the entry at some neutral value (0? the average?), which is essentially nothing but noise, since it isn't a sample. As one guy from Stanford demonstrated on Netflix data, the whole system collapses very quickly after a certain threshold of sample sparsity is reached.

On Wed, Feb 2, 2011 at 7:20 PM, Ted Dunning <[email protected]> wrote:
> Dmitriy,
> I am not clear what you are saying entirely, but as far as I can understand
> your points, I think I disagree. Of course, if I don't catch your drift, I
> might be wrong and we might be in agreement.
>
> On Wed, Feb 2, 2011 at 2:43 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> both Elkan's work and Yahoo's paper are based on the notion (which is
>> confirmed by SGD experience) that if we try to substitute missing data with
>> neutral values, the whole learning falls apart. Sort of.
>
> I don't see why you say that. Elkan and Yahoo want to avoid the cold start
> process by using user and item offsets and by using latent factors to smooth
> the recommendation process.
>
>>
>> I.e. if we always know some context A (in this case, static labels and
>> dyadic ids) and only sometimes some context B, then assuming neutral values
>> for context B when we are missing this data is invalid, because we are
>> actually substituting unknown data with made-up data.
>
> This is so abstract that I don't know what you are referring to, really.
> Yes, static characteristics will be used if they are available, and latent
> factors will be used if they are available.
>
>>
>> Which is why SGD produces higher errors than necessary on sparsified label
>> data. This is also the reason why SVD recommenders produce higher errors
>> over sparse sample data as well (I think that's the consensus).
>
> I don't think I am part of that consensus.
> SGD produces very low errors when used with sparse data. But it can also
> use non-sparse features just as well. What do you mean, "higher errors than
> necessary"? That lower error rates are possible with latent factor
> techniques?
>
>>
>> However, thinking in offline-ish mode: if we learn based on samples with A
>> data, then freeze that learner, and then train learner B on the error
>> between the frozen learner for A and only the input that has context B,
>> then we are not making the mistake described above. At no point does our
>> learner take any 'made-up' data.
>
> Are you talking about the alternating learning process in Menon and Elkan?
>
>>
>> This whole notion is based on the Bayesian inference process: what can you
>> say if you only know A, and what correction would you make if you also
>> knew B.
>
> ?!??
> The process is roughly analogous to an EM algorithm, but not very.
>
>>
>> Both papers make a corner case out of this: we have two types of data, A
>> and B, and we learn A, then freeze learner A, then learn B where available.
>>
>> But the general case doesn't have to be just A and B. Actually that's our
>> case (our CEO calls it the 'trunk-branch-leaf' case): we always know some
>> context A, sometimes also B, and sometimes we know all of A, B and some
>> additional context C.
>>
>> So there's a case to be made for generalizing the inference architecture:
>> specify the hierarchy and then learn A/B/C, with SGD + log-linear, or
>> whatever else.
>
> I think that these analogies are very strained.
>
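To make the freeze-then-residual scheme concrete, here's a minimal numpy sketch (synthetic data, hypothetical names; not code from either paper): learner A is fit on all samples using only the always-known context, it is then frozen, and learner B is fit only on the samples that actually carry B context, against the residual of A's prediction. Neither stage ever sees a made-up neutral value for the missing context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: context A is always observed, context B only sometimes.
n = 1000
XA = rng.normal(size=(n, 3))          # always-known features (context A)
XB = rng.normal(size=(n, 2))          # sometimes-known features (context B)
has_B = rng.random(n) < 0.4           # B is observed for ~40% of samples

y = (XA @ np.array([1.0, -2.0, 0.5])
     + XB @ np.array([3.0, 1.5])
     + 0.1 * rng.normal(size=n))

# Stage 1: fit learner A on ALL samples, using only context A.
wA, *_ = np.linalg.lstsq(XA, y, rcond=None)

# Stage 2: freeze learner A; fit learner B only where B exists,
# against the residual of A's prediction. No neutral fill-in values.
resid = y[has_B] - XA[has_B] @ wA
wB, *_ = np.linalg.lstsq(XB[has_B], resid, rcond=None)

# Prediction: A's output, plus B's correction when B is available.
def predict(xa, xb=None):
    p = xa @ wA
    if xb is not None:
        p = p + xb @ wB
    return p

# B's correction should cut the error on the B-observed samples.
err_A_only = np.mean((y[has_B] - XA[has_B] @ wA) ** 2)
err_staged = np.mean((y[has_B] - predict(XA[has_B], XB[has_B])) ** 2)
print(err_staged < err_A_only)  # → True
```

This is just the two-level corner case from the papers; my reading of the trunk-branch-leaf generalization is that stage 2 would simply repeat down the hierarchy, e.g. a learner C fit on the residual of A+B over the samples where C is observed.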
