Points from across several e-mails --

The initial item-feature matrix can be just random unit vectors too. I have had slightly better results with that.
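(For concreteness, here's a minimal numpy sketch of that initialization. The function name and shapes are just illustrative, not from any particular codebase:)

    import numpy as np

    def init_item_features(n_items, k, seed=None):
        # Illustrative helper: one random unit vector per item (row).
        rng = np.random.default_rng(seed)
        M = rng.standard_normal((n_items, k))
        # Normalize each row to unit length.
        M /= np.linalg.norm(M, axis=1, keepdims=True)
        return M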
You are finding the least-squares solution of A = U M' for U, given A and M. Yes, you can derive that analytically as the zero of the derivative of the error function. (A rough numpy sketch of this solve is tacked onto the end of this message.)

With m users, n items, and k features, where k = n, I suppose you don't need any iterations at all, since there is a trivial solution: U = A, M = I(n) (the n x n identity matrix). You would not find this on the first iteration, however, if you followed the algorithm, because you would be starting from some random starting point. But if you initialized M to the identity matrix, then yes, you'd find the exact solution immediately.

Yes, it is an iterative algorithm, and you have to define a convergence criterion. I use the average absolute difference in the entries of U M' from one iteration to the next (well, over a sample of entries; also sketched at the end). It's possible that you reach your criterion in one iteration, or not; it depends on the criterion. Usually, when you restart ALS on updated data, you use the previous M matrix as a starting point. If the change in the data is small, one iteration should usually leave you still "converged", actually. From a random starting point, though, that's not typical.

ALS is similar to the SVD only in broad terms. The SVD is not always used to make a low-rank factorization, and its outputs carry more guarantees -- they are orthonormal bases, because the scaling factors have been pulled out into the third matrix, Sigma. I think of ALS as more simplistic and therefore possibly faster. With k features, I believe (?) the SVD is necessarily at least a k-iteration process, whereas ALS might be of use after one iteration. The SVD is not a "shortcut" for ALS. If you go to the trouble of a full SVD, though, you can certainly use that factorization as-is.

You need regularization.

It should be pointed out that the "ALS" often spoken of here is not one that factors the input matrix A. There's a variant, which I have had good results with, for 'implicit' feedback. There, you are actually factoring the matrix P, where P_ij = 1 if A_ij != 0 and P_ij = 0 if A_ij == 0, and using the values in A as weights in the loss function. You're reconstructing "interacted or not" and using the input value as a confidence measure. (One user's solve in this variant is sketched at the end as well.) This "works" for ratings, although the interpretation in this variant doesn't line up with the nature of ratings. It works quite nicely for things like clicks, etc. (Much more can be said on this point.)

On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner <[email protected]> wrote:
> It's quite hard for me to get the mathematical concepts of the ALS
> recommenders. It would be great if someone could help me to figure out
> the details. This is my current status:
>
> 1. The item-feature (M) matrix is initialized using the average ratings
> and random values (explicit case)
>
> 2. The user-feature (U) matrix is solved using the partial derivative of
> the error function with respect to u_i (the row-vectors of U)
>
> Suppose we use as many features as there are items, and the error
> function does not use any regularization. Would U be solved within the
> first iteration? If not, I do not understand why more than one iteration
> is needed.
> Furthermore, I believe to have understood that using fewer features than
> items, and also applying regularization, does not allow solving U in a
> way that the stopping criterion can be met after only one iteration.
> Thus, iteration is required to gradually converge to the stopping
> criterion.
>
> I hope I have pointed out my problems clearly enough.
>
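P.S. A few rough numpy sketches of the points above. These are illustrative only: dense, in-memory, and they treat every entry of A as observed, which a real ALS implementation would not. First, the analytic least-squares solve for U given A and M. Setting the derivative of the regularized squared error to zero gives the normal equations U = A M (M'M + lambda I)^{-1}:

    import numpy as np

    def solve_U(A, M, lam=0.0):
        # Minimize ||A - U M'||^2 + lam * ||U||^2 over U.
        # Zero of the derivative: U = A M (M'M + lam I)^{-1}.
        k = M.shape[1]
        G = M.T @ M + lam * np.eye(k)
        # Solve G U' = (A M)' rather than forming an explicit inverse.
        return np.linalg.solve(G, (A @ M).T).T

Note that with lam = 0, k = n, and M initialized to the identity, this returns U = A in one step -- the trivial exact solution mentioned above.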
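The convergence criterion I described -- average absolute change in a sample of U M' entries between iterations -- might look something like this (the sample size and tolerance are arbitrary choices):

    import numpy as np

    def converged(U_prev, M_prev, U, M, tol=1e-3, n_samples=1000, seed=0):
        # Mean absolute change in randomly sampled entries of U M'.
        rng = np.random.default_rng(seed)
        rows = rng.integers(0, U.shape[0], size=n_samples)
        cols = rng.integers(0, M.shape[0], size=n_samples)
        # (U M')[i, j] is the dot product of U's row i and M's row j.
        prev = np.einsum('ij,ij->i', U_prev[rows], M_prev[cols])
        curr = np.einsum('ij,ij->i', U[rows], M[cols])
        return np.mean(np.abs(curr - prev)) < tol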
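And one user's solve in the implicit-feedback variant: reconstruct p_u = (a_u != 0) with confidence weights c_u = 1 + alpha * a_u (the Hu/Koren/Volinsky formulation; the alpha and lam values here are arbitrary):

    import numpy as np

    def solve_user_implicit(a_u, M, lam=0.1, alpha=40.0):
        # a_u: this user's row of A (length n); M: n x k item-feature matrix.
        # Factor P, not A: reconstruct "interacted or not", with A as confidence.
        p = (a_u != 0).astype(float)
        c = 1.0 + alpha * a_u
        # Weighted normal equations: (M' C M + lam I) u = M' C p, C = diag(c).
        G = M.T @ (M * c[:, None]) + lam * np.eye(M.shape[1])
        return np.linalg.solve(G, M.T @ (c * p))

In practice you'd exploit the sparsity of A (precompute M'M once per iteration and add only the observed items' contributions per user), but the small dense version shows the idea.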
