Probably a question for Sebastian. As we know, the two papers (Hu-Koren-Volinsky and Zhou et al.) use slightly different loss functions.
Zhou et al. are fairly unusual in that they additionally multiply the norm of the U, V vectors by the number of observed interactions. The paper doesn't explain why this works, beyond something along the lines of "we tried several regularization matrices, and this one worked better in our case". I tried to figure out why that is, and I'm still not sure why it would be better. So basically, by giving smaller observation sets smaller regularization values, we're saying it's OK for smaller observation sets to overfit slightly more than larger ones. This seems counterintuitive: intuition tells us smaller sets would tend to overfit more, not less, and therefore might call for a larger regularization rate, not a smaller one. Sebastian, what's your take on weighted regularization in ALS-WR? Thanks. -d
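For concreteness, here's a small sketch of the two regularization terms as I understand them. Variable names, shapes, and the interaction counts are illustrative assumptions, not taken from either paper's reference code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3
U = rng.normal(size=(n_users, k))  # user factor vectors (illustrative)
V = rng.normal(size=(n_items, k))  # item factor vectors (illustrative)
lam = 0.1

# Assumed number of observed interactions per user and per item.
n_u = np.array([1, 2, 5, 10])
n_i = np.array([3, 1, 4, 6, 4])

# Plain L2 regularization (Hu-Koren-Volinsky style):
#   lam * (sum_u ||U_u||^2 + sum_i ||V_i||^2)
reg_plain = lam * ((U ** 2).sum() + (V ** 2).sum())

# Weighted-lambda regularization (Zhou et al., ALS-WR):
#   lam * (sum_u n_u * ||U_u||^2 + sum_i n_i * ||V_i||^2)
# i.e. each vector's penalty is scaled by its observation count.
reg_wr = lam * ((n_u * (U ** 2).sum(axis=1)).sum()
                + (n_i * (V ** 2).sum(axis=1)).sum())
```

Note that with all counts at least 1, the weighted penalty is never smaller in absolute terms; the question above is about the *relative* penalty per observation, which shrinks for users with few interactions.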
