The best test really comes from real users: A/B test different recommenders and see which performs better. That's not always practical, though.
The problem is that you don't even know what the best recommendations are. Splitting the data by date is reasonable, but the most recent items aren't necessarily the most-liked. Splitting by rating is better on that point, but you still can't conclude that there aren't better recommendations among the un-rated items. Still, it ought to correlate. I think you will find precision/recall are very low in most cases, often a few percent, and the results are noisy.

AUC tells you where all of those "best recommendations" in the test set fell in the ranked list, rather than only measuring performance over the top N. That tells you more, and I think that's generally good. However, it measures performance over the entire list of recs, when you are unlikely to use more than the top N. Go ahead and use it, since there's not much better you can do in the lab, but be aware of these issues.
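To make the difference between the two metrics concrete, here is a minimal sketch (plain Python, with made-up item IDs and function names, not tied to any particular recommender library) that scores one user's ranked recommendation list against a held-out set of "liked" items, computing precision/recall at N and the AUC-style rank statistic described above:

```python
def precision_recall_at_n(ranked_items, held_out, n):
    """Precision/recall of the top-n recommendations against held-out liked items."""
    top_n = ranked_items[:n]
    hits = sum(1 for item in top_n if item in held_out)
    precision = hits / n
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall

def auc(ranked_items, held_out):
    """Probability that a held-out item is ranked above a non-held-out item.

    Unlike precision@n, this looks at where every held-out item landed in
    the full ranked list, not just in the top n.
    """
    positives = [i for i, item in enumerate(ranked_items) if item in held_out]
    negatives = [i for i, item in enumerate(ranked_items) if item not in held_out]
    if not positives or not negatives:
        return 0.5  # undefined in this case; fall back to chance
    # Count pairs where a held-out item is ranked earlier than a non-held-out item.
    better = sum(1 for p in positives for q in negatives if p < q)
    return better / (len(positives) * len(negatives))

# Toy example: a 10-item ranked list, 3 held-out "liked" items.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
liked = {"b", "g", "j"}
print(precision_recall_at_n(ranked, liked, 3))  # (0.333..., 0.333...)
print(auc(ranked, liked))                       # how high the liked items sit overall
```

Note how precision@3 only "sees" one of the three liked items, while AUC is affected by the fact that "j" sits at the bottom of the list, which is the trade-off described above.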
