For the UR we use scripts for this instead of the Evaluation APIs, which are fairly limited and do not do what we want, namely hyper-parameter search. Some of the parameters we vary require the model to be re-created, while other tests only vary query parameters. All of this is only possible with scripts that control the whole system from the outside.
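For concreteness, here is a minimal sketch (not an official UR tool) of the kind of outside-the-system script meant above, assuming a standard PredictionIO/UR install where `pio train` re-creates the model from the current engine.json. The candidate parameter values, the engine.json path, and the evaluate_map_at_k() placeholder are hypothetical and stand in for whatever metric script you already run.

import json
import subprocess

# Hypothetical example: sweep one UR algorithm parameter; substitute your own.
CANDIDATE_PARAMS = [{"maxCorrelatorsPerEventType": n} for n in (25, 50, 100)]

def evaluate_map_at_k():
    """Placeholder: plug in your own cross-validation metric here,
    e.g. MAP@k against a held-out date range as described in the thread."""
    raise NotImplementedError

def run_trial(params, engine_json="engine.json"):
    """Rewrite engine.json with this trial's algorithm params, retrain, score."""
    with open(engine_json) as f:
        engine = json.load(f)
    engine["algorithms"][0]["params"].update(params)
    with open(engine_json, "w") as f:
        json.dump(engine, f, indent=2)
    # Parameters like these require the model to be re-created, hence a full retrain.
    subprocess.run(["pio", "train"], check=True)
    # Redeploy/restart the engine server here if your setup needs it before querying.
    return evaluate_map_at_k()

if __name__ == "__main__":
    scores = {json.dumps(p): run_trial(p) for p in CANDIDATE_PARAMS}
    print("best:", max(scores, key=scores.get))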
On Nov 24, 2017, at 9:42 AM, Pat Ferrel <[email protected]> wrote:

Yes, this is what we do. We split by date into 10-90 or 20-80. The metric we use is MAP@k for precision, and as a proxy for recall we look at the % of people in the test set that get recs (turn off popularity backfill, or everyone will get some kind of recs, if only popular ones). The more independent events you have in the data, the larger your recall number will be. Expect small precision numbers; they are averages, but larger is better. Do not use them to compare different algorithms; only A/B tests work for that, no matter what the academics do. Use your cross-validation scores to compare tunings. Start with the defaults for everything as your baseline and tune from there.

On Nov 24, 2017, at 12:54 AM, Andy Rao <[email protected]> wrote:

Hi,

I have successfully trained our rec model using the Universal Recommender, but I do not know how to evaluate the trained model. The first idea that comes to mind is to split our dataset into train and test sets and then evaluate with recall metrics, but I'm not sure whether this is a good idea or not. Any help or suggestions are much appreciated.

Hongyao
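A minimal sketch of the two numbers described above: MAP@k over the test split, plus the "proxy recall", i.e. the fraction of test users who get any recs at all (run with popularity backfill off). The recs and heldout inputs are hypothetical, not part of the UR: dicts mapping user id to a ranked list of recommended item ids and to the set of items that user actually touched in the test period.

def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: precision accumulated at each hit within the top k."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)

def map_at_k(recs, heldout, k=10):
    """Mean AP@k over all users present in the held-out (test) split."""
    users = list(heldout)
    return sum(average_precision_at_k(recs.get(u, []), heldout[u], k)
               for u in users) / max(len(users), 1)

def proxy_recall(recs, heldout):
    """Fraction of test users who got any recs (popularity backfill turned off)."""
    users = list(heldout)
    return sum(1 for u in users if recs.get(u)) / max(len(users), 1)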
