On our site we plan to use logistic regression in a batch manner: customers who entered in one time frame (such as 2010/1/1 ~ 2010/12/31) will be used to train the model, customers who entered in a later time frame (such as 2011/1/1 ~ 2011/5/31) will be used to validate it, and the model will then be used to predict for users entering after 2011/6/1. Does this make sense, or should we feed all the data from 2010/1/1 to 2011/5/31 to ALR and let it do the hold-out internally?
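For concreteness, a minimal sketch of that batch workflow using Mahout's OnlineLogisticRegression and Auc classes is below. CustomerRecord, recordsBetween(), recordsAfter(), encode() and NUM_FEATURES are hypothetical stand-ins for our own data access and feature-encoding code, and the hyperparameters are placeholders, not recommendations:

import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class TimeSplitTraining {
  private static final int NUM_FEATURES = 1000;  // assumed feature-vector size

  public static void main(String[] args) {
    // Two-category logistic regression with L1 regularization.
    OnlineLogisticRegression lr =
        new OnlineLogisticRegression(2, NUM_FEATURES, new L1())
            .learningRate(1)
            .lambda(1e-4);

    // Train on customers who entered during 2010.
    for (CustomerRecord r : recordsBetween("2010-01-01", "2010-12-31")) {
      lr.train(r.label(), encode(r));
    }

    // Validate on the later, time-segregated window.
    Auc auc = new Auc();
    for (CustomerRecord r : recordsBetween("2011-01-01", "2011-05-31")) {
      auc.add(r.label(), lr.classifyScalar(encode(r)));
    }
    System.out.println("hold-out AUC = " + auc.auc());

    // If the hold-out AUC is acceptable, score customers entering after 2011-06-01.
    for (CustomerRecord r : recordsAfter("2011-06-01")) {
      double p = lr.classifyScalar(encode(r));  // probability of the positive class
      // ... act on p ...
    }
  }

  // Hypothetical helpers; replace with the site's own data access and encoding.
  interface CustomerRecord { int label(); }
  private static Iterable<CustomerRecord> recordsBetween(String from, String to) {
    throw new UnsupportedOperationException("site-specific");
  }
  private static Iterable<CustomerRecord> recordsAfter(String from) {
    throw new UnsupportedOperationException("site-specific");
  }
  private static Vector encode(CustomerRecord r) {
    throw new UnsupportedOperationException("site-specific");
  }
}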
On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <[email protected]> wrote:
> You don't *have* to have a separate validation set, but it isn't a bad idea.
>
> In particular, with large-scale classifiers, production data almost always
> comes from the future with respect to the training data. The ALR can't hold
> out that way because it does on-line training only. Thus, I would recommend
> that you still have some kind of evaluation hold-out set segregated by time.
>
> Another very serious issue can happen if you have near duplicates in your
> data set. That often happens in news-wire text, for example. In that case,
> you would have significant over-fitting with ALR and you wouldn't have a
> clue without a real time-segregated hold-out set.
>
> On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <[email protected]> wrote:
>
>> Hi,
>>
>> Because ALR splits the training data internally and automatically, I
>> think we don't have to make a separate validation data set.
>>
>> Regards,
>>
>> Xiaobo Gu
>>
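A minimal sketch of the alternative Ted is addressing, reusing the hypothetical CustomerRecord, recordsBetween(), encode() and NUM_FEATURES from the sketch above (the body below would replace main(); AdaptiveLogisticRegression, CrossFoldLearner and L1 come from org.apache.mahout.classifier.sgd, Auc from org.apache.mahout.classifier.evaluation). All of the earlier window goes to AdaptiveLogisticRegression, which evaluates itself on random internal folds, but the later window is still kept out of training and scored separately, since those internal folds are not segregated by time:

// Feed the earlier window to ALR; it runs its own internal cross-fold
// hold-out, but only on randomly split data from this same window.
AdaptiveLogisticRegression alr =
    new AdaptiveLogisticRegression(2, NUM_FEATURES, new L1());
for (CustomerRecord r : recordsBetween("2010-01-01", "2010-12-31")) {
  alr.train(r.label(), encode(r));
}
alr.close();

// Still score a time-segregated hold-out that ALR never saw.
// Note: getBest() can return null if very little data has been trained on.
CrossFoldLearner best = alr.getBest().getPayload().getLearner();
Auc auc = new Auc();
for (CustomerRecord r : recordsBetween("2011-01-01", "2011-05-31")) {
  auc.add(r.label(), best.classifyScalar(encode(r)));
}
System.out.println("time-segregated AUC = " + auc.auc());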
