Keeping old models is one thing. Keeping track of exactly which data you trained with is another thing.
Since you often need access to both old and new models at the same time, it is common to burn a serial number into the file containing the model and simply keep all of them. You also need to keep a record of which model resulted from which data using which build of the training software.

This leads to the question of how you make sure you know what training data you used. If your data is relatively small, then making a copy is a fine idea. As your input data gets bigger, and in a production setting where data is coming in all the time, you may find that you need to start using something like a snapshot, so that you don't actually use n times the storage for your data and so that you get an exact moment in time for the training data.

On Wed, Jul 17, 2013 at 7:29 PM, Lee, Howon <[email protected]> wrote:

> Hey, I'm planning to make some sgd logistic regression models, serialize
> them to save them and test my programs with these models.
>
> It seems pretty terrible to check them into my version control, because
> they're binaries.
>
> Is there a good way to keep track of versions of my models, revert them,
> etc, even though they're serialized and stuff? I've been thinking about
> making a separate repo just for these models. Does anybody have any
> experience and/or advice in this matter?
>
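
For the record-keeping described in the reply above (which model came from which data and which build of the training software), here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names `sha256_of` and `record_model`, the JSON-lines manifest layout, and the use of a SHA-256 hash to identify the training data are hypothetical choices, not something taken from this thread or from any particular library. The serial number here is kept in the manifest rather than inside the model file.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large training sets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_model(manifest: Path, model: Path, data: Path, build: str) -> dict:
    """Append one manifest line tying a model to its data and trainer build.

    Hypothetical layout: one JSON object per line, with the serial number
    being the 1-based position of the entry in the manifest.
    """
    lines = manifest.read_text().splitlines() if manifest.exists() else []
    entry = {
        "serial": len(lines) + 1,
        "model_file": model.name,
        "model_sha256": sha256_of(model),
        "data_sha256": sha256_of(data),
        "trainer_build": build,
    }
    with manifest.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

The manifest is a small text file, so it can live in ordinary version control, while the serialized models and the data copies or snapshots are kept elsewhere and identified by hash. When the data is held in a snapshot, recording the snapshot's identifier in place of (or alongside) the data hash would serve the same purpose.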
