Hi, we just keep them in HDFS: one timestamped directory per model, plus a meta file gathering some metrics like AUC, number of training examples, and class distribution. This makes it easy to generate reports from it on the fly, whereas that would be very hard with git (plus there is no added value).
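
To make the layout concrete, here is a rough Python sketch of the idea, not our actual code. It uses a local directory as a stand-in for HDFS, and the names `save_model`, `report`, `model.bin`, and `meta.json` are made up for illustration.

```python
import json
import time
from pathlib import Path


def save_model(root, model_bytes, metrics, timestamp=None):
    """Write one serialized model plus its metrics into a new timestamped directory.

    `root` is a local path standing in for an HDFS directory (assumption).
    """
    # A sortable timestamp keeps directory order equal to chronological order.
    ts = timestamp if timestamp is not None else time.strftime("%Y%m%d-%H%M%S")
    model_dir = Path(root) / ts
    # exist_ok=False: never silently overwrite an older model.
    model_dir.mkdir(parents=True, exist_ok=False)
    (model_dir / "model.bin").write_bytes(model_bytes)
    (model_dir / "meta.json").write_text(json.dumps(metrics, indent=2))
    return model_dir


def report(root):
    """Return (timestamp, metrics) pairs for every stored model, oldest first."""
    rows = []
    for meta in sorted(Path(root).glob("*/meta.json")):
        rows.append((meta.parent.name, json.loads(meta.read_text())))
    return rows
```

Calling `report(root)` then gives a list you can print or plot to see, for example, AUC over time without loading any of the models themselves.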
This might not be the best solution but it's a cheap way to see model performance over time and better than no history.

On Thu, Jul 18, 2013 at 7:15 AM, Ted Dunning <[email protected]> wrote:

> Keeping old models is one thing. Keeping track of exactly which data you
> trained with is another thing.
>
> Since you often need access to both old and new models at the same time, it
> is common to simply burn a serial number into the file containing the model
> and simply keep all of them. You need to keep a record as well of which
> model resulted from which data using which build of the training software.
>
> This leads to the question of how you make sure you know what training data
> you used. If your data is relatively small, then making a copy is a fine
> idea. As your input data gets bigger and in a production setting where
> data is coming in all the time, you may find that you need to start using
> something like a snapshot so that you don't actually use n times the
> storage for your data and so that you get an exact moment in time for the
> training data.
>
>
> On Wed, Jul 17, 2013 at 7:29 PM, Lee, Howon <[email protected]>
> wrote:
>
> > Hey, I'm planning to make some sgd logistic regression models, serialize
> > them to save them and test my programs with these models.
> >
> > It seems pretty terrible to check them into my version control, because
> > they're binaries.
> >
> > Is there a good way to keep track of versions of my models, revert them,
> > etc, even though they're serialized and stuff? I've been thinking about
> > making a separate repo just for these models. Does anybody have any
> > experience and/or advice in this matter?
> >
