Hi Mark,

It may not be worth having a real-time estimate of the coefficients A, B if their variance is very small.
In other words, if your collection of past videos already covers most of the kinds of videos you are likely to encounter, then your estimates of A, B are likely to be pretty robust and won't change much with future samples, so using an older A, B is unlikely to throw your conversion-time predictions off by much. If the next sample that comes along is not going to change A, B much, you might as well update much less frequently - daily, weekly, whatever - via cron or batch updates.

How do you determine this? Here's a "back of the envelope", "seat of the pants" experiment. First, after calculating A, B using all of your video conversion times, do a second series of calculations. Start with, say, the first 20 videos, or roughly 30-50% of your videos, and calculate A, B. Then keep adding the next 5% and repeat the calculation of A, B until you have used up all the samples - but at the last step, add one video at a time for the last 10 videos while doing the calcs. (A rough sketch of this experiment is appended below, after the quoted message.)

Now look at A, B from each calculation. Do they settle down close to a "mean" A and a "mean" B? What is the variance around that mean A, B? If it is small or very small, then re-computing every time is "really cool and all" but not worth it computationally.

What is meant by "small" here? Take two successive estimates of A, B and do a prediction using A1, B1 and then A2, B2. How far off are you if you use the older estimate? If A, B don't vary much, then your predictions won't vary much either, and you could use a stale estimate without noticeable impact - noticeable meaning, say, more than 10% error in the prediction. In that case, just update A, B every day or week.

Bottom line: before you do a real-time update of the parameters, do a "back of the envelope" experiment to see whether it's worth the complexity and the extra point of failure it adds.

Happy to chat offlist and/or offline if you want - I'm nborwankar on the google email system.

Nitin

------------------------------------------------------------------
Nitin Borwankar
[email protected]

On Wed, Nov 27, 2013 at 1:13 PM, Mark Hahn <[email protected]> wrote:

> I'm not an expert on statistics (and I'm lazy), so I thought I'd pose my
> problem here. Consider it a holiday mind exercise while avoiding
> relatives.
>
> I send customer-uploaded videos to Amazon Elastic Transcoder to generate
> a video for html5 consumption. It takes from a few seconds up to tens of
> minutes to convert. I have no way to track progress, so I estimate the
> time to complete and show a fake progress indicator. I have been using
> the run-time of the video and this is not working well at all. High
> bit-rate (big file) videos fare the worst.
>
> I'm guessing there are two main parameters for estimating the conversion
> time: the file size and the run-time. The file size is a good estimate
> of input processing and the run-time is a good estimate of output
> processing. Amazon has been pretty consistent in their conversion times
> in the short run.
>
> I have tons of data in my couchdb from previous conversions. I want to
> do regression analysis of these past runs to calculate parameters for
> estimation. I know the file size, run-time, and conversion time for each.
>
> I will use runLen * A + fileSize * B as the estimation formula. A and B
> will be calculated by solving runLen * A + fileSize * B = convTime from
> the samples. It would be nice to use a map-reduce to always have the
> latest estimate of A and B, if possible.
>
> My first thought would be to just find the average of each of the three
> input vars and solve for A and B using those averages. However, I'm
> pretty sure this would yield the wrong result because each set of three
> samples needs to be used independently (not sure).
>
> So I would like to have each map take one conversion sample and do the
> regression in the reduce. Can someone give me pointers on how to do this?
>
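
Here is a rough sketch of the stability experiment described above. It assumes the per-video samples have already been pulled out of couchdb into a flat list of (runLen, fileSize, convTime) tuples; the function names and defaults are just illustrative, not a prescription.

    import numpy as np

    def fit_ab(samples):
        """Least-squares fit of convTime ~ A*runLen + B*fileSize (no intercept)."""
        X = np.array([[r, f] for r, f, _ in samples], dtype=float)
        y = np.array([c for _, _, c in samples], dtype=float)
        (A, B), *_ = np.linalg.lstsq(X, y, rcond=None)
        return A, B

    def stability_experiment(samples, start_frac=0.4, step_frac=0.05, tail=10):
        """Refit A, B on growing prefixes of the data and return each estimate."""
        n = len(samples)
        # grow in ~5% steps until the last `tail` samples ...
        sizes = list(range(max(2, int(n * start_frac)), n - tail,
                           max(1, int(n * step_frac))))
        # ... then one sample at a time for the last `tail` videos
        sizes += list(range(max(2, n - tail), n + 1))
        return [(k, *fit_ab(samples[:k])) for k in sizes]

    def prediction_spread(estimates, run_len, file_size):
        """How much the prediction for one video moves across the estimates."""
        preds = [A * run_len + B * file_size for _, A, B in estimates]
        return (max(preds) - min(preds)) / (sum(preds) / len(preds))

If prediction_spread comes back well under your 10% threshold for typical videos, a stale A, B is good enough and a daily or weekly batch refit will do.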

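On the map-reduce question in the quoted message: for a two-parameter, no-intercept fit, the regression needs only five running sums (sum of runLen^2, fileSize^2, runLen*fileSize, runLen*convTime, fileSize*convTime), so each map can emit those per video, the reduce just adds them, and A, B fall out of the 2x2 normal equations. That is also why averaging the three variables won't work: it collapses everything into one equation with two unknowns and throws away the cross terms. The sketch below shows only the arithmetic, not actual CouchDB view code (in CouchDB the map/reduce pair would be JavaScript view functions emitting and summing these five numbers, with the final solve done client-side); the field names are made up.

    def map_one(doc):
        """Per-video sufficient statistics for convTime = A*runLen + B*fileSize."""
        r, f, c = doc["runLen"], doc["fileSize"], doc["convTime"]
        return (r * r, f * f, r * f, r * c, f * c)

    def reduce_sums(stats):
        """Element-wise sum of the per-video tuples (partial sums can be re-summed)."""
        return tuple(sum(col) for col in zip(*stats))

    def solve_ab(sums):
        """Solve the 2x2 normal equations from the reduced sums."""
        srr, sff, srf, src, sfc = sums
        det = srr * sff - srf * srf
        A = (src * sff - sfc * srf) / det
        B = (sfc * srr - src * srf) / det
        return A, B

    # Usage:
    # totals = reduce_sums(map_one(d) for d in docs)
    # A, B = solve_ab(totals)
    # estimate = A * run_len + B * file_size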