I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my problem here. Consider it a holiday mind exercise while avoiding relatives.
I send customer-uploaded videos to Amazon Elastic Transcoder to generate HTML5-ready video. Conversion takes anywhere from a few seconds to tens of minutes. I have no way to track progress, so I estimate the time to complete and show a fake progress indicator. So far I've estimated from the video's run-time alone, and that is not working well at all; high bit-rate (big file) videos fare the worst. My guess is that two main parameters drive conversion time: file size and run-time. File size should be a good proxy for the input processing, and run-time for the output processing. Amazon's conversion times have been pretty consistent in the short run.

I have tons of data in my CouchDB from previous conversions; for each one I know the file size, run-time, and conversion time. I want to do a regression analysis on these past runs to calculate the parameters for the estimate. I will use runLen * A + fileSize * B as the estimation formula, with A and B found by fitting runLen * A + fileSize * B = convTime over the samples. It would be nice to use a map-reduce so I always have the latest estimates of A and B, if possible.

My first thought was to just take the average of each of the three input variables and solve for A and B from those averages. However, I'm pretty sure this would yield the wrong result, because averaging collapses everything into a single equation in two unknowns; each sample needs to contribute to the fit independently (not sure). So I would like each map call to take one conversion sample and do the regression in the reduce. Can someone give me pointers on how to do this?
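For concreteness, here is a minimal sketch (in Python rather than CouchDB's JavaScript views) of the kind of thing I'm imagining, assuming a no-intercept least-squares fit: each "map" emits five running sums per sample, the "reduce" just adds them component-wise (so re-reduce works), and A and B fall out of the 2x2 normal equations at the end. The sample numbers are made up.

```python
# Sketch: two-parameter least squares through the origin, written as
# running sums so it fits a map/reduce model. Variable names
# (runLen, fileSize, convTime) match the formula above; the data is fake.

def map_one(run_len, file_size, conv_time):
    """Per-document 'map': emit the five sufficient statistics."""
    return (run_len * run_len,
            run_len * file_size,
            file_size * file_size,
            run_len * conv_time,
            file_size * conv_time)

def reduce_sums(emitted):
    """'Reduce': component-wise sum -- associative, so it can be
    re-reduced incrementally as new conversions arrive."""
    return tuple(sum(col) for col in zip(*emitted))

def solve_ab(sums):
    """Solve the 2x2 normal equations for A and B (Cramer's rule)."""
    s11, s12, s22, s1y, s2y = sums
    det = s11 * s22 - s12 * s12
    A = (s1y * s22 - s2y * s12) / det
    B = (s11 * s2y - s12 * s1y) / det
    return A, B

# Fabricated samples where convTime = 2*runLen + 0.5*fileSize exactly,
# so the fit should recover A = 2.0 and B = 0.5.
samples = [(60, 100, 170), (300, 50, 625), (10, 400, 220)]
sums = reduce_sums(map_one(*s) for s in samples)
A, B = solve_ab(sums)
print(A, B)  # -> 2.0 0.5
```

The key point would be that the regression never needs the individual samples at solve time, only those five sums, which is exactly the associative shape a CouchDB reduce (with rereduce) wants.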
