I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my
problem here.  Consider it a holiday mind exercise while avoiding relatives.

I send customer-uploaded videos to Amazon Elastic Transcoder to generate a
video for HTML5 playback.  Conversion takes anywhere from a few seconds to
tens of minutes.  I have no way to track progress, so I estimate the time
to complete and show a fake progress indicator.  I have been estimating
from the run-time of the video alone, and this is not working well at all.
High bit-rate (big file) videos fare the worst.

I'm guessing there are two main parameters for estimating the conversion
time: the file size and the run-time.  The file size is a good proxy for
the input-processing cost, and the run-time is a good proxy for the
output-processing cost.  Amazon has been pretty consistent in its
conversion times in the short run.

I have tons of data in my CouchDB from previous conversions.  I want to do
a regression analysis of these past runs to calculate the parameters for
the estimate.  I know the file size, run-time, and conversion time for
each.

I will use runLen * A + fileSize * B as the estimation formula.  A and B
will be calculated by fitting runLen * A + fileSize * B = convTime to the
samples.  It would be nice to use a map-reduce to always have the latest
estimate of A and B, if possible.
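To make the fit concrete, here is a sketch of the least-squares version in
JavaScript (CouchDB's view language).  Minimizing the squared error of
runLen * A + fileSize * B - convTime over all samples gives two "normal
equations" that depend only on five running sums.  The function name fitAB
and the field names runLen/fileSize/convTime are placeholders of mine, not
anything from Amazon or CouchDB:

```javascript
// Least-squares fit of convTime ~ runLen*A + fileSize*B (no intercept).
// Setting the derivatives of the squared error to zero gives the
// normal equations:
//   A*S_rr + B*S_rf = S_rt
//   A*S_rf + B*S_ff = S_ft
// where S_rr = sum(runLen^2), S_rf = sum(runLen*fileSize),
// S_ff = sum(fileSize^2), S_rt = sum(runLen*convTime),
// S_ft = sum(fileSize*convTime).
function fitAB(samples) {
  var S = { rr: 0, rf: 0, ff: 0, rt: 0, ft: 0 };
  samples.forEach(function (s) {
    S.rr += s.runLen * s.runLen;
    S.rf += s.runLen * s.fileSize;
    S.ff += s.fileSize * s.fileSize;
    S.rt += s.runLen * s.convTime;
    S.ft += s.fileSize * s.convTime;
  });
  // Solve the 2x2 system by Cramer's rule.
  var det = S.rr * S.ff - S.rf * S.rf;
  return {
    A: (S.rt * S.ff - S.rf * S.ft) / det,
    B: (S.rr * S.ft - S.rf * S.rt) / det
  };
}
```

The key point for map-reduce is that the fit only ever touches those five
sums, never the individual samples.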

My first thought was to just take the average of each of the three input
variables and solve for A and B from those averages.  However, I'm pretty
sure this would yield the wrong result: averaging collapses all the data
into a single point, which leaves one equation in two unknowns, so A and B
aren't determined.  Each sample needs to contribute its own equation to
the fit.

So I would like to have each map take one conversion sample and do the
regression in the reduce.  Can someone give me pointers on how to do this?
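One way this could look, assuming the sufficient-statistics approach above:
the map emits each sample's contribution to the five sums, and the reduce
just adds the vectors component-wise.  Because CouchDB may call reduce on
already-reduced values (rereduce), the reduce must accept its own output,
which a component-wise sum does naturally.  This is a sketch with assumed
field names (runLen, fileSize, convTime), written as named functions so it
can be run outside CouchDB; in a design document they would be the
anonymous map and reduce bodies:

```javascript
// Map: for each conversion document, emit the per-sample terms of the
// five running sums needed by the normal equations.
// (emit() is supplied by CouchDB's view server.)
function map(doc) {
  if (doc.runLen && doc.fileSize && doc.convTime) {
    var r = doc.runLen, f = doc.fileSize, t = doc.convTime;
    emit(null, [r * r, r * f, f * f, r * t, f * t]);
  }
}

// Reduce: component-wise vector sum.  The same code handles rereduce,
// since partial sums have the same shape as emitted values.
function reduce(keys, values, rereduce) {
  var sums = [0, 0, 0, 0, 0];
  for (var i = 0; i < values.length; i++) {
    for (var j = 0; j < 5; j++) sums[j] += values[i][j];
  }
  return sums;
}
```

Querying the view with reduce=true would then return
[S_rr, S_rf, S_ff, S_rt, S_ft] for the whole database, and the client
solves the 2x2 system for A and B.  Since CouchDB updates views
incrementally, the sums stay current as new conversion documents arrive.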
