I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my problem here. Consider it a holiday mind exercise while avoiding relatives.
I send customer-uploaded videos to Amazon Elastic Transcoder to generate HTML5-ready video. Conversion takes anywhere from a few seconds to tens of minutes. I have no way to track progress, so I estimate the time to complete and show a fake progress indicator. So far I've estimated from the video's run-time alone, and that is not working well at all; high bit-rate (big file) videos fare the worst. My guess is that two main parameters drive conversion time: file size and run-time. File size should be a good proxy for the input processing, and run-time for the output processing. Amazon's conversion times have been pretty consistent in the short run.

I have tons of data in my CouchDB from previous conversions; for each one I know the file size, run-time, and conversion time. I want to do a regression analysis on these past runs to calculate the parameters for the estimate. I will use runLen * A + fileSize * B as the estimation formula, with A and B found by fitting runLen * A + fileSize * B = convTime over the samples. It would be nice to use a map-reduce so I always have the latest estimates of A and B, if possible.

My first thought was to just take the average of each of the three input variables and solve for A and B from those averages. However, I'm pretty sure this would yield the wrong result, because averaging collapses everything into a single equation in two unknowns; each sample needs to contribute to the fit independently (not sure). So I would like each map call to take one conversion sample and do the regression in the reduce. Can someone give me pointers on how to do this?
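For concreteness, here is a minimal sketch (in Python rather than CouchDB's JavaScript views) of the kind of thing I'm imagining, assuming a no-intercept least-squares fit: each "map" emits five running sums per sample, the "reduce" just adds them component-wise (so re-reduce works), and A and B fall out of the 2x2 normal equations at the end. The sample numbers are made up.

```python
# Sketch: two-parameter least squares through the origin, written as
# running sums so it fits a map/reduce model. Variable names
# (runLen, fileSize, convTime) match the formula above; the data is fake.

def map_one(run_len, file_size, conv_time):
    """Per-document 'map': emit the five sufficient statistics."""
    return (run_len * run_len,
            run_len * file_size,
            file_size * file_size,
            run_len * conv_time,
            file_size * conv_time)

def reduce_sums(emitted):
    """'Reduce': component-wise sum -- associative, so it can be
    re-reduced incrementally as new conversions arrive."""
    return tuple(sum(col) for col in zip(*emitted))

def solve_ab(sums):
    """Solve the 2x2 normal equations for A and B (Cramer's rule)."""
    s11, s12, s22, s1y, s2y = sums
    det = s11 * s22 - s12 * s12
    A = (s1y * s22 - s2y * s12) / det
    B = (s11 * s2y - s12 * s1y) / det
    return A, B

# Fabricated samples where convTime = 2*runLen + 0.5*fileSize exactly,
# so the fit should recover A = 2.0 and B = 0.5.
samples = [(60, 100, 170), (300, 50, 625), (10, 400, 220)]
sums = reduce_sums(map_one(*s) for s in samples)
A, B = solve_ab(sums)
print(A, B)  # -> 2.0 0.5
```

The key point would be that the regression never needs the individual samples at solve time, only those five sums, which is exactly the associative shape a CouchDB reduce (with rereduce) wants.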
