Dale Scott wrote:
Without intentionally obfuscating, I have 128GB of data collected from an
experiment, roughly equivalent to a large set of 640x480 PNG images. Images
are independent and analyzed image-by-image by an image recognition
algorithm. I was thinking of dividing the set of images into sub-sets by a
scheduler and have a new EC2 instance analyze each sub-set.
You may find that replicating subsets of your data to the analyzing
instance is unnecessary.  If I want to process a set of documents in
parallel and it isn't important where they are processed, I write a view
function that assigns each document a random number from 1..n, e.g.:

function(doc) {
    var instances_count = 3;
    // Only assign documents that have not been analyzed yet.
    if (!doc.analyzer_result) {
        // Random integer from 1..instances_count, emitted as a string key.
        // (Math.floor gives an even spread; Math.round would over-weight
        // the endpoints, and the original had a typo: instance_count.)
        emit('' + (Math.floor(Math.random() * instances_count) + 1), null);
    }
}
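As a sanity check, the assignment logic above can be exercised outside CouchDB by stubbing emit(). This is just a sketch; the fake document shapes are illustrative:

```javascript
// Stand-in for CouchDB's emit(): collect keys so we can inspect the spread.
var emitted = [];
function emit(key, value) { emitted.push(key); }

// The map function from above.
function map(doc) {
    var instances_count = 3;
    if (!doc.analyzer_result) {
        emit('' + (Math.floor(Math.random() * instances_count) + 1), null);
    }
}

// Feed it fake docs: unanalyzed ones get a key, an analyzed one does not.
for (var i = 0; i < 1000; i++) {
    map({ _id: 'doc' + i, analyzer_result: null });
}
map({ _id: 'done', analyzer_result: { label: 'cat' } });

// Every emitted key is one of '1', '2', '3'.
console.log(emitted.length);
console.log(emitted.every(function (k) {
    return ['1', '2', '3'].indexOf(k) >= 0;
}));
```

Running this shows 1000 keys emitted, all within the 1..3 range, with the already-analyzed document skipped.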

Assign each analyzer a number; that is the key it will request from
the database.  If the per-document analysis time exceeds the round-trip
time to CouchDB, this should be fine as is.  If the network time becomes
significant relative to the processing, buffer documents at the fetch
stage (query the view with include_docs=true and limit=200) and at the
save stage (use the _bulk_docs bulk update).
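A minimal sketch of what one analyzer's buffered fetch/save round trip could build. The database name (images), design-doc name (work), and view name (by_instance) are made up for illustration; the view is assumed to be the one above:

```javascript
// Hypothetical names: database 'images', design doc 'work', view 'by_instance'.
var instance = 2;     // this analyzer's assigned number
var batchSize = 200;  // buffer size for one round trip

// Fetch stage: one view query pulls a batch of documents for this analyzer.
// View keys are JSON, so the string key '2' is sent as %222%22.
var fetchUrl = 'http://localhost:5984/images/_design/work/_view/by_instance' +
    '?key=' + encodeURIComponent(JSON.stringify('' + instance)) +
    '&include_docs=true&limit=' + batchSize;

// Save stage: after analysis, write all results back in one _bulk_docs POST.
function bulkPayload(docs) {
    return {
        docs: docs.map(function (doc) {
            doc.analyzer_result = { done: true }; // placeholder result
            return doc;
        })
    };
}

console.log(fetchUrl);
console.log(JSON.stringify(bulkPayload([{ _id: 'a', _rev: '1-x' }])));
```

Once a document carries analyzer_result, the view stops emitting it, so it drops out of subsequent batches automatically.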

James