On Sun, 15 Nov 2015, at 07:46 PM, Dale Scott wrote:
> Hi, I've been lurking for a while and have a use case and architecture that
> I'd appreciate comments on. I've never personally built anything like this
> before.
>
> Without intentionally obfuscating, I have 128GB of data collected from an
> experiment, roughly equivalent to a large set of 640x480 PNG images. Images
> are independent and analyzed image-by-image by an image recognition
> algorithm. I was thinking of dividing the set of images into sub-sets by a
> scheduler and have a new EC2 instance analyze each sub-set.
>
> Are there any places in this scenario where couchdb would shine? Replicating
> a master couchdb image recognition library to each new EC2 instance?
> Replicating the analysis results from each EC2 instance to a master couchdb
> database?
>
> Thanks!
>
> ---
> Dale R. Scott, P.Eng.
> Transparency with Trust

Welcome Dale!

This sounds roughly like a message-passing workflow:

- Jobs are inserted into the system
- N workers process Y jobs
- The results are stored (or collated...)

For a pure CouchDB approach, see:

https://github.com/iriscouch/cqs
https://github.com/jo/couch-daemon

The links in the last one in particular may be very interesting for your
obfuscated use case.

The general idea is to have workers actively pull jobs off a CouchDB
database, update each doc with a time-stamped reservation, and run a reaper
process that returns slow workers' docs to the queue so that another,
hopefully faster, worker can pick them up. Using this + attachments may work
well, or you may prefer to keep the queue in a separate db from the raw data.

However, you may find that something like RabbitMQ is easier here, or even a
hosted cloud equivalent (maybe AWS Lambda). If you want to keep the raw &
generated attachments in related docs (or in the same doc), CouchDB may be
the better fit.

I know a number of people, e.g. jhs@, who have successfully (ab)used CouchDB
as both a message queue and a backing store for this. It really depends on
whether you want to use CouchDB for everything, or have other needs that are
better served by a real message queue architecture, with CouchDB storing and
transferring the potentially large image/data attachments instead of
bloating the message queue.

I think the tradeoff is largely around what else you need to do, how much
data you are sending around, and whether you need a full-blown message queue
system or can hack up the equivalents you need with CouchDB.

A+
Dave
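
For illustration only, here is a minimal sketch of the reservation and reaper
idea described above, written as pure functions over job docs held as plain
Python dicts. It does not talk to CouchDB at all. The field names
(reserved_by, reserved_until, done), the function names claim and reap, and
the 300-second lease are all assumptions made for this sketch; they are not
taken from cqs or couch-daemon.

```python
from typing import Optional

LEASE_SECONDS = 300.0  # hypothetical lease length; tune to the slowest expected job


def claim(doc: dict, worker: str, now: float,
          lease: float = LEASE_SECONDS) -> Optional[dict]:
    """Return a copy of doc reserved for worker, or None if it is unavailable.

    A job is unavailable if it is finished or already carries a reservation.
    In a real setup the returned copy would be written back to the database;
    a 409 conflict on that write would mean another worker got there first.
    """
    if doc.get("done") or "reserved_until" in doc:
        return None
    return {**doc, "reserved_by": worker, "reserved_until": now + lease}


def reap(doc: dict, now: float) -> Optional[dict]:
    """Return a copy of doc with an expired reservation removed.

    Returns None when there is nothing to do: the job is finished, it is not
    reserved, or its reservation has not expired yet.
    """
    if doc.get("done") or "reserved_until" not in doc:
        return None
    if doc["reserved_until"] > now:
        return None
    return {k: v for k, v in doc.items()
            if k not in ("reserved_by", "reserved_until")}


if __name__ == "__main__":
    job = {"_id": "job-1", "image": "img-0001.png"}  # made-up example doc
    claimed = claim(job, "worker-a", now=1000.0, lease=60.0)
    print(claimed)
    print(reap(claimed, now=1061.0))  # lease expired, so the job is released
```

Returning copies rather than mutating the doc keeps each step a single
"read, modify, write back" update, which is the shape a worker or reaper
would need when the docs live in a database rather than in memory.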
