Good approach; we used JMS, but it's a similar concept, I believe.

Best,
h

On Wed, 30 Jul 2008, [EMAIL PROTECTED] wrote:

> Karin--
>
> I've done this using JavaSpaces. I guess you could do this with UIMA-AS,
> depending on how many machines you want to use.
>
> My goals were perhaps a little different from UIMA-AS's: I wanted to scale to
> hundreds of CPUs and have them be easily manageable. That meant they must
> not be specialized, but generic workers. At this level of scaling, the
> bottleneck becomes the network, even with Gigabit. Therefore, I must
> minimize the data on the network, which means the document must cross it only
> once, in its most compact form, and the results only once, in their most
> compact form. So, for my needs, passing around an XML-serialized CAS
> would use too much bandwidth.
>
> So in my system, the task that's put on the work queue is just the document's
> URL and the name of an aggregate descriptor. Each worker thread on the
> cluster pulls a task from the queue, uses the URL to pull the document
> directly from the source (HTTP, FTP, etc.), processes it through the
> aggregate completely locally (no networking or even interprocess
> communication on the same machine), and then inserts the results directly
> into their destination (a SQL database). This is the absolute minimum network
> consumption possible. With some tuning, I achieved 97% of the efficiency of
> the stand-alone library (i.e. without UIMA) operating on local files, and
> linear scalability.
>
> The trick to the efficiency is that the workers must pull tasks from a queue,
> not have the tasks pushed to them, since the client can't possibly know which
> workers need work (i.e., active load-balancing doesn't work, in my opinion).
> The client just puts tasks in the queue in the JavaSpace. What the client
> gets back is not results, but merely an indication of completion and maybe
> some metrics. Generic workers can come and go, and the system just goes
> faster or slower--no need for a human to try to allocate the right ratios of
> different kinds of workers. While there is work to be done, all workers are
> fully busy and the system is naturally perfectly balanced.
>
> Another difference in my requirements is that I don't care about speeding up
> a single document, only the throughput of processing thousands of documents.
> So if one of the annotators is slow, I don't care, and I don't need to
> parallelize within a single document. I only parallelize the job, which is
> easy--each document is an independent unit of work.
>
> I'd be glad to talk to you more if you're interested.
>
> Greg Holmberg
> 650.283.3416 Cell
>
>
> -------------- Original message ----------------------
> From: Karin Verspoor <[EMAIL PROTECTED]>
>
> > Does anyone have experience utilizing UIMA in a large processor
> > cluster to handle farming documents out for analysis by different
> > machines/processors?
> >
> > Is there any documentation somewhere on doing this in the most
> > straightforward manner?
> >
> > Thanks in advance for any assistance.
> >
> > Karin
> >
> > --
> > Karin Verspoor, PhD
> > Research Assistant Professor
> > Center for Computational Pharmacology, University of Colorado Denver
> > PO Box 6511, MS 8303, Aurora, CO 80045 USA
> > [EMAIL PROTECTED] / tel: (720) 279-4875 / campus: 4-3758
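For reference, here is a minimal sketch of the pull-based worker loop described above. It assumes an in-process BlockingQueue standing in for the shared JavaSpace or JMS queue; the Task and Worker classes and storeResults() are hypothetical names, while the UIMAFramework/AnalysisEngine calls are the standard UIMA Java API. In a real deployment the aggregate would be instantiated once per worker and reused, and the results would be written to the SQL database via JDBC.

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

/** A task is just the document's URL plus the path of an aggregate descriptor. */
final class Task {
    final String documentUrl;
    final String aggregateDescriptorPath;
    Task(String documentUrl, String aggregateDescriptorPath) {
        this.documentUrl = documentUrl;
        this.aggregateDescriptorPath = aggregateDescriptorPath;
    }
}

final class Worker implements Runnable {
    private final BlockingQueue<Task> queue;  // stands in for the shared JavaSpace/JMS queue

    Worker(BlockingQueue<Task> queue) { this.queue = queue; }

    public void run() {
        try {
            while (true) {
                Task task = queue.take();  // pull work; nothing is pushed to the worker

                // Build the aggregate locally from its descriptor
                // (a real worker would cache one engine per descriptor).
                ResourceSpecifier spec = UIMAFramework.getXMLParser()
                        .parseResourceSpecifier(new XMLInputSource(task.aggregateDescriptorPath));
                AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);

                // Fetch the document straight from its source, so it crosses
                // the network exactly once, in its original compact form.
                byte[] raw;
                try (InputStream in = new URL(task.documentUrl).openStream()) {
                    raw = in.readAllBytes();
                }

                CAS cas = ae.newCAS();
                cas.setDocumentText(new String(raw, StandardCharsets.UTF_8));
                ae.process(cas);          // all analysis happens inside this process

                storeResults(cas);        // hypothetical: compact results go straight to SQL
                ae.destroy();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // shut the worker down cleanly
        } catch (Exception e) {
            e.printStackTrace();          // in practice: report the failure and/or requeue the task
        }
    }

    private void storeResults(CAS cas) {
        // JDBC inserts of the extracted annotations would go here.
    }
}

The point of the sketch is the shape of the loop: a worker takes a task only when it is ready for one, so load balancing falls out of the queue itself and no central scheduler has to know which workers are free.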
