Karin--
I've done this using JavaSpaces. I guess you could do this with UIMA-AS, depending on how many machines you want to use. My goals were perhaps a little different from UIMA-AS's: I wanted to scale to hundreds of CPUs and have them be easily manageable. That meant they had to be generic workers, not specialized ones. At that level of scaling, the bottleneck becomes the network, even on Gigabit. Therefore, I had to minimize the data on the network, which means the document crosses it only once, in its most compact form, and the results cross it only once, in their most compact form. So, for my needs, passing around an XML-serialized CAS would use too much bandwidth.

Instead, in my system, the task that's put on the work queue is just the document's URL and the name of an aggregate descriptor. Each worker thread on the cluster pulls a task from the queue, uses the URL to pull the document directly from the source (HTTP, FTP, etc.), processes it through the aggregate entirely locally (no networking, not even interprocess communication on the same machine), and then inserts the results directly into their destination (a SQL database). That is the absolute minimum network consumption possible. With some tuning, I achieved 97% of the efficiency of the stand-alone library (i.e., without UIMA) operating on local files, and linear scalability.

The trick to the efficiency is that the workers must pull tasks from a queue rather than have tasks pushed to them, since the client can't possibly know which workers need work (i.e., active load-balancing doesn't work, in my opinion). The client just puts tasks into the queue in the JavaSpace. What the client gets back is not results, but merely an indication of completion and maybe some metrics. Generic workers can come and go, and the system just goes faster or slower--no need for a human to try to allocate the right ratios of different kinds of workers.
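Real JavaSpaces code needs a running Jini space service, so here's just a rough in-process sketch of the pull model in plain Java, with a BlockingQueue standing in for the space and for the network (the Task record, class names, and URLs are mine, purely illustrative):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// A task carries only the document's URL and the aggregate descriptor's
// name -- the smallest possible payload. No CAS ever crosses the queue;
// each worker fetches the document from its source itself.
record Task(String documentUrl, String descriptorName) {}

class PullQueueDemo {
    static int runJob(Iterable<Task> tasks, int workers) {
        BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
        for (Task t : tasks) queue.add(t);   // the client just writes tasks into the "space"
        AtomicInteger done = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                Task t;
                // Workers PULL: each takes the next task the moment it's free,
                // so the load balances itself -- nobody has to know who's idle.
                while ((t = queue.poll()) != null) {
                    // Here a real worker would fetch t.documentUrl(), run the
                    // aggregate entirely locally, and insert results straight
                    // into the database. All of that is elided.
                    done.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // The client only learns that work is complete, never the results.
        return done.get();
    }
}
```

In the real system the `queue.poll()` becomes a blocking `take()` against the JavaSpace with a Task template, and workers run as independent processes on the cluster, but the shape is the same.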
While there is work to be done, all workers are fully busy and the system is naturally perfectly balanced.

Another difference in my requirements is that I don't care about speeding up a single document, only about the throughput of processing thousands of documents. So if one of the annotators is slow, I don't care, and I don't need to parallelize within a single document. I only parallelize the job, which is easy--each document is an independent unit of work.

I'd be glad to talk to you more if you're interested.

Greg Holmberg
650.283.3416 Cell

-------------- Original message ----------------------
From: Karin Verspoor <[EMAIL PROTECTED]>
> Does anyone have experience utilizing UIMA in a large processor
> cluster to handle farming documents out for analysis by different
> machines/processors?
>
> Is there any documentation somewhere on doing this in the most
> straightforward manner?
>
> Thanks in advance for any assistance.
>
> Karin
>
> --
> Karin Verspoor, PhD
> Research Assistant Professor
> Center for Computational Pharmacology, University of Colorado Denver
> PO Box 6511, MS 8303, Aurora, CO 80045 USA
> [EMAIL PROTECTED] / tel: (720) 279-4875 / campus: 4-3758
