Karin--
I've done this using JavaSpaces. I guess you could do this with UIMA-AS,
depending on how many machines you want to use.
My goals were perhaps a little different from UIMA-AS's: I wanted to scale to
hundreds of CPUs and have them be easily manageable. That meant they must not
be specialized, but generic workers. At this level of scaling, the bottleneck
becomes the network, even with Gigabit. Therefore, I have to minimize the data
on the network, which means the document must cross it only once and in its
most compact form, and the results must cross it only once in their most
compact form. So, for my needs, passing around an XML-serialized CAS would use
too much bandwidth.
So in my system, the task that's put on the work queue is just the document's
URL and the name of an aggregate descriptor. Each worker thread on the cluster
pulls a task from the queue, uses the URL to pull the document directly from
the source (HTTP, FTP, etc.), processes it through the aggregate completely
locally (no networking, or even interprocess communication, on the same
machine), and then inserts the results directly into their destination (a SQL
database). This is the absolute minimum network consumption possible. With
some tuning, I achieved 97% of the efficiency of the stand-alone library
(i.e., without UIMA) operating on local files, and linear scalability.
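To give you a rough idea of the worker side, here's a sketch. The names
(DocumentWorker, storeResults) are just illustrative stand-ins, not my actual
code; the UIMA calls are the standard ones for instantiating an aggregate from
its descriptor and processing a CAS locally.

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

// One of these per worker thread; the aggregate and CAS are created once
// and reused, so all analysis stays in-process.
public class DocumentWorker {

    private final AnalysisEngine aggregate;
    private final CAS cas;

    public DocumentWorker(String aggregateDescriptorPath) throws Exception {
        // Parse the aggregate descriptor named in the task and instantiate
        // the engine locally--no remote services involved.
        ResourceSpecifier spec = UIMAFramework.getXMLParser()
                .parseResourceSpecifier(new XMLInputSource(aggregateDescriptorPath));
        aggregate = UIMAFramework.produceAnalysisEngine(spec);
        cas = aggregate.newCAS();
    }

    public void process(String documentUrl) throws Exception {
        // Pull the document straight from its source (HTTP, FTP, ...);
        // only the URL ever came over the network from the client.
        byte[] raw;
        try (InputStream in = new URL(documentUrl).openStream()) {
            raw = in.readAllBytes();
        }
        try {
            cas.setDocumentText(new String(raw, StandardCharsets.UTF_8));
            aggregate.process(cas);           // entirely local analysis
            storeResults(documentUrl, cas);   // results go straight to the database
        } finally {
            cas.reset();                      // reuse the CAS for the next task
        }
    }

    private void storeResults(String documentUrl, CAS cas) {
        // Stand-in: extract the annotations of interest and insert them
        // into the destination SQL database (e.g. JDBC batch inserts).
    }
}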
The trick to the efficiency is that the workers must pull tasks from a queue,
not have the tasks pushed to them, since the client can't possibly know which
workers need work (i.e., active load balancing doesn't work, in my opinion).
The client just puts tasks in the queue in the JavaSpace. What the client gets
back is not results, but merely an indication of completion and maybe some
metrics. Generic workers can come and go, and the system just goes faster or
slower--no need for a human to try to allocate the right ratios of different
kinds of workers. While there is work to be done, all workers are fully busy
and the system is naturally perfectly balanced.
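In JavaSpaces terms, the pattern looks roughly like the sketch below. Again,
DocumentTask, TaskDone, and DocumentJob are just illustrative names, not my
actual classes; the write/take calls are the standard JavaSpace API.

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class DocumentJob {

    // The task the client writes into the space: just the document's URL and
    // the name of an aggregate descriptor. Entry fields must be public.
    public static class DocumentTask implements Entry {
        public String documentUrl;
        public String aggregateDescriptor;
        public DocumentTask() {}
        public DocumentTask(String url, String descriptor) {
            this.documentUrl = url;
            this.aggregateDescriptor = descriptor;
        }
    }

    // What the client gets back: not results, just completion and a metric.
    public static class TaskDone implements Entry {
        public String documentUrl;
        public Long millisTaken;
        public TaskDone() {}
    }

    // Client side: drop all the tasks in the space, then collect one
    // completion token per task. Which worker takes which task is up to them.
    public static void submit(JavaSpace space, String[] urls, String descriptor)
            throws Exception {
        for (String url : urls) {
            space.write(new DocumentTask(url, descriptor), null, Lease.FOREVER);
        }
        for (int i = 0; i < urls.length; i++) {
            space.take(new TaskDone(), null, Long.MAX_VALUE);
        }
    }

    // Worker side: each thread loops forever, taking whatever task is
    // available. Idle workers block on take(), so the load balances itself.
    // (In practice the worker would be created from task.aggregateDescriptor.)
    public static void workerLoop(JavaSpace space, DocumentWorker worker)
            throws Exception {
        DocumentTask template = new DocumentTask();  // null fields match anything
        while (true) {
            DocumentTask task =
                    (DocumentTask) space.take(template, null, Long.MAX_VALUE);
            long start = System.currentTimeMillis();
            worker.process(task.documentUrl);   // the local processing shown above
            TaskDone done = new TaskDone();
            done.documentUrl = task.documentUrl;
            done.millisTaken = System.currentTimeMillis() - start;
            space.write(done, null, Lease.FOREVER);
        }
    }
}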
Another difference in my requirements is that I don't care about speeding up a
single document, only the throughput of processing thousands of documents. So
if one of the annotators is slow, I don't care, and I don't need to
parallelize within a single document. I only parallelize the job, which is
easy--each document is an independent unit of work.
I'd be glad to talk to you more if you're interested.
Greg Holmberg
650.283.3416 Cell
-------------- Original message ----------------------
From: Karin Verspoor <[EMAIL PROTECTED]>
Does anyone have experience utilizing UIMA in a large processor
cluster to handle farming documents out for analysis by different
machines/processors?
Is there any documentation somewhere on doing this in the most
straightforward manner?
Thanks in advance for any assistance.
Karin
--
Karin Verspoor, PhD
Research Assistant Professor
Center for Computational Pharmacology, University of Colorado Denver
PO Box 6511, MS 8303, Aurora, CO 80045 USA
[EMAIL PROTECTED] / tel: (720) 279-4875 / campus: 4-3758