Karin--
I've done this using JavaSpaces. I guess you could do this with UIMA-AS,
depending on how many machines you want to use.
My goals were perhaps a little different from UIMA-AS's: I wanted to scale to
hundreds of CPUs and have them be easily manageable. That meant they must not
be specialized, but generic workers. At this level of scaling, the bottleneck
becomes the network, even with Gigabit. Therefore, I have to minimize the data
on the network, which means the document must cross it only once and in its
most compact form, and the results must cross it only once in their most
compact form. So, for my needs, passing around an XML-serialized CAS would use
too much bandwidth.
So in my system, the task that's put on the work queue is just the document's
URL and the name of an aggregate descriptor. Each worker thread on the cluster
pulls a task from the queue, uses the URL to pull the document directly from
the source (HTTP, FTP, etc.), processes it through the aggregate completely
locally (no networking, or even interprocess communication, on the same
machine), and then inserts the results directly into their destination (a SQL
database). This is the absolute minimum network consumption possible. With
some tuning, I achieved 97% of the efficiency of the stand-alone library
(i.e., without UIMA) operating on local files, and linear scalability.
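To give you a rough idea of the worker side, here's a sketch. The names
(DocumentWorker, storeResults) are just illustrative stand-ins, not my actual
code; the UIMA calls are the standard ones for instantiating an aggregate from
its descriptor and processing a CAS locally.

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

// One of these per worker thread; the aggregate and CAS are created once
// and reused, so all analysis stays in-process.
public class DocumentWorker {

    private final AnalysisEngine aggregate;
    private final CAS cas;

    public DocumentWorker(String aggregateDescriptorPath) throws Exception {
        // Parse the aggregate descriptor named in the task and instantiate
        // the engine locally--no remote services involved.
        ResourceSpecifier spec = UIMAFramework.getXMLParser()
                .parseResourceSpecifier(new XMLInputSource(aggregateDescriptorPath));
        aggregate = UIMAFramework.produceAnalysisEngine(spec);
        cas = aggregate.newCAS();
    }

    public void process(String documentUrl) throws Exception {
        // Pull the document straight from its source (HTTP, FTP, ...);
        // only the URL ever came over the network from the client.
        byte[] raw;
        try (InputStream in = new URL(documentUrl).openStream()) {
            raw = in.readAllBytes();
        }
        try {
            cas.setDocumentText(new String(raw, StandardCharsets.UTF_8));
            aggregate.process(cas);           // entirely local analysis
            storeResults(documentUrl, cas);   // results go straight to the database
        } finally {
            cas.reset();                      // reuse the CAS for the next task
        }
    }

    private void storeResults(String documentUrl, CAS cas) {
        // Stand-in: extract the annotations of interest and insert them
        // into the destination SQL database (e.g. JDBC batch inserts).
    }
}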
The trick to the efficiency is that the workers must pull tasks from a queue,
not have the tasks pushed to them, since the client can't possibly know which
workers need work (i.e., active load balancing doesn't work, in my opinion).
The client just puts tasks in the queue in the JavaSpace. What the client gets
back is not results, but merely an indication of completion and maybe some
metrics. Generic workers can come and go, and the system just goes faster or
slower--no need for a human to try to allocate the right ratios of different
kinds of workers. While there is work to be done, all workers are fully busy
and the system is naturally perfectly balanced.
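In JavaSpaces terms, the pattern looks roughly like the sketch below. Again,
DocumentTask, TaskDone, and DocumentJob are just illustrative names, not my
actual classes; the write/take calls are the standard JavaSpace API.

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class DocumentJob {

    // The task the client writes into the space: just the document's URL and
    // the name of an aggregate descriptor. Entry fields must be public.
    public static class DocumentTask implements Entry {
        public String documentUrl;
        public String aggregateDescriptor;
        public DocumentTask() {}
        public DocumentTask(String url, String descriptor) {
            this.documentUrl = url;
            this.aggregateDescriptor = descriptor;
        }
    }

    // What the client gets back: not results, just completion and a metric.
    public static class TaskDone implements Entry {
        public String documentUrl;
        public Long millisTaken;
        public TaskDone() {}
    }

    // Client side: drop all the tasks in the space, then collect one
    // completion token per task. Which worker takes which task is up to them.
    public static void submit(JavaSpace space, String[] urls, String descriptor)
            throws Exception {
        for (String url : urls) {
            space.write(new DocumentTask(url, descriptor), null, Lease.FOREVER);
        }
        for (int i = 0; i < urls.length; i++) {
            space.take(new TaskDone(), null, Long.MAX_VALUE);
        }
    }

    // Worker side: each thread loops forever, taking whatever task is
    // available. Idle workers block on take(), so the load balances itself.
    // (In practice the worker would be created from task.aggregateDescriptor.)
    public static void workerLoop(JavaSpace space, DocumentWorker worker)
            throws Exception {
        DocumentTask template = new DocumentTask();  // null fields match anything
        while (true) {
            DocumentTask task =
                    (DocumentTask) space.take(template, null, Long.MAX_VALUE);
            long start = System.currentTimeMillis();
            worker.process(task.documentUrl);   // the local processing shown above
            TaskDone done = new TaskDone();
            done.documentUrl = task.documentUrl;
            done.millisTaken = System.currentTimeMillis() - start;
            space.write(done, null, Lease.FOREVER);
        }
    }
}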
Another difference in my requirements is that I don't care about speeding up a
single document, only the throughput of processing thousands of documents. So
if one of the annotators is slow, I don't care, and I don't need to
parallelize within a single document. I only parallelize the job, which is
easy--each document is an independent unit of work.
I'd be glad to talk to you more if you're interested.
Greg Holmberg
650.283.3416 Cell
-------------- Original message ----------------------
From: Karin Verspoor <[EMAIL PROTECTED]>
Does anyone have experience utilizing UIMA in a large processor
cluster to handle farming documents out for analysis by different
machines/processors?
Is there any documentation somewhere on doing this in the most
straightforward manner?
Thanks in advance for any assistance.
Karin
--
Karin Verspoor, PhD
Research Assistant Professor
Center for Computational Pharmacology, University of Colorado Denver
PO Box 6511, MS 8303, Aurora, CO 80045 USA
[EMAIL PROTECTED] / tel: (720) 279-4875 / campus: 4-3758