Re: Parallelizing UIMA

Karin Verspoor Thu, 31 Jul 2008 10:52:51 -0700

Thank you all ... wouldn't you know that you have already solved the problem! I missed the UIMA-AS announcement as I was on travel last week, but I will definitely look into it, as well as Hadoop.


Thanks again,
Karin


On Jul 31, 2008, at 7:03 AM, Eddie Epstein wrote:

Hi Greg,
The system you describe using JavaSpaces is exactly one of the scenarios
UIMA AS is designed to support: a farm of workers that pull jobs off a
queue, then access the documents directly and store the results directly.
Yes, the "job" must be sent to the worker queue as a small CAS which
specifies one or more documents to process. Each worker would include a
CAS multiplier delegate that acts as the collection reader. The worker can
be a standard single-threaded UIMA aggregate, no additional overhead, just
as you describe. The "job" CAS would then be returned with status
information.
In some situations the worker itself should be scaled to achieve better
utilization of a multi-core machine, for example when the entire worker
aggregate cannot be replicated because one or more of the delegates is too
memory intensive. Here the worker can be configured as an asynchronous
aggregate, sharing fewer instances of memory intensive delegates and more
instances of other delegates. As long as all components are colocated in
the same JVM, the in-memory CAS objects are shared, although the "call" to
each async delegate via a queue is longer than a normal subroutine call.
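
The job flow described above (a small "job" naming one or more documents, a multiplier that expands it into per-document work, and the job returned with status information) can be sketched roughly as follows. This is plain Python rather than the UIMA API; `Job`, `multiply`, `run_job`, and the `process_document` callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Job:
    """Hypothetical stand-in for the small "job" CAS: it only names documents."""
    document_urls: List[str]
    status: dict = field(default_factory=dict)


def multiply(job: Job) -> Iterator[str]:
    """Plays the role of the multiplier: one job in, one work item per document out."""
    for url in job.document_urls:
        yield url


def run_job(job: Job, process_document: Callable[[str], None]) -> Job:
    """Process every document the job names, then return the job with status filled in."""
    processed = 0
    failed = []
    for url in multiply(job):
        try:
            # Fetching, analysis, and storing results all happen inside the worker.
            process_document(url)
            processed += 1
        except Exception as exc:
            failed.append((url, str(exc)))
    job.status = {"processed": processed, "failed": failed}
    return job
```

Only the document references and the final status cross the queue in this sketch; the document contents and the analysis results never do.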

Regards,
Eddie

On Wed, Jul 30, 2008 at 7:08 PM, [EMAIL PROTECTED] <
[EMAIL PROTECTED]> wrote:
Karin--
I've done this using JavaSpaces. I guess you could do this with UIMA-AS,
depending on how many machines you want to use.
My goals were perhaps a little different than UIMA-AS's: I wanted to scale
to hundreds of CPUs and have them be easily manageable. That meant they
must not be specialized, but generic workers. At this level of scaling, the
bottleneck becomes the network, even if using Gigabit. Therefore, I must
minimize the data on the network, which means the document must be on it
only once and in its most compact form, and the results on it only once in
their most compact form. So, for my needs, passing around an XML-serialized
CAS would use too much bandwidth.

So in my system, the task that's put on the work queue is just the
document's URL and the name of an aggregate descriptor. Each worker thread
on the cluster pulls a task from the queue, uses the URL to pull the
document directly from the source (HTTP, FTP, etc.), processes it through
the aggregate completely locally (no networking or even interprocess
communication on the same machine), and then inserts the results directly
into their destination (a SQL database). This is the absolute minimum
network consumption possible. With some tuning, I achieved 97% of the
efficiency of the stand-alone library (i.e. without UIMA) operating on
local files, and linear scalability.
The trick to the efficiency is that the workers must pull tasks from a
queue, not have the tasks pushed to them, since the client can't possibly
know which workers need work (i.e., active load-balancing doesn't work, in
my opinion). The client just puts tasks in the queue in the JavaSpace. What
the client gets back is not results, but merely an indication of completion
and maybe some metrics. Generic workers can come and go, and the system
just goes faster or slower--no need for a human to try to allocate the
right ratios of different kinds of workers. While there is work to be done,
all workers are fully busy and the system is naturally perfectly balanced.
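
As a rough illustration of the pull model described above, here is a minimal sketch using Python's standard `queue` and `threading` modules in place of a JavaSpace. The `Task` fields mirror the description (a document URL plus an aggregate descriptor name); `fetch`, `analyze`, and `store` are hypothetical callables standing in for the document source, the local aggregate, and the SQL database.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Task:
    """The entire task: where the document lives and which aggregate to run."""
    url: str
    descriptor: str


def worker(tasks: queue.Queue, done: queue.Queue,
           fetch: Callable, analyze: Callable, store: Callable) -> None:
    """Generic worker: pull a task, do everything locally, report completion only."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work for this worker
            return
        try:
            document = fetch(task.url)                    # pulled directly from the source
            results = analyze(task.descriptor, document)  # runs entirely inside this worker
            store(task.url, results)                      # written directly to the destination
            done.put((task.url, None))                    # the client sees completion, not results
        except Exception as exc:
            done.put((task.url, str(exc)))


def run(task_list: Iterable[Task], n_workers: int,
        fetch: Callable, analyze: Callable, store: Callable) -> List[tuple]:
    """Client side: put tasks on the queue and collect completion notices."""
    tasks: queue.Queue = queue.Queue()
    done: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, done, fetch, analyze, store))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in task_list:
        tasks.put(task)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    notices = []
    while not done.empty():
        notices.append(done.get())
    return notices
```

Because each worker takes the next task only when it is free, a slow worker simply takes fewer tasks; the client never has to decide who gets what.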
Another difference in my requirements is that I don't care about speeding
up a single document, only the throughput of processing thousands of
documents. So if one of the annotators is slow, I don't care, and I
don't need to parallelize within a single document. I only parallelize the
job, which is easy--each document is an independent unit of work.

I'd be glad to talk to you more if you're interested.

Greg Holmberg
650.283.3416 Cell


--
Karin Verspoor, PhD
Research Assistant Professor
Center for Computational Pharmacology, University of Colorado Denver
PO Box 6511, MS 8303, Aurora, CO 80045 USA
[EMAIL PROTECTED] / tel: (720) 279-4875 / campus: 4-3758
