Thank you all ... wouldn't you know that you have already solved the
problem! I missed the UIMA-AS announcement as I was on travel last
week, but I will definitely look into it, as well as Hadoop.
Thanks again,
Karin
On Jul 31, 2008, at 7:03 AM, Eddie Epstein wrote:
Hi Greg,
The system you describe using JavaSpaces is exactly one of the scenarios
UIMA AS is designed to support: a farm of workers that pull jobs off a
queue, then access the documents directly and store the results directly.
Yes, the "job" must be sent to the worker queue as a small CAS that
specifies one or more documents to process. Each worker would include a
CAS multiplier delegate that acts as the collection reader. The worker
can be a standard single-threaded UIMA aggregate with no additional
overhead, just as you describe. The "job" CAS would then be returned
with status information.
In some situations the worker itself should be scaled to achieve better
utilization of a multi-core machine, for example when the entire worker
aggregate cannot be replicated because one or more of the delegates is
too memory-intensive. Here the worker can be configured as an
asynchronous aggregate, sharing fewer instances of memory-intensive
delegates and more instances of other delegates. As long as all
components are colocated in the same JVM, the in-memory CAS objects are
shared, although the "call" to each async delegate via a queue takes
longer than a normal subroutine call.
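The idea of running few instances of a memory-intensive delegate next to
many instances of the lighter ones can be sketched in plain Java. This
is only an analogy under stated assumptions: it does not use the UIMA AS
API or its deployment descriptors, the class and method names are
hypothetical, and two thread pools of different sizes stand in for the
replicated delegates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java analogy only; this is not the UIMA AS API.
class AsyncAggregateSketch {

    /** Sends each document through a replicated light stage, then a single heavy stage. */
    static List<String> process(List<String> documents) {
        ExecutorService lightPool = Executors.newFixedThreadPool(4); // many cheap instances
        ExecutorService heavyPool = Executors.newFixedThreadPool(1); // one memory-hungry instance
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (String doc : documents) {
                pending.add(CompletableFuture
                        .supplyAsync(() -> doc.trim().toLowerCase(), lightPool)   // light delegate
                        .thenApplyAsync(text -> text + " [tagged]", heavyPool));  // heavy delegate
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : pending) {
                results.add(future.join()); // collected in submission order
            }
            return results;
        } finally {
            lightPool.shutdown();
            heavyPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("  Hello World ", "UIMA AS")));
    }
}
```

The hand-off between the two pools plays the role of the queue between
async delegates: it costs more than a direct method call, but the data
stays in the same JVM and is never serialized.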
Regards,
Eddie
On Wed, Jul 30, 2008 at 7:08 PM, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Karin--
I've done this using JavaSpaces. I guess you could do this with UIMA-AS,
depending on how many machines you want to use.
My goals were perhaps a little different from UIMA-AS's: I wanted to
scale to hundreds of CPUs and have them be easily manageable. That meant
they must not be specialized, but generic workers. At this level of
scaling, the bottleneck becomes the network, even with Gigabit Ethernet.
Therefore, I must minimize the data on the network, which means the
document must be on it only once and in its most compact form, and the
results on it only once in their most compact form. So, for my needs,
passing around an XML-serialized CAS would use too much bandwidth.
So in my system, the task that's put on the work queue is just the
document's URL and the name of an aggregate descriptor. Each worker
thread on the cluster pulls a task from the queue, uses the URL to pull
the document directly from the source (HTTP, FTP, etc.), processes it
through the aggregate completely locally (no networking or even
interprocess communication on the same machine), and then inserts the
results directly into their destination (a SQL database). This is the
absolute minimum network consumption possible. With some tuning, I
achieved 97% of the efficiency of the stand-alone library (i.e., without
UIMA) operating on local files, and linear scalability.
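A rough sketch of that worker loop, for illustration only. It is not the
actual JavaSpaces code: a BlockingQueue stands in for the JavaSpace, a
Map stands in for the SQL database, the fetch and analysis steps are
passed in as functions, and every name here (Task, STOP, runWorker) is
hypothetical. The descriptor name is carried in the task but not
interpreted.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// All names are hypothetical. A BlockingQueue stands in for the JavaSpace,
// and a Map stands in for the SQL database.
class UrlTaskWorkerSketch {

    /** A task carries only pointers: where the document is and which aggregate to run. */
    static final class Task {
        final String documentUrl;
        final String aggregateDescriptor; // carried along, not interpreted in this sketch

        Task(String documentUrl, String aggregateDescriptor) {
            this.documentUrl = documentUrl;
            this.aggregateDescriptor = aggregateDescriptor;
        }
    }

    /** Sentinel task that tells a worker to leave its loop. */
    static final Task STOP = new Task("", "");

    /** Pulls tasks until STOP and returns how many documents were completed. */
    static int runWorker(BlockingQueue<Task> queue,
                         Function<String, String> fetch,
                         Function<String, String> analyze,
                         Map<String, String> resultStore) {
        int completed = 0;
        while (true) {
            Task task;
            try {
                task = queue.take();                 // the worker pulls; nothing is pushed to it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return completed;
            }
            if (task == STOP) {
                return completed;                    // only a count goes back to the client
            }
            String document = fetch.apply(task.documentUrl); // document crosses the network once
            String result = analyze.apply(document);         // processing stays in-process
            resultStore.put(task.documentUrl, result);       // result goes straight to its destination
            completed++;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
        queue.add(new Task("http://example.org/a.txt", "MyAggregate.xml"));
        queue.add(new Task("http://example.org/b.txt", "MyAggregate.xml"));
        queue.add(STOP);

        Map<String, String> store = new ConcurrentHashMap<>();
        int done = runWorker(queue,
                url -> "contents of " + url,         // fake fetch, no real network access
                doc -> String.valueOf(doc.length()), // fake analysis: document length
                store);
        System.out.println(done + " documents processed, results: " + store);
    }
}
```

The point of the shape is that the queue entry stays a few dozen bytes
no matter how large the document is.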
The trick to the efficiency is that the workers must pull tasks from a
queue, not have the tasks pushed to them, since the client can't
possibly know which workers need work (i.e., active load-balancing
doesn't work, in my opinion). The client just puts tasks in the queue in
the JavaSpace. What the client gets back is not results, but merely an
indication of completion and maybe some metrics. Generic workers can
come and go, and the system just goes faster or slower--no need for a
human to try to allocate the right ratios of different kinds of workers.
While there is work to be done, all workers are fully busy and the
system is naturally perfectly balanced.
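To illustrate why pulling balances itself, here is a small simulation
under assumed conditions: an in-memory queue replaces the JavaSpace,
Thread.sleep replaces real document processing, and the class name and
delay values are made up. Workers of different speeds drain one shared
queue, and the caller only gets back a per-worker completion count.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a shared in-memory queue stands in for the JavaSpace.
class PullBalancingSketch {

    /**
     * Starts one worker per entry in workerDelaysMs (simulated cost per task),
     * lets them drain a shared queue, and returns how many tasks each completed.
     */
    static int[] run(int taskCount, long[] workerDelaysMs) {
        Queue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < taskCount; i++) {
            queue.add(i);
        }

        AtomicInteger[] completed = new AtomicInteger[workerDelaysMs.length];
        ExecutorService pool = Executors.newFixedThreadPool(workerDelaysMs.length);
        for (int w = 0; w < workerDelaysMs.length; w++) {
            completed[w] = new AtomicInteger();
            final int id = w;
            pool.execute(() -> {
                // Each worker asks for work when it is free; nobody assigns work to it.
                while (queue.poll() != null) {
                    try {
                        Thread.sleep(workerDelaysMs[id]);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    completed[id].incrementAndGet();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        int[] counts = new int[completed.length];
        for (int w = 0; w < counts.length; w++) {
            counts[w] = completed[w].get();
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = run(60, new long[] {2, 10});
        System.out.println("fast worker: " + counts[0] + ", slow worker: " + counts[1]);
    }
}
```

The exact split varies from run to run, but the faster worker normally
ends up with more of the tasks without anyone deciding that in advance,
and every task is completed exactly once.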
Another difference in my requirements is that I don't care about
speeding up a single document, only the throughput of processing
thousands of documents. So if one of the annotators is slow, I don't
care, and I don't need to parallelize within a single document. I only
parallelize the job, which is easy--each document is an independent unit
of work.
I'd be glad to talk to you more if you're interested.
Greg Holmberg
650.283.3416 Cell
--
Karin Verspoor, PhD
Research Assistant Professor
Center for Computational Pharmacology, University of Colorado Denver
PO Box 6511, MS 8303, Aurora, CO 80045 USA
[EMAIL PROTECTED] / tel: (720) 279-4875 / campus: 4-3758