Hi Greg,

The system you describe using JavaSpaces is exactly one of the scenarios UIMA AS is designed to support: a farm of workers that pull jobs off a queue, then access the documents directly and store the results directly. Yes, the "job" must be sent to the worker queue as a small CAS that specifies one or more documents to process. Each worker would include a CAS multiplier delegate that acts as the collection reader. The worker can be a standard single-threaded UIMA aggregate, with no additional overhead, just as you describe. The "job" CAS would then be returned with status information.
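Roughly, such a CAS multiplier delegate could look like the sketch below. This is not taken from the UIMA AS examples; in particular, the convention of carrying one document URL per line of the job CAS's document text is only an assumption for illustration, and a real deployment would define its own job type system.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.AbstractCas;
    import org.apache.uima.jcas.JCas;

    public class JobCasMultiplier extends JCasMultiplier_ImplBase {

      // URLs extracted from the incoming "job" CAS, waiting to be emitted as document CASes.
      private final Deque<String> pendingUrls = new ArrayDeque<>();

      @Override
      public void process(JCas jobCas) throws AnalysisEngineProcessException {
        // Hypothetical convention: the small job CAS carries one document URL per line
        // of its document text. A real deployment might use a dedicated feature structure.
        String text = jobCas.getDocumentText();
        if (text == null) {
          return;
        }
        for (String url : text.split("\\R")) {
          if (!url.isBlank()) {
            pendingUrls.add(url.trim());
          }
        }
      }

      @Override
      public boolean hasNext() {
        return !pendingUrls.isEmpty();
      }

      @Override
      public AbstractCas next() throws AnalysisEngineProcessException {
        String url = pendingUrls.poll();
        JCas docCas = getEmptyJCas();   // fresh CAS for one document
        try (InputStream in = new URL(url).openStream()) {
          // The worker fetches the document directly from its source, as in the scenario above.
          docCas.setDocumentText(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } catch (IOException e) {
          docCas.release();
          throw new AnalysisEngineProcessException(e);
        }
        return docCas;
      }
    }

The downstream delegates of the aggregate then see one full-sized CAS per document, while only the small job CAS ever travels over the queue.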
In some situations the worker itself should be scaled to achieve better utilization of a multi-core machine, for example when the entire worker aggregate cannot be replicated because one or more of its delegates is too memory-intensive. Here the worker can be configured as an asynchronous aggregate, sharing fewer instances of the memory-intensive delegates and running more instances of the others. As long as all components are co-located in the same JVM, the in-memory CAS objects are shared, although the "call" to each async delegate goes through a queue and so takes longer than a normal subroutine call.

Regards,
Eddie

On Wed, Jul 30, 2008 at 7:08 PM, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> Karin--
>
> I've done this using JavaSpaces. I guess you could do this with UIMA-AS, depending on how many machines you want to use.
>
> My goals were perhaps a little different from UIMA-AS's: I wanted to scale to hundreds of CPUs and have them be easily manageable. That meant the workers could not be specialized; they had to be generic. At that level of scaling, the bottleneck becomes the network, even with Gigabit. I therefore had to minimize the data on the network, which means the document must cross it only once, in its most compact form, and the results must cross it only once, in their most compact form. So, for my needs, passing around an XML-serialized CAS would use too much bandwidth.
>
> So in my system, the task that's put on the work queue is just the document's URL and the name of an aggregate descriptor. Each worker thread on the cluster pulls a task from the queue, uses the URL to fetch the document directly from its source (HTTP, FTP, etc.), processes it through the aggregate completely locally (no networking or even interprocess communication on the same machine), and then inserts the results directly into their destination (a SQL database). This is the absolute minimum network consumption possible. With some tuning, I achieved 97% of the efficiency of the stand-alone library (i.e. without UIMA) operating on local files, and linear scalability.
>
> The trick to the efficiency is that the workers must pull tasks from a queue, not have tasks pushed to them, since the client can't possibly know which workers need work (i.e., active load balancing doesn't work, in my opinion). The client just puts tasks in the queue in the JavaSpace. What the client gets back is not results, but merely an indication of completion and maybe some metrics. Generic workers can come and go, and the system just goes faster or slower; there is no need for a human to try to allocate the right ratios of different kinds of workers. While there is work to be done, all workers are fully busy and the system is naturally perfectly balanced.
>
> Another difference in my requirements is that I don't care about speeding up a single document, only about the throughput of processing thousands of documents. So if one of the annotators is slow, I don't care, and I don't need to parallelize within a single document. I only parallelize the job, which is easy: each document is an independent unit of work.
>
> I'd be glad to talk to you more if you're interested.
>
> Greg Holmberg
> 650.283.3416 Cell
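For comparison, the pull-based worker loop Greg describes could be sketched roughly like this. It is not his JavaSpaces code: a java.util.concurrent.BlockingQueue stands in for the JavaSpace, and the Task record and storeResults method are hypothetical placeholders. The point is only that each worker takes a small task (URL plus descriptor name), does all processing locally in its own JVM, and writes the compact results straight to their destination.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.BlockingQueue;

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.util.XMLInputSource;

    public class PullWorker implements Runnable {

      /** Hypothetical task shape: just a document URL and an aggregate descriptor path. */
      public record Task(String documentUrl, String descriptorPath) {}

      private final BlockingQueue<Task> queue;   // stand-in for the JavaSpace

      public PullWorker(BlockingQueue<Task> queue) {
        this.queue = queue;
      }

      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          Task task;
          try {
            task = queue.take();                 // pull work; nothing is pushed to the worker
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // shut the worker down cleanly
            return;
          }
          try {
            processTask(task);
          } catch (Exception e) {
            e.printStackTrace();                 // real code would report per-task status
          }
        }
      }

      private void processTask(Task task) throws Exception {
        // Build (or, in real code, cache) the aggregate named by the task.
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(
            UIMAFramework.getXMLParser().parseAnalysisEngineDescription(
                new XMLInputSource(task.descriptorPath())));

        // Fetch the document directly from its source; this is the only time it crosses the network.
        String text;
        try (InputStream in = new URL(task.documentUrl()).openStream()) {
          text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }

        // Process entirely inside this JVM, then write results straight to their destination.
        CAS cas = ae.newCAS();
        cas.setDocumentText(text);
        ae.process(cas);
        storeResults(cas);                       // hypothetical placeholder, e.g. JDBC inserts
        cas.release();
        ae.destroy();
      }

      private void storeResults(CAS cas) {
        // Placeholder for writing annotations to a SQL database.
      }
    }

Because every worker blocks on take(), workers can join or leave at any time and the available work simply spreads across whoever is listening, which is the natural load balancing Greg describes.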
