Good approach; we used JMS, but it's a similar concept, I believe.

best

h


On Wed, 30 Jul 2008, [EMAIL PROTECTED] wrote:

> Karin--
>
>
> I've done this using JavaSpaces.  I guess you could do this with UIMA-AS, 
> depending on how many machines you want to use.
>
> My goals were perhaps a little different from UIMA-AS's: I wanted to scale to 
> hundreds of CPUs and have them be easily manageable.  That meant they must 
> not be specialized, but generic workers.  At this level of scaling, the 
> bottleneck becomes the network, even with Gigabit.  Therefore, I must 
> minimize the data on the network, which means the document must be on it only 
> once and in its most compact form, and the results on it only once in their 
> most compact form.  So, for my needs, passing around an XML-serialized CAS 
> would use too much bandwidth.
>
> So in my system, the task that's put on the work queue is just the document's 
> URL and the name of an aggregate descriptor.  Each worker thread on the 
> cluster pulls a task from the queue, uses the URL to pull the document 
> directly from the source (HTTP, FTP, etc.), processes it through the 
> aggregate completely locally (no networking or even interprocess 
> communication on the same machine), and then inserts the results directly to 
> their destination (a SQL database).  This is the absolute minimum network 
> consumption possible.  With some tuning, I achieved 97% of the efficiency of the 
> stand-alone library (i.e. without UIMA) operating on local files, and linear 
> scalability.
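
[A minimal sketch of the scheme described above, using plain Java rather than the actual JavaSpaces API; the `Task` record, URL, and descriptor name are illustrative placeholders, not the real system's types:]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// The unit of work carries only a document URL and the name of an aggregate
// descriptor, so the document itself never travels inside the task.
record Task(String documentUrl, String descriptorName) {}

public class Worker {
    static String process(Task t) {
        // In the real system this would: fetch the document from t.documentUrl()
        // (HTTP, FTP, etc.), run it through the named UIMA aggregate entirely
        // in-process, and insert the results directly into a SQL database.
        return "processed " + t.documentUrl() + " with " + t.descriptorName();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
        queue.put(new Task("http://example.org/doc1.txt", "MyAggregate.xml"));
        Task t = queue.take();  // the worker pulls; nothing is pushed to it
        System.out.println(process(t));
    }
}
```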
>
> The trick to the efficiency is that the workers must pull tasks from a queue, 
> not have the tasks pushed to them, since the client can't possibly know which 
> workers need work (i.e., active load-balancing doesn't work, in my opinion).  
> The client just puts tasks in the queue in the JavaSpace.  What the client 
> gets back is not results, but merely an indication of completion and maybe 
> some metrics.  Generic workers can come and go, and the system just goes 
> faster or slower--no need for a human to try to allocate the right ratios of 
> different kinds of workers.  While there is work to be done, all workers are 
> full busy and the system is naturally perfectly balanced.
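
[The pull model can be sketched with plain java.util.concurrent in place of a JavaSpace; this is a toy illustration, not the production code: each worker takes a task whenever it is free, so a faster worker simply takes more tasks and the load balances itself with no central scheduler:]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PullBalance {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) tasks.put(i);  // client enqueues tasks

        int nWorkers = 4;  // generic workers can come and go
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        for (int w = 0; w < nWorkers; w++) {
            pool.submit(() -> {
                // Pull until the queue is empty; while work remains,
                // every worker stays fully busy.
                while (tasks.poll() != null) {
                    done.incrementAndGet();  // stand-in for real processing
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("completed " + done.get() + " tasks");
    }
}
```

[Adding or removing workers changes only the throughput, not the correctness: no human allocates ratios of worker types.]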
>
> Another difference in my requirements is that I don't care about speeding up 
> a single document, only the throughput of processing thousands of documents.  
> So if one of the annotators is slow, I don't care, and I don't need to 
> parallelize within a single document.  I only parallelize the job, which is 
> easy--each document is an independent unit of work.
>
> I'd be glad to talk to you more if you're interested.
>
> Greg Holmberg
> 650.283.3416 Cell
>
>
>  -------------- Original message ----------------------
> From: Karin Verspoor <[EMAIL PROTECTED]>
> > Does anyone have experience utilizing UIMA in a large processor
> > cluster to handle farming documents out for analysis by different
> > machines/processors?
> >
> > Is there any documentation somewhere on doing this in the most
> > straightforward manner?
> >
> > Thanks in advance for any assistance.
> >
> > Karin
> >
> > --
> > Karin Verspoor, PhD
> > Research Assistant Professor
> > Center for Computational Pharmacology, University of Colorado Denver
> > PO Box 6511, MS 8303, Aurora, CO 80045 USA
> > [EMAIL PROTECTED] / tel: (720) 279-4875 / campus: 4-3758
> >
> >
> >
> >
> >
>
