We've been using UIMA within a web service that processes one document at a 
time. The single document is sent as a SOAP attachment. Metadata configuring 
our UIMA components is included within the SOAP message. The service module 
(which can also be executed as a standalone application) gets a CAS from the 
pool, loads the document into the CAS, sequentially calls the requested UIMA 
components, pulls the resulting document(s) from the CAS and returns them to 
the service caller.
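
The request flow above can be sketched roughly as follows. This is plain Java with no UIMA dependency: DocumentHolder stands in for a CAS, and SingleDocumentService, process() and available() are hypothetical names used only for illustration, not our actual classes or the UIMA API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical stand-in for a CAS: holds one document plus its results. */
class DocumentHolder {
    String text = "";
    final List<String> annotations = new ArrayList<>();

    void reset() {
        text = "";
        annotations.clear();
    }
}

/** Hypothetical service module: one document per call, holders taken from a fixed pool. */
class SingleDocumentService {
    private final BlockingQueue<DocumentHolder> pool;

    SingleDocumentService(int poolSize) {
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new DocumentHolder());
        }
    }

    /** Loads the document, runs the requested components in order, returns the results. */
    List<String> process(String document, List<Consumer<DocumentHolder>> components) {
        DocumentHolder holder;
        try {
            holder = pool.take(); // blocks while every holder is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a holder", e);
        }
        try {
            holder.text = document;
            for (Consumer<DocumentHolder> component : components) {
                component.accept(holder);
            }
            // Copy the results out before the holder is reset and reused.
            return new ArrayList<>(holder.annotations);
        } finally {
            holder.reset();
            pool.add(holder); // always hand the holder back, even on failure
        }
    }

    /** Number of holders currently idle in the pool. */
    int available() {
        return pool.size();
    }
}
```

Returning the holder in a finally block keeps the pool from draining when a component throws.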

Rather than using a CPE, we wrote our own processing manager. We use an XML 
configuration file, similar in format to the CPE descriptor. The configuration 
file lists, in order, the readers, analysis engines, and consumers that need to 
be instantiated. It also allows for overriding the respective descriptors. For 
example, any parameters and resource file names can be changed via the 
configuration file. The processing manager reads in the descriptors, modifies 
the contents (if requested), and saves the modified descriptors as an array of 
InputStreams. The processing manager then creates the UIMA components and 
configures the CAS pool.
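
As a rough illustration of the descriptor override step, here is a minimal sketch using only the JDK XML classes. The DescriptorOverrides class and its rewrite() and toStream() methods are hypothetical names, and the sketch assumes that only string-valued parameters inside nameValuePair elements are being changed; a real descriptor may need more handling than this.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: applies parameter overrides to a descriptor held in memory. */
class DescriptorOverrides {

    /** Returns the descriptor XML with matching string parameters replaced. */
    static String rewrite(String descriptorXml, Map<String, String> overrides) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(descriptorXml)));

            NodeList pairs = doc.getElementsByTagName("nameValuePair");
            for (int i = 0; i < pairs.getLength(); i++) {
                Element pair = (Element) pairs.item(i);
                NodeList names = pair.getElementsByTagName("name");
                NodeList values = pair.getElementsByTagName("string");
                if (names.getLength() == 0 || values.getLength() == 0) {
                    continue; // not a string-valued parameter; left untouched
                }
                String name = names.item(0).getTextContent().trim();
                if (overrides.containsKey(name)) {
                    values.item(0).setTextContent(overrides.get(name));
                }
            }

            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite descriptor", e);
        }
    }

    /** Wraps the rewritten descriptor as an in-memory stream for later parsing. */
    static InputStream toStream(String descriptorXml) {
        return new ByteArrayInputStream(descriptorXml.getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping the modified descriptors in memory means the descriptor files on disk are never changed by a particular configuration.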

At this point we are not using any of the sofa and type input/output 
information to validate the order of execution; this processing manager 
is used for well-defined applications.

The processing manager uses the metadata to control which reader(s) and 
consumer(s) to execute. The web service will accept documents in multiple 
formats (XML, HTML, text). The getNext() method of the relevant reader pulls the 
document from the CAS and parses it (the metadata may also include XPath or 
similar instructions on how to find the plain text portions), placing the 
result into the CAS. After the analysis engines finish annotating, one or 
more consumers are executed in order to build the requested output (annotated 
original, annotated extract, annotations only, XMI, etc.).

Terry Heinze
Thomson Corporation R&D

-----Original Message-----
From: Christoph Büscher [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 15, 2008 8:31 AM
To: [email protected]
Subject: How to "push" documents into CPE/CollectionReader?

Hi,

so far I've always used UIMA CPEs to read whole collections of documents from 
e.g. a source directory. In a new application it will be necessary to run a CPE 
on new documents being passed to it by another application (outside UIMA). It 
would be nice to be able to simply hand single documents over to a collection 
reader and then "run/wake up" the CPE to process the document.

My idea was to put the incoming documents into a waiting queue, register this 
with a custom collection reader and then simply let the hasNext/getNext methods 
ask the queue if there is work to do. But when "hasNext()" in the collection 
reader returns "false", the CPE stops execution.
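
The queue idea can be sketched in plain Java as below. QueueBackedReader and its submit() and shutdown() methods are hypothetical and do not implement the UIMA CollectionReader interface. The sketch assumes hasNext() is allowed to block until a document arrives, so it only returns false after an explicit shutdown; whether a CPE tolerates a blocking hasNext() is not something this sketch establishes.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical reader that waits for pushed documents instead of reporting "no more". */
class QueueBackedReader {
    // Unique sentinel object, compared by identity, that marks the end of input.
    private static final String SHUTDOWN = new String("<shutdown>");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private String next;    // document fetched by hasNext() but not yet consumed
    private boolean closed; // set once the sentinel has been seen

    /** Called by the outside application to hand over one document. */
    void submit(String document) {
        queue.add(document);
    }

    /** Called once when no further documents will arrive. */
    void shutdown() {
        queue.add(SHUTDOWN);
    }

    /** Blocks until a document or the shutdown sentinel is available. */
    boolean hasNext() {
        if (closed) {
            return false;
        }
        if (next != null) {
            return true;
        }
        try {
            String item = queue.take();
            if (item == SHUTDOWN) {
                closed = true;
                return false;
            }
            next = item;
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            closed = true;
            return false;
        }
    }

    /** Returns the next document, waiting for one if necessary. */
    String getNext() {
        if (!hasNext()) {
            throw new NoSuchElementException("reader has been shut down");
        }
        String document = next;
        next = null;
        return document;
    }
}
```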

Is it possible to put a reader or the whole CPE into a "waiting" mode, or is 
the only solution to always restart the whole CPE once new documents have 
arrived to be processed? Has anybody dealt with a similar situation so far and 
has any "best practices" to share? How do you handle it?

Thanks,

--
--------------------------------
Christoph Büscher
