[EMAIL PROTECTED] wrote:
We've been using UIMA within a web service that processes one document at a 
time. The single document is sent as a SOAP attachment. Metadata configuring 
our UIMA components is included within the SOAP message. The service module 
(which can also be executed as a standalone application) gets a CAS from the 
pool, loads the document into the CAS, sequentially calls the requested UIMA 
components, pulls the resulting document(s) from the CAS and returns them to 
the service caller.

Rather than using a CPE, we wrote our own processing manager. We use an XML 
configuration file, similar in format to the CPE descriptor. The 
configuration file lists, in order, the readers, analysis engines, and 
consumers that need to be instantiated. It also allows the respective 
descriptors to be overridden: for example, any parameters and resource file 
names can be changed via the configuration file. The processing manager 
reads in the descriptors, modifies their contents (if requested), and saves 
the modified descriptors as an array of InputStreams. It then creates the 
UIMA components and configures the CAS pool. At this point we are not using 
any of the sofa and type input/output information to validate the order of 
execution; this processing manager controller is used for well-defined 
applications.

The processing manager uses the metadata to control which reader(s) and 
consumer(s) to execute. The web service will accept documents in multiple 
formats (XML, HTML, text). The getNext() on the relevant reader pulls the 
document from the CAS and parses it (the metadata may also include XPath or 
similar instructions on how to find the plain text portions), placing the 
result into the CAS. After the analysis engines finish annotating, one or 
more consumers are executed in order to build the requested output 
(annotated original, annotated extract, annotations only, XMI, etc.).
Terry Heinze
Thomson Corporation R&D
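
For readers following the thread, the ordered configuration file described 
above might be modelled roughly as follows. This is only a sketch under 
stated assumptions: the element and attribute names (pipeline, reader, 
engine, consumer, param, descriptor) and the classes ComponentSpec and 
PipelineConfig are hypothetical, chosen for illustration. They are neither 
the CPE descriptor schema nor the actual format used in the system 
described. The sketch uses only the JDK's XML parser and stops short of 
instantiating any UIMA components.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** One pipeline step: its role, its descriptor path, and parameter overrides. */
final class ComponentSpec {
    final String role;                    // "reader", "engine", or "consumer" (hypothetical names)
    final String descriptor;              // path to the component descriptor
    final Map<String, String> overrides;  // parameter name -> replacement value

    ComponentSpec(String role, String descriptor, Map<String, String> overrides) {
        this.role = role;
        this.descriptor = descriptor;
        this.overrides = overrides;
    }
}

/** Parses a hypothetical pipeline configuration, preserving document order. */
final class PipelineConfig {

    static List<ComponentSpec> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<ComponentSpec> specs = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                if (!(children.item(i) instanceof Element)) {
                    continue; // skip whitespace and comments between elements
                }
                Element component = (Element) children.item(i);
                // Collect the <param name="..."> overrides for this component.
                Map<String, String> overrides = new LinkedHashMap<>();
                NodeList params = component.getElementsByTagName("param");
                for (int j = 0; j < params.getLength(); j++) {
                    Element param = (Element) params.item(j);
                    overrides.put(param.getAttribute("name"), param.getTextContent().trim());
                }
                specs.add(new ComponentSpec(component.getTagName(),
                        component.getAttribute("descriptor"), overrides));
            }
            return specs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid pipeline configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<pipeline>"
                + "<reader descriptor=\"HtmlReader.xml\"/>"
                + "<engine descriptor=\"Tagger.xml\"><param name=\"model\">en.bin</param></engine>"
                + "<consumer descriptor=\"XmiWriter.xml\"/>"
                + "</pipeline>";
        // Print each step in the order it would be instantiated.
        for (ComponentSpec spec : parse(xml)) {
            System.out.println(spec.role + " " + spec.descriptor + " " + spec.overrides);
        }
    }
}
```

A processing manager along these lines would walk the returned list in 
order, apply each set of overrides to the corresponding descriptor, and 
only then create the components.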

That sounds pretty powerful.  I know there are other people out there
who have worked on, or are working on, systems that allow for dynamic
configuration of processing chains.  If more people find this useful,
maybe it should be in UIMA out of the box?

Question: are you considering supporting binary formats such as PDF via,
e.g., Apache Tika?

--Thilo
