Our current tactic for PDF and Word documents is to use third-party
software to convert them to XML first. The long-term goal is to
encapsulate that behavior in a reader. We could then call the "pdf"
reader and then the "xml" reader, in sequence.
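
As a rough illustration of calling readers in sequence, here is a minimal
sketch in plain Java. It does not use the UIMA API: FormatReader,
PdfReader, XmlReader and ReaderChain are hypothetical names invented for
this example, and the PDF step is only a stand-in for the third-party
converter.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical interface (not part of UIMA): each reader turns one
// representation of the document into the next one.
interface FormatReader {
    String read(String input);
}

// Stand-in for the third-party PDF-to-XML conversion. A real reader would
// call the converter here; this stub just wraps the text in XML tags and
// does no escaping.
class PdfReader implements FormatReader {
    public String read(String input) {
        return "<doc><text>" + input + "</text></doc>";
    }
}

// Stand-in for the "xml" reader: strips the tags to leave the plain text.
class XmlReader implements FormatReader {
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public String read(String input) {
        return TAG.matcher(input).replaceAll("");
    }
}

// Runs the readers in the configured order, feeding each reader's output
// to the next one.
class ReaderChain {
    private final List<FormatReader> readers;

    ReaderChain(List<FormatReader> readers) {
        this.readers = readers;
    }

    String run(String document) {
        String current = document;
        for (FormatReader reader : readers) {
            current = reader.read(current);
        }
        return current;
    }
}
```

With these assumptions, new ReaderChain(List.of(new PdfReader(), new
XmlReader())).run(text) passes the document through the "pdf" step and
then the "xml" step, and the reader list could come from the
configuration file.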

Terry Heinze



-----Original Message-----
From: Thilo Goetz [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 15, 2008 10:23 AM
To: [email protected]
Subject: Re: How to "push" documents into CPE/CollectionReader?

[EMAIL PROTECTED] wrote:
> We've been using UIMA within a web service that processes one document
> at a time. The single document is sent as a SOAP attachment. Metadata
> configuring our UIMA components is included within the SOAP message. The
> service module (which can also be executed as a standalone application)
> gets a CAS from the pool, loads the document into the CAS, sequentially
> calls the requested UIMA components, pulls the resulting document(s)
> from the CAS and returns them to the service caller.
> 
> Rather than using a CPE, we wrote our own processing manager. We use an
> XML configuration file, similar in format to the CPE descriptor. The
> configuration file lists, in order, the readers, analysis engines, and
> consumers that need to be instantiated. It also allows for overriding
> the respective descriptors. For example, any parameters and resource
> file names can be changed via the configuration file. The processing
> manager reads in the descriptors, modifies the contents (if requested),
> and saves the modified descriptors as an array of InputStreams. The
> processing manager then creates the UIMA components and configures the
> CAS pool.
> 
> At this point we are not utilizing any of the sofa and type
> input/output information to validate the order of execution - this
> processing manager controller is used for well-defined applications.
> 
> The processing manager uses the metadata to control which reader(s)
> and consumer(s) to execute. The web service will accept documents in
> multiple formats (XML, HTML, text). The getNext() on the relevant reader
> pulls the document from the CAS and parses it (the metadata may also
> include XPath or similar instructions on how to find the plain text
> portions), placing the result into the CAS. After the analysis engines
> complete the annotating, one or more consumers are executed in order to
> build the requested output (annotated original, annotated extract,
> annotations only, XMI, etc.).
> 
> Terry Heinze
> Thomson Corporation R&D

That sounds pretty powerful.  I know there are other people out there
who have/are working on systems that allow for dynamic configuration of
processing chains.  If more people find this useful, maybe it should be
in UIMA out of the box?

Question: are you considering supporting binary formats such as PDF via,
e.g., Apache Tika?

--Thilo
