Our current tactic for PDF and Word documents is to use third-party software to convert them to XML first. The long-term goal is to encapsulate that behavior in a reader, so that we could call the "pdf" reader and then the "xml" reader in sequence.
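
To make the "pdf" reader followed by "xml" reader idea concrete, here is a rough, self-contained Java sketch. It does not use the UIMA API at all: Doc, FormatReader, PdfReader, XmlReader and runChain are hypothetical names invented for illustration, and the PDF conversion is a stub standing in for whatever third-party converter is actually used.

```java
import java.util.Arrays;
import java.util.List;

public class ReaderChainSketch {

    // Hypothetical stand-in for the document held in the CAS: a format tag plus content.
    static final class Doc {
        final String format;
        final String content;
        Doc(String format, String content) {
            this.format = format;
            this.content = content;
        }
    }

    // Hypothetical reader contract: accept one format, emit another.
    interface FormatReader {
        String accepts();
        Doc read(Doc in);
    }

    // Stub for the third-party PDF-to-XML step; no real PDF parsing happens here.
    static final class PdfReader implements FormatReader {
        public String accepts() { return "pdf"; }
        public Doc read(Doc in) {
            return new Doc("xml", "<doc><body>" + in.content + "</body></doc>");
        }
    }

    // Extracts the text inside <body>; a real reader would use an XML parser or XPath.
    static final class XmlReader implements FormatReader {
        public String accepts() { return "xml"; }
        public Doc read(Doc in) {
            int start = in.content.indexOf("<body>") + "<body>".length();
            int end = in.content.indexOf("</body>");
            return new Doc("text", in.content.substring(start, end));
        }
    }

    // Runs the readers in order and checks that each one receives the format it expects.
    static Doc runChain(List<FormatReader> readers, Doc doc) {
        for (FormatReader reader : readers) {
            if (!reader.accepts().equals(doc.format)) {
                throw new IllegalStateException(
                    "reader for " + reader.accepts() + " got " + doc.format);
            }
            doc = reader.read(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc out = runChain(
            Arrays.asList(new PdfReader(), new XmlReader()),
            new Doc("pdf", "Hello UIMA"));
        System.out.println(out.format + ": " + out.content);
    }
}
```

The format check in runChain is only there to show where a mismatch between readers would surface; in our setup the order would come from the configuration file rather than being hard-coded.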

Terry Heinze

-----Original Message-----
From: Thilo Goetz [mailto:[EMAIL PROTECTED]
Sent: Friday, February 15, 2008 10:23 AM
To: [email protected]
Subject: Re: How to "push" documents into CPE/CollectionReader?

[EMAIL PROTECTED] wrote:
> We've been using UIMA within a web service that processes one document at a time. The single document is sent as a SOAP attachment. Metadata configuring our UIMA components is included within the SOAP message. The service module (which can also be executed as a standalone application) gets a CAS from the pool, loads the document into the CAS, sequentially calls the requested UIMA components, pulls the resulting document(s) from the CAS and returns them to the service caller.
>
> Rather than using a CPE, we wrote our own processing manager. We use an XML configuration file, similar in format to the CPE descriptor. The configuration file lists, in order, the readers, analysis engines, and consumers that need to be instantiated. It also allows the respective descriptors to be overridden. For example, any parameters and resource file names can be changed via the configuration file. The processing manager reads in the descriptors, modifies the contents (if requested), and saves the modified descriptors as an array of InputStreams. The processing manager then creates the UIMA components and configures the CAS pool.
>
> At this point we are not utilizing any of the sofa and type input/output information to validate the order of execution - this processing manager controller is used for well-defined applications.
>
> The processing manager uses the metadata to control which reader(s) and consumer(s) to execute. The web service will accept documents in multiple formats (XML, HTML, text). The getNext() on the relevant reader pulls the document from the CAS and parses it (the metadata may also include XPath or similar instructions on how to find the plain text portions), placing the result into the CAS. After the analysis engines complete the annotating, one or more consumers are executed in order to build the requested output (annotated original, annotated extract, annotations only, XMI, etc.).
>
> Terry Heinze
> Thomson Corporation R&D

That sounds pretty powerful. I know there are other people out there who have worked or are working on systems that allow for dynamic configuration of processing chains. If more people find this useful, maybe it should be in UIMA out of the box?

Question: are you considering supporting binary formats such as PDF via, e.g., Apache Tika?

--Thilo
