On Tue, 19 May 2009 15:04:28 -0700, Eddie Epstein <[email protected]>
wrote:
> Since your original proposal back in 2007 there has been a growing
> effort to add annotators to the project. Do you have any components
> that use the proposed document model type system, say a collection
> reader, that you would be willing to submit?
I did use this schema in a prototype. I used the StAX parser to convert
XML to this annotation structure over plain text. Since the proposed
schema loses no XML information, the XML can be reproduced from the CAS,
if desired. Not byte-for-byte, since carriage returns may come out
differently, but certainly functionally equivalent XML.
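To make the idea concrete, here is a minimal sketch of that conversion. This is not the proposed UIMA type system or my actual StAX code (which was Java); it is just a Python illustration of the same approach: stream the XML, accumulate the character data as plain document text, and record each element as a standoff annotation (name, attributes, begin, end) so that equivalent XML could be regenerated later.

```python
# Sketch only: stream XML into plain text plus standoff annotations.
# The real prototype used Java StAX and the proposed UIMA type system;
# this uses Python's expat parser to show the same idea.
import xml.parsers.expat

def xml_to_standoff(xml_doc):
    """Return (text, annotations); each annotation is a
    (tag, attrs, begin, end) tuple with offsets into the plain text."""
    pieces = []        # plain-text fragments, in document order
    pos = 0            # current character offset into the plain text
    open_stack = []    # (tag, attrs, begin) for elements not yet closed
    annotations = []   # finished (tag, attrs, begin, end) tuples

    parser = xml.parsers.expat.ParserCreate()

    def start(tag, attrs):
        open_stack.append((tag, attrs, pos))

    def end(tag):
        name, attrs, begin = open_stack.pop()
        annotations.append((name, attrs, begin, pos))

    def chars(data):
        nonlocal pos
        pieces.append(data)
        pos += len(data)

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_doc, True)
    return "".join(pieces), annotations

text, anns = xml_to_standoff("<doc><p>Hello <b>world</b>!</p></doc>")
# text is "Hello world!"; anns includes ("b", {}, 6, 11), etc.
```

Regenerating XML is then just a matter of walking the text and re-emitting tags at the recorded offsets, which is why the round trip is functionally (if not byte-for-byte) lossless.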
HTML was first cleaned up with HTMLCleaner, converted to XML (XHTML), and
then sent through the StAX parser and into the CAS.
For other formats, I used a commercial filtering product to convert PDF,
Office documents, etc. to HTML, and then ran the result through the
process above.
An open-source alternative for filtering binary formats could use
Aperture to produce XML+RDF, and then run the result through the same
process.
The annotators I used didn't understand the CAS, only HTML, so I had to
keep the HTML in addition to the CAS to feed to those annotators. The
offsets these annotators returned were therefore relative to the HTML, so
I kept a map of offset ranges between the HTML and the plain text in the
CAS. This let me translate the offsets the annotators returned against
the HTML into offsets against the CAS, so that the annotations I created
pointed to the right place.
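That offset-range map can be sketched as follows. This is an illustration, not my original code: assume anchor pairs (html_offset, text_offset) were recorded at each point where text resumes after markup; between anchors the two streams advance in lockstep, so a binary search translates any HTML offset into a CAS text offset.

```python
# Sketch only: translate annotator offsets (relative to HTML) into
# offsets relative to the CAS plain text, using recorded anchor pairs.
import bisect

def make_mapper(anchors):
    """anchors: sorted (html_offset, text_offset) pairs, one per point
    where character data resumes after a run of markup."""
    html_offsets = [h for h, _ in anchors]

    def html_to_text(h):
        # Find the last anchor at or before h; within a text run the
        # two streams move together, so the delta carries over directly.
        i = bisect.bisect_right(html_offsets, h) - 1
        html_base, text_base = anchors[i]
        return text_base + (h - html_base)

    return html_to_text

# For "<p>Hello <b>world</b>!</p>" -> "Hello world!":
# "Hello " starts at HTML 3 / text 0, "world" at 12 / 6, "!" at 21 / 11.
to_text = make_mapper([(3, 0), (12, 6), (21, 11)])
```

With this, an annotation the HTML-based annotator reports at HTML offsets 12..17 ("world") lands at CAS offsets 6..11, which is the right place in the plain text.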
So I can't contribute the commercial filter code (we don't have source
code anyway). I may be able to contribute the XML and HTML converters,
since that code was never shipped as a product. However, it will require
approval from some EVP three levels above me. I will look into it, but
don't hold your breath.
Greg Holmberg