On Tue, 19 May 2009 15:04:28 -0700, Eddie Epstein <[email protected]>
wrote:
> Since your original proposal back in 2007 there has been a growing
> effort to add annotators to the project. Do you have any components
> that use the proposed document model type system, say a collection
> reader, that you would be willing to submit?
I did use this schema in a prototype. I used the StAX parser to convert
XML to this annotation structure over plain text. Since the proposed
schema loses no XML information, the XML can be reproduced from the CAS,
if desired. Not byte-for-byte, since carriage returns may come out
differently, but certainly functionally equivalent XML.
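To make the idea concrete, here is a minimal sketch of that conversion. This is not the proposed UIMA type system or my actual StAX code (which was Java); it is just a Python illustration of the same approach: stream the XML, accumulate the character data as plain document text, and record each element as a standoff annotation (name, attributes, begin, end) so that equivalent XML could be regenerated later.

```python
# Sketch only: stream XML into plain text plus standoff annotations.
# The real prototype used Java StAX and the proposed UIMA type system;
# this uses Python's expat parser to show the same idea.
import xml.parsers.expat

def xml_to_standoff(xml_doc):
    """Return (text, annotations); each annotation is a
    (tag, attrs, begin, end) tuple with offsets into the plain text."""
    pieces = []        # plain-text fragments, in document order
    pos = 0            # current character offset into the plain text
    open_stack = []    # (tag, attrs, begin) for elements not yet closed
    annotations = []   # finished (tag, attrs, begin, end) tuples

    parser = xml.parsers.expat.ParserCreate()

    def start(tag, attrs):
        open_stack.append((tag, attrs, pos))

    def end(tag):
        name, attrs, begin = open_stack.pop()
        annotations.append((name, attrs, begin, pos))

    def chars(data):
        nonlocal pos
        pieces.append(data)
        pos += len(data)

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_doc, True)
    return "".join(pieces), annotations

text, anns = xml_to_standoff("<doc><p>Hello <b>world</b>!</p></doc>")
# text is "Hello world!"; anns includes ("b", {}, 6, 11), etc.
```

Regenerating XML is then just a matter of walking the text and re-emitting tags at the recorded offsets, which is why the round trip is functionally (if not byte-for-byte) lossless.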
HTML was first cleaned up with HTMLCleaner, converted to XML (XHTML), and
then sent through the StAX parser and into the CAS.
For other formats, I used a commercial filtering product to convert PDF,
Office documents, etc. to HTML, and then ran the result through the
process above.
An open-source alternative for filtering binary formats could use
Aperture to produce XML+RDF, and then run the result through the same
process.
The annotators I used didn't understand the CAS, only HTML, so I had to
keep the HTML in addition to the CAS to feed to those annotators. The
offsets these annotators returned were therefore relative to the HTML, so
I kept a map of offset ranges between the HTML and the plain text in the
CAS. This let me translate the offsets the annotators returned against
the HTML into offsets against the CAS, so that the annotations I created
pointed to the right place.
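That offset-range map can be sketched as follows. This is an illustration, not my original code: assume anchor pairs (html_offset, text_offset) were recorded at each point where text resumes after markup; between anchors the two streams advance in lockstep, so a binary search translates any HTML offset into a CAS text offset.

```python
# Sketch only: translate annotator offsets (relative to HTML) into
# offsets relative to the CAS plain text, using recorded anchor pairs.
import bisect

def make_mapper(anchors):
    """anchors: sorted (html_offset, text_offset) pairs, one per point
    where character data resumes after a run of markup."""
    html_offsets = [h for h, _ in anchors]

    def html_to_text(h):
        # Find the last anchor at or before h; within a text run the
        # two streams move together, so the delta carries over directly.
        i = bisect.bisect_right(html_offsets, h) - 1
        html_base, text_base = anchors[i]
        return text_base + (h - html_base)

    return html_to_text

# For "<p>Hello <b>world</b>!</p>" -> "Hello world!":
# "Hello " starts at HTML 3 / text 0, "world" at 12 / 6, "!" at 21 / 11.
to_text = make_mapper([(3, 0), (12, 6), (21, 11)])
```

With this, an annotation the HTML-based annotator reports at HTML offsets 12..17 ("world") lands at CAS offsets 6..11, which is the right place in the plain text.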
So I can't contribute the commercial filter code (we don't have source
code anyway). I may be able to contribute the XML and HTML converters,
since that code was never shipped as a product. However, it will require
approval from some EVP three levels above me. I will look into it, but
don't hold your breath.
Greg Holmberg