Hi,

I contributed an annotator to the sandbox some time ago that uses Tika to convert the original markup into UIMA annotations. It does not seem to be listed on the website, but it should be in the sandbox SVN repository.
Tika supports numerous formats such as PDF, XML, HTML, etc.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/5/21 Greg Holmberg <[email protected]>

> On Tue, 19 May 2009 15:04:28 -0700, Eddie Epstein <[email protected]>
> wrote:
>
>> Since your original proposal back in 2007 there has been a growing
>> effort to add annotators to the project. Do you have any components
>> that use the proposed document model type system, say a collection
>> reader, that you would be willing to submit?
>>
>
> I did use this schema in a prototype. I used the StAX parser to convert
> XML to this annotation structure over plain text. Since the proposed schema
> loses no XML information, the XML can be reproduced from the CAS, if
> desired. Not byte-for-byte, since carriage returns may come out differently,
> but certainly functionally equivalent XML.
>
> HTML was first cleaned up with HTMLCleaner, converted to XML (XHTML), and
> then sent through the StAX parser and into the CAS.
>
> For other formats, I used a commercial filtering product to convert PDF,
> Office, etc. to HTML, and then sent the result through the above process.
>
> An open-source solution to filtering binary formats could use Aperture to
> produce XML+RDF, and then send that through the above process.
>
> The annotators I used didn't understand the CAS, only HTML, so I had to
> keep the HTML in addition to the CAS to feed to those annotators. The offsets
> these annotators returned were then relative to the HTML, so I kept a map of
> offset ranges between the HTML and the plain text in the CAS. This let me
> translate the offsets returned from the annotators against the HTML into
> offsets against the CAS, so when I created annotations they pointed to the
> right place.
>
> So I can't contribute the commercial filter code (we don't have source code
> anyway). I may be able to contribute the XML and HTML converters, since
> that code was never shipped as a product. However, it will require approval
> from some EVP three levels above me. I will look into it, but don't hold
> your breath.
>
>
>
> Greg Holmberg
>
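
For readers following the thread, the offset bookkeeping Greg describes (keeping a map of offset ranges between the HTML and the plain text, then translating annotator offsets through it) could look roughly like the sketch below. This is not Greg's code, which was never posted. It is written in Python for brevity rather than against the UIMA API, the names `strip_tags` and `to_text_offset` are hypothetical, and the regex tag matcher is a stand-in for a real parser such as StAX. It also assumes the markup contains no character entities, which a real converter would have to decode and account for in the map.

```python
import bisect
import re

# Naive tag matcher (assumption: no '>' inside attribute values, no entities).
TAG = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Return (plain_text, ranges).

    Each range is (markup_start, text_start, length) and describes one run of
    text that was copied unchanged from the markup into the plain text.
    """
    pieces, ranges = [], []
    src_pos = text_pos = 0
    for match in TAG.finditer(markup):
        run = markup[src_pos:match.start()]
        if run:
            ranges.append((src_pos, text_pos, len(run)))
            pieces.append(run)
            text_pos += len(run)
        src_pos = match.end()  # skip over the tag itself
    tail = markup[src_pos:]
    if tail:
        ranges.append((src_pos, text_pos, len(tail)))
        pieces.append(tail)
    return "".join(pieces), ranges


def to_text_offset(ranges, markup_offset):
    """Translate an offset in the markup into an offset in the plain text."""
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, markup_offset) - 1
    if i < 0:
        return 0  # offset lies before the first text run
    src_start, text_start, length = ranges[i]
    # An offset that falls inside a tag clamps to the end of the previous run.
    return text_start + min(markup_offset - src_start, length)


html = "<p>Hello <b>world</b>!</p>"
text, ranges = strip_tags(html)

# Suppose an HTML-only annotator reports a span at [12, 17) in the markup.
begin, end = to_text_offset(ranges, 12), to_text_offset(ranges, 17)
print(text[begin:end])
```

The translated `begin` and `end` are what would be stored on an annotation over the plain-text document, so the annotation points at the same characters the HTML-based annotator saw.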
