I'm trying to decide what to use as my primary format in the CAS, plain text or HTML.
I realize that any content (for example, HTML bytes in some encoding) can be stored in the CAS by passing a ByteArray to JCas.setSofaDataArray() and setting the MIME type to indicate what it is. However, only annotators that know about that view and how to handle those bytes would be usable in an aggregate analysis engine.

But I'm not building a closed system where I know all the annotators. I'm building a generic platform that can run an arbitrary AAE with annotators from a variety of unknown sources: perhaps annotators from GATE or OpenNLP, downloaded from CMU, or bought from a vendor. These annotators would not know about my HTML view, and would fail to find anything to process.

It appears that the only thing an annotator can count on in the CAS is the String returned from JCas.getDocumentText(). I think this is intended to hold plain text, not HTML text. I'm guessing that plain text is what annotators from GATE, OpenNLP, etc. assume they will find there. If I were to call setDocumentText() with some HTML, they probably wouldn't like it.

But HTML has so much information that is useful for NLP processing. For example, suppose I have two adjacent cells in a row of a table, the first containing "1997" and the second "Honda Accord", and I want to run named entity extraction on the document. With the HTML boundaries, I would see that they are in different cells, and produce two entities, YEAR "1997" and VEHICLE "Honda Accord". However, if I parse the HTML and convert it to plain text, then I might extract a single entity, VEHICLE "1997 Honda Accord". These are very different results.

How can one make use of the HTML information and still use off-the-shelf annotators? My annotators can handle both plain text and HTML, but do better with HTML. If I put HTML in the CAS, then it appears that I will only be able to use my annotators and no others in the world. I think this defeats the purpose of using UIMA in the first place.

Am I missing something? Can I have my cake and eat it too (arbitrary annotators AND quality extraction)?

Greg Holmberg
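P.S. To make the table example concrete, here is a minimal sketch in plain Java (no UIMA involved) of how flattening a row to text loses the cell boundary. The class name HtmlFlattenDemo and the methods flatten() and cellTexts() are made up for illustration, and the regex-based tag handling is deliberately naive; a real system would use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from UIMA.
class HtmlFlattenDemo {

    // Matches any HTML tag. Deliberately naive: ignores comments, scripts, entities.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Matches the content of one table cell. Assumes cells are not nested.
    private static final Pattern CELL =
            Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replace every tag with a space, then collapse runs of whitespace.
    static String flatten(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Return the flattened text of each <td> cell separately.
    static List<String> cellTexts(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(flatten(m.group(1)));
        }
        return cells;
    }

    public static void main(String[] args) {
        String row = "<tr><td>1997</td><td>Honda Accord</td></tr>";
        System.out.println(flatten(row));    // 1997 Honda Accord
        System.out.println(cellTexts(row));  // [1997, Honda Accord]
    }
}
```

The flattened string is what a plain-text annotator would see, with nothing to tell it that "1997" and "Honda Accord" came from different cells. The per-cell list is the structure my HTML-aware annotators rely on.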
