plain text or HTML in the CAS?

[EMAIL PROTECTED] Wed, 25 Apr 2007 23:39:11 -0700

I'm trying to decide what to use as my primary format in the CAS, plain text or 
HTML.


I realize that any content (for example, HTML bytes in some encoding) can be 
stored in the CAS by passing a ByteArray to JCas.setSofaDataArray() and setting 
the MIME type to indicate what it is.  However, only annotators that know about 
that view and how to handle those bytes would be usable in an aggregate 
analysis engine.

But I'm not building a closed system where I know all the annotators; I'm 
building a generic platform that can run an arbitrary AAE with annotators from 
a variety of unknown sources: perhaps annotators from GATE or OpenNLP, 
downloaded from CMU, or bought from a vendor.  In that case, these annotators 
would not know about my HTML view, and would fail to find anything to process.

It appears that the only thing an annotator can count on in the CAS is the 
String returned from JCas.getDocumentText().  I think this is intended to hold 
plain text, not HTML text.  I'm guessing that plain text is what annotators 
from GATE, OpenNLP, etc. assume they will find there.  If I were to call 
setDocumentText() with some HTML, they probably wouldn't like it.

But HTML has so much useful information for NLP processing.  For example, 
suppose I have two cells adjacent in a row of a table, the first containing 
"1997" and the second "Honda Accord", and I want to run named entity extraction 
on the document.  With the HTML boundaries, I would see they are in different 
cells, and produce two entities, YEAR "1997" and VEHICLE "Honda Accord".  
However, if I parse the HTML and convert it to plain text, then I might extract 
a single entity, VEHICLE "1997 Honda Accord".  These are very different results.
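
To make the table example concrete, here is a minimal sketch of how the cell 
boundary gets lost.  It uses only the Java standard library, not UIMA; the 
class and method names (CellBoundaryDemo, stripTags, cellTexts) are 
hypothetical, and the regex-based tag handling is an assumption made for 
illustration, not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of UIMA or any annotator library.
final class CellBoundaryDemo {
    // Matches any single tag such as <td> or </tr>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Matches one table cell and captures its inner markup.
    private static final Pattern CELL =
            Pattern.compile("(?is)<td[^>]*>(.*?)</td>");

    // Naive conversion: replace each tag with a space, then collapse
    // runs of whitespace. The cell boundary disappears.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Boundary-preserving view: the text of each <td> cell as its own entry.
    static List<String> cellTexts(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(stripTags(m.group(1)));
        }
        return cells;
    }

    public static void main(String[] args) {
        String html =
                "<table><tr><td>1997</td><td>Honda Accord</td></tr></table>";
        // One undivided string: an extractor sees "1997 Honda Accord".
        System.out.println(stripTags(html));
        // Two separate cells: "1997" and "Honda Accord".
        System.out.println(cellTexts(html));
    }
}
```

With the plain-text form, nothing tells an extractor that "1997" and "Honda 
Accord" came from different cells; with the per-cell form, the boundary is 
still available.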

How can one make use of the HTML information and still use off-the-shelf 
annotators?

My annotators can handle both plain text and HTML, but do better with HTML.  If 
I put HTML in the CAS, then it appears that I will only be able to use my 
annotators and no others in the world.  I think this defeats the purpose of 
using UIMA in the first place.

Am I missing something?  Can I have my cake and eat it too?  (arbitrary 
annotators AND quality extraction)


Greg Holmberg
