Have you considered using multiple subjects of analysis (Sofas)? In
this scenario, you could have one Sofa holding the HTML, and run
annotators on it that understand HTML markup and how to interpret it
usefully.
In another Sofa, you could have de-tagged HTML, and run annotators that
want to work with that kind of input.
What this doesn't provide, I think, is a solution for the scenario you
posit below, where words are in different cells of a table, and you want
this to somehow translate to input to non-HTML-aware annotators that
these words are not "close together" for purposes of recognizing named
entities, for instance.
That's an interesting problem: how to take annotators designed to
process plain text streams, and make them operate well using additional
knowledge they weren't designed to consume. One really silly approach
could be to generate a text Sofa for these annotators, and insert
artificial words between things which should be "separated" - the
artificial words could be designed so that downstream processes could
eliminate them.
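A minimal sketch of that artificial-separator idea, in Python rather than in UIMA's Java API, using only the standard library (the sentinel token and class name are made up for illustration):

```python
from html.parser import HTMLParser

SEPARATOR = "zzqseparatorzzq"  # hypothetical sentinel "word"

class SeparatingDetagger(HTMLParser):
    """De-tags HTML, inserting a sentinel word at cell and row
    boundaries so plain-text annotators do not treat the contents
    of adjacent cells as contiguous."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

    def handle_endtag(self, tag):
        # Cell and row boundaries become artificial "words".
        if tag in ("td", "th", "tr"):
            self.parts.append(SEPARATOR)

    def detag(self, html):
        self.parts = []
        self.feed(html)
        return " ".join(self.parts)

text = SeparatingDetagger().detag(
    "<table><tr><td>1997</td><td>Honda Accord</td></tr></table>")
print(text)  # 1997 zzqseparatorzzq Honda Accord zzqseparatorzzq zzqseparatorzzq
```

A real implementation would pick a sentinel guaranteed not to occur in the corpus, and would record the sentinel offsets so a downstream step could strip the tokens and remap annotation spans back onto the original text.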
-Marshall
[EMAIL PROTECTED] wrote:
I'm trying to decide what to use as my primary format in the CAS, plain text or
HTML.
I realize that any content (for example, HTML bytes in some encoding) can be
stored in the CAS using a ByteArray in JCas.setSofaDataArray() and setting the
MIME type to indicate what it is. However, only annotators that knew about
that view and how to handle those bytes would be usable in an aggregate
analysis engine.
But I'm not building a closed system where I know all the annotators; I'm
building a generic platform that can run an arbitrary AAE with annotators from
a variety of unknown sources: perhaps annotators from GATE or OpenNLP,
downloaded from CMU, or bought from a vendor. In that case, these annotators
would not know about my HTML view, and would fail to find anything to process.
It appears that the only thing an annotator can count on in the CAS is the
String returned from JCas.getDocumentText(). I think this is intended to hold
plain text, not HTML. I'm guessing that plain text is what annotators from
GATE, OpenNLP, etc. assume they will find there. If I were to call
setDocumentText() with some HTML, they probably wouldn't like it.
But HTML carries so much useful information for NLP processing. For example,
suppose I have two adjacent cells in a table row, the first containing "1997"
and the second "Honda Accord", and I want to run named entity extraction on
the document. With the HTML cell boundaries, I would see that they are in
different cells, and produce two entities, YEAR "1997" and VEHICLE "Honda
Accord". However, if I parse the HTML and convert it to plain text, I might
extract a single entity, VEHICLE "1997 Honda Accord". These are very
different results.
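The merging effect described above is easy to reproduce with a naive de-tagger; a minimal sketch, using only the Python standard library (the class name is invented for illustration):

```python
from html.parser import HTMLParser

class NaiveDetagger(HTMLParser):
    """Drops all markup and keeps only the text content."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

d = NaiveDetagger()
d.feed("<table><tr><td>1997</td><td>Honda Accord</td></tr></table>")
print(" ".join(d.parts))  # 1997 Honda Accord
```

The cell boundary is gone, so a plain-text named-entity annotator sees "1997 Honda Accord" as one contiguous span.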
How can one make use of the HTML information and still use off-the-shelf
annotators?
My annotators can handle both plain text and HTML, but do better with HTML. If
I put HTML in the CAS, then it appears that I will only be able to use my
annotators and no others in the world. I think this defeats the purpose of
using UIMA in the first place.
Am I missing something? Can I have my cake and eat it too? (arbitrary
annotators AND quality extraction)
Greg Holmberg