Have you considered using multiple subjects of analysis (Sofas)? In
this scenario, you could have one Sofa holding the HTML, and run
annotators on it that understand HTML markup and how to interpret it
usefully.
In another Sofa, you could have de-tagged HTML, and run annotators that
want to work with that kind of input.
What this doesn't provide, I think, is a solution for the scenario you
posit below, where words are in different cells of a table, and you want
this to somehow translate to input to non-HTML-aware annotators that
these words are not "close together" for purposes of recognizing named
entities, for instance.
That's an interesting problem: how to take annotators designed to
process plain text streams, and make them operate well using additional
knowledge they weren't designed to consume. One really silly approach
could be to generate a text Sofa for these annotators, and insert
artificial words between things which should be "separated" - the
artificial words could be designed so that downstream processes could
eliminate them.
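A minimal sketch of that artificial-separator idea, in Python rather than in UIMA's Java API, using only the standard library (the sentinel token and class name are made up for illustration):

```python
from html.parser import HTMLParser

SEPARATOR = "zzqseparatorzzq"  # hypothetical sentinel "word"

class SeparatingDetagger(HTMLParser):
    """De-tags HTML, inserting a sentinel word at cell and row
    boundaries so plain-text annotators do not treat the contents
    of adjacent cells as contiguous."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

    def handle_endtag(self, tag):
        # Cell and row boundaries become artificial "words".
        if tag in ("td", "th", "tr"):
            self.parts.append(SEPARATOR)

    def detag(self, html):
        self.parts = []
        self.feed(html)
        return " ".join(self.parts)

text = SeparatingDetagger().detag(
    "<table><tr><td>1997</td><td>Honda Accord</td></tr></table>")
print(text)  # 1997 zzqseparatorzzq Honda Accord zzqseparatorzzq zzqseparatorzzq
```

A real implementation would pick a sentinel guaranteed not to occur in the corpus, and would record the sentinel offsets so a downstream step could strip the tokens and remap annotation spans back onto the original text.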
-Marshall
[EMAIL PROTECTED] wrote:
I'm trying to decide what to use as my primary format in the CAS, plain text or
HTML.
I realize that any content (for example, HTML bytes in some encoding) can be
stored in the CAS using a ByteArray in JCas.setSofaDataArray() and setting the
MIME type to indicate what it is. However, only annotators that knew about
that view and how to handle those bytes would be usable in an aggregate
analysis engine.
But I'm not building a closed system where I know all the annotators; I'm
building a generic platform that can run an arbitrary AAE with annotators from
a variety of unknown sources: perhaps annotators from GATE or OpenNLP,
downloaded from CMU, or bought from a vendor. In that case, these annotators
would not know about my HTML view, and would fail to find anything to process.
It appears that the only thing an annotator can count on in the CAS is the
String returned from JCas.getDocumentText(). I think this is intended to hold
plain text, not HTML. I'm guessing that plain text is what annotators from
GATE, OpenNLP, etc. assume they will find there. If I were to call
setDocumentText() with some HTML, they probably wouldn't like it.
But HTML carries so much useful information for NLP processing. For example,
suppose I have two adjacent cells in a table row, the first containing "1997"
and the second "Honda Accord", and I want to run named entity extraction on
the document. With the HTML cell boundaries, I would see that they are in
different cells, and produce two entities, YEAR "1997" and VEHICLE "Honda
Accord". However, if I parse the HTML and convert it to plain text, I might
extract a single entity, VEHICLE "1997 Honda Accord". These are very
different results.
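The merging effect described above is easy to reproduce with a naive de-tagger; a minimal sketch, using only the Python standard library (the class name is invented for illustration):

```python
from html.parser import HTMLParser

class NaiveDetagger(HTMLParser):
    """Drops all markup and keeps only the text content."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

d = NaiveDetagger()
d.feed("<table><tr><td>1997</td><td>Honda Accord</td></tr></table>")
print(" ".join(d.parts))  # 1997 Honda Accord
```

The cell boundary is gone, so a plain-text named-entity annotator sees "1997 Honda Accord" as one contiguous span.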
How can one make use of the HTML information and still use off-the-shelf
annotators?
My annotators can handle both plain text and HTML, but do better with HTML. If
I put HTML in the CAS, then it appears that I will only be able to use my
annotators and no others in the world. I think this defeats the purpose of
using UIMA in the first place.
Am I missing something? Can I have my cake and eat it too? (arbitrary
annotators AND quality extraction)
Greg Holmberg