HTML and XML are inline annotations. Mixing them with the original text in the
CAS is, in my opinion, a bad idea. Annotators should all be developed against a
plain-text CAS, yet be aware of any existing annotations in order to improve performance.


Jing Ding, PhD
Senior Systems Consultant
Information Warehouse
Ohio State University Medical Center
640 Ackerman Road, C#1-150
PO Box 183111
Columbus, OH 43218-3111
Phone: (614) 293-0776
Fax: (614) 293-2210
[EMAIL PROTECTED]



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 26, 2007 2:04 PM
To: [email protected]; [email protected]
Cc: Marshall Schor
Subject: Re: plain text or HTML in the CAS?


 -------------- Original message ----------------------
From: Marshall Schor <[EMAIL PROTECTED]>
> Have you considered using multiple subjects of analysis (Sofas)?  In 
> this scenario, you could have one Sofa which was the HTML, and run 
> annotators on it that know about HTML markup, and how to usefully 
> interpret it.
> 
> In another Sofa, you could have de-tagged HTML, and run annotators that 
> want to work with that kind of input.
> 
> What this doesn't provide, I think, is a solution for the scenario you 
> posit below, where words are in different cells of a table, and you want 
> this to somehow translate to input to non-HTML-aware annotators that 
> these words are not "close together" for purposes of recognizing named 
> entities, for instance.
> 
> That's an interesting problem: how to take annotators designed to 
> process plain text streams, and make them operate well using additional 
> knowledge they weren't designed to consume.  One really silly approach 
> could be to generate a text Sofa for these annotators, and insert 
> artificial words between things which should be "separated" - the 
> artificial words could be designed so that downstream processes could 
> eliminate them.

I have thought about that possibility--having both HTML and plain text
available in the CAS, so that annotators that know about the HTML (i.e. mine)
can use it to produce better results, while off-the-shelf annotators use the
plain text.
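
(For concreteness, the two-Sofa setup I have in mind would look roughly like the
sketch below.  The view names "html" and "text" are just placeholders I chose;
the HTML-aware annotators would be mapped to the "html" view in the aggregate
descriptor, and everything else to "text".)

    // Rough sketch of the two-Sofa idea; view names are arbitrary.
    import org.apache.uima.cas.CASException;
    import org.apache.uima.jcas.JCas;

    public class TwoSofaSetup {
        public static void populate(JCas jcas, String html, String detagged) throws CASException {
            // One view holds the raw HTML for HTML-aware annotators...
            JCas htmlView = jcas.createView("html");
            htmlView.setSofaDataString(html, "text/html");

            // ...and a second view holds the de-tagged text for off-the-shelf annotators.
            JCas textView = jcas.createView("text");
            textView.setSofaDataString(detagged, "text/plain");
        }
    }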

I can even improve the plain text processing (although not to the quality
level of extraction that direct HTML processing can provide) by inserting
characters or sequences of characters into the plain text to indicate
boundaries.  This is similar to what you are suggesting.  So, for example,
while my HTML detagger would normally just concatenate all the text together,
I can instead insert an end-of-paragraph marker (probably one or two
newlines) after each cell in a table.  This creates the boundaries that would
let even a plain text annotator see cells and produce entities "1997" and
"Honda Accord" instead of a single (incorrect) "1997 Honda Accord" entity.

I have actually implemented this, and it works pretty well for plain text
processing.  The problem is that while it is better, it is not as good as
HTML processing.  There is still information in the HTML for which there is
no equivalent in plain text.  So I can't move all my annotators to plain
text, as they would lose extraction quality.

That leaves me with a mix of HTML and plain text annotators, annotating
different artifacts.  The problem with that is I can't compare the
annotations.  A plain text annotator and an HTML annotator may have annotated
the same logical text (same word, for example), but I have no way of
determining that.  So that means I can't answer questions that require both
annotations.

For example, say one annotator is an entity annotator that works on HTML and
the other is a geography annotator that works on plain text.  I want to find
pairs of cities mentioned in a document that are less than 50 miles apart.  I
need the first annotator to find the city entities, and the second annotator
to give me coordinates.  They may each annotate the text "New York City", but
I can't know that because the offsets are very different, since they are
against different artifacts.

Similarly, indexing these annotations into a search engine (Juru, for
example) will not allow me to make queries that join two annotations that are
logically against the same token, because their offsets are different and so
they don't appear to cover the same token.

One solution I thought of is to continue with plain text as the subject of
analysis, and convert the HTML information into annotations.  For example, I
might have Element annotations that reflect an XML element:

Type: ElementAnnotation (supertype: Annotation)

   Feature            Range Type           Element Type
   attributes         FSArray              AttributeFS
   children           FSArray              ElementAnnotation
   name               String
   parent             ElementAnnotation
   qualifiedName      String
   uri                String

Then use a SAX-based HTML parser like Neko to parse the HTML and generate
these annotations.  This would work for XML too.  Or really anything that has
document structure can be translated to this.
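
(Something like the sketch below is what I picture for that step.  It assumes
ElementAnnotation is the JCas cover class generated with JCasGen from a
definition like the one above, and it leaves out the parent and attributes
features for brevity; Neko's SAX parser would drive the handler.)

    // Sketch only: a SAX handler that builds the plain-text Sofa while recording
    // element boundaries as ElementAnnotation instances with offsets into that
    // plain text.  ElementAnnotation is assumed to be a generated JCas class.
    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.uima.jcas.JCas;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class ElementAnnotationHandler extends DefaultHandler {
        private final JCas jcas;
        private final StringBuilder text = new StringBuilder();
        private final Deque<ElementAnnotation> open = new ArrayDeque<>();

        public ElementAnnotationHandler(JCas jcas) {
            this.jcas = jcas;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            ElementAnnotation el = new ElementAnnotation(jcas);
            el.setBegin(text.length());      // offset into the plain text built so far
            el.setName(localName);
            el.setQualifiedName(qName);
            el.setUri(uri);
            open.push(el);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            ElementAnnotation el = open.pop();
            el.setEnd(text.length());
            el.addToIndexes();
        }

        @Override
        public void endDocument() {
            jcas.setDocumentText(text.toString());
        }
    }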

Now I can regenerate the HTML from these annotations in annotators that
desire HTML.
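
(That regeneration could be as simple as walking the ElementAnnotation tree and
re-emitting tags around the covered text, roughly as in the untested sketch
below; attributes are dropped for brevity, and children is assumed to hold the
direct child elements in document order.)

    // Sketch of regenerating markup from ElementAnnotations for an HTML-aware annotator.
    public class MarkupRegenerator {

        public static String serialize(ElementAnnotation el, String docText) {
            StringBuilder out = new StringBuilder();
            out.append('<').append(el.getName()).append('>');
            int pos = el.getBegin();
            if (el.getChildren() != null) {
                for (int i = 0; i < el.getChildren().size(); i++) {
                    ElementAnnotation child = (ElementAnnotation) el.getChildren().get(i);
                    out.append(docText, pos, child.getBegin());  // plain text before this child
                    out.append(serialize(child, docText));       // the child element itself
                    pos = child.getEnd();
                }
            }
            out.append(docText, pos, el.getEnd());               // trailing text inside the element
            out.append("</").append(el.getName()).append('>');
            return out.toString();
        }
    }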

Here's the problem: the offsets coming back from such an annotator would be
against the HTML, and I need to translate the offset values so they work
against the plain text in the CAS.

I can't think of a way to do that--does anyone have any ideas?

If I can't do that, then I will have to create a platform that uses HTML text
in the CAS when all the annotators understand HTML, but "degenerates" to
plain text when off-the-shelf annotators are used.  This complicates both my
platform and my annotators, of course, and lowers the quality of extraction
in plain-text mode.

Longer term, I'd like to modify the NLP components I'm using to understand
these ElementAnnotations directly, bypassing the HTML but getting the same
benefits.  They would then return offsets against the plain text, and I
wouldn't have the translation problem.

What I'd like to see happen in the community would be an agreement on
something like the ElementAnnotation above as a standard type definition and
with expected (required?) instances in UIMA.  Existing plain text annotators
would continue to work, while more up-to-date annotators would use the
ElementAnnotations and the plain text.

So, are there any annotator suppliers out there who'd like to work on Type
System standards with me?  The lack of them seems to be breaking the promise
of UIMA as an integration platform.  The lowest common denominator data
representations in UIMA seem a bit too low from my point of view.  UIMA's
mechanism is great--now we need some policy.


Greg Holmberg


> 
> -Marshall
> 
> [EMAIL PROTECTED] wrote:
> > I'm trying to decide what to use as my primary format in the CAS, plain text or HTML.
> >
> > I realize that any content (for example, HTML bytes in some encoding) can be stored in the CAS using a ByteArray in JCas.setSofaDataArray() and setting the MIME type to indicate what it is.  However, only annotators that knew about that view and how to handle those bytes would be usable in an aggregate analysis engine.
> >
> > But I'm not building a closed system where I know all the annotators, I'm building a generic platform that can run an arbitrary AAE with annotators from a variety of unknown sources.  Perhaps annotators from GATE, OpenNLP, downloaded from CMU, or bought from a vendor.  In which case, these annotators would not know about my HTML view, and would fail to find anything to process.
> >
> > It appears that the only thing an annotator can count on in the CAS is the String returned from JCas.getDocumentText().  I think this is intended to hold plain text, not HTML text.  I'm guessing that plain text is what annotators from GATE, OpenNLP, etc. assume they will find there.  If I were to setDocumentText() with some HTML, they probably wouldn't like it.
> >
> > But HTML has so much useful information for NLP processing.  For example, suppose I have two cells adjacent in a row of a table, the first containing "1997" and the second "Honda Accord", and I want to run named entity extraction on the document.  With the HTML boundaries, I would see they are in different cells, and produce two entities, YEAR "1997" and VEHICLE "Honda Accord".  However, if I parse the HTML and convert it to plain text, then I might extract a single entity, VEHICLE "1997 Honda Accord".  These are very different results.
> >
> > How can one make use of the HTML information and still use off-the-shelf annotators?
> >
> > My annotators can handle both plain text and HTML, but do better with HTML.  If I put HTML in the CAS, then it appears that I will only be able to use my annotators and no others in the world.  I think this defeats the purpose of using UIMA in the first place.
> >
> > Am I missing something?  Can I have my cake and eat it too?  (arbitrary annotators AND quality extraction)
> >
> >
> > Greg Holmberg
> >
> >
> 
