-------------- Original message ----------------------
From: Marshall Schor <[EMAIL PROTECTED]>
> Have you considered using multiple subjects of analysis (Sofas)?  In 
> this scenario, you could have one Sofa which was the HTML, and run 
> annotators on it that know about HTML markup, and how to usefully 
> interpret it.
> 
> In another Sofa, you could have de-tagged HTML, and run annotators that 
> want to work with that kind of input.
> 
> What this doesn't provide, I think, is a solution for the scenario you 
> posit below, where words are in different cells of a table, and you want 
> this to somehow translate to input to non-HTML-aware annotators that 
> these words are not "close together" for purposes of recognizing named 
> entities, for instance.
> 
> That's an interesting problem: how to take annotators designed to 
> process plain text streams, and make them operate well using additional 
> knowledge they weren't designed to consume.  One really silly approach 
> could be to generate a text Sofa for these annotators, and insert 
> artificial words between things which should be "separated" - the 
> artificial words could be designed so that downstream processes could 
> eliminate them.

I have thought about that possibility--having both HTML and plain text 
available in the CAS, so that annotators that know about the HTML (i.e., mine) 
can use it to produce better results, while off-the-shelf annotators use the 
plain text.

I can even improve the plain text processing (although not to the quality level 
of extraction that direct HTML processing can provide) by inserting characters 
or sequences of characters into the plain text to indicate boundaries.  This is 
similar to what you are suggesting.  So, for example, while my HTML detagger 
would normally just concatenate all the text together, I can instead insert an 
end-of-paragraph marker (probably one or two newlines) after each cell in a 
table.  This creates the boundaries that would let even a plain text annotator 
see cells and produce entities "1997" and "Honda Accord" instead of a single 
(incorrect) "1997 Honda Accord" entity.
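
Concretely, the detagger is just a SAX handler over NekoHTML that appends a 
blank line after block-level elements.  A minimal sketch of the idea (class 
and method names are hypothetical; this is not my actual code):

    import java.io.StringReader;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    // Detagger that inserts an end-of-paragraph marker after each table
    // cell (and paragraph), so block boundaries survive detagging.
    public class BoundaryDetagger extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String localName, String qName) {
            // NekoHTML reports element names in upper case by default.
            if ("TD".equals(qName) || "TH".equals(qName) || "P".equals(qName)) {
                text.append("\n\n");  // the artificial boundary
            }
        }

        public static String detag(String html) throws Exception {
            BoundaryDetagger handler = new BoundaryDetagger();
            XMLReader parser = new org.cyberneko.html.parsers.SAXParser();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(html)));
            return handler.text.toString();
        }
    }

With that, "1997" and "Honda Accord" come out of the detagger separated by a 
blank line, so a paragraph-aware annotator sees them as separate.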

I have actually implemented this, and it works pretty well for plain text 
processing.  The problem is that while it is better, it is not as good as HTML 
processing.  There is still information in the HTML for which there is no 
equivalent in plain text.  So I can't move all my annotators to plain text, as 
they would lose extraction quality.

That leaves me with a mix of HTML and plain text annotators, annotating 
different artifacts.  The problem with that is I can't compare the 
annotations.  A plain text annotator and an HTML annotator may have annotated 
the same logical text (the same word, for example), but I have no way of 
determining that.  So that means I can't answer questions that require both 
annotations.

For example, say one annotator is an entity annotator that works on HTML and 
the other is a geography annotator that works on plain text.  I want to find 
pairs of cities mentioned in a document that are less than 50 miles apart.  I 
need the first annotator to find the city entities, and the second annotator to 
give me coordinates.  They may each annotate the text "New York City", but I 
can't know that because the offsets are very different, since they are against 
different artifacts.

Similarly, indexing these annotations into a search engine (Juru, for example) 
will not allow me to make queries that join two annotations that are logically 
against the same token because the offsets are different, and they don't appear 
to be against the same token.

One solution I thought of is to continue with plain text as the subject of 
analysis, and convert the HTML information into annotations.  For example, I 
might have Element annotations that reflect an XML element:

Type: ElementAnnotation   (supertype: Annotation)

Feature          Range Type          Element Type
-------          ----------          ------------
attributes       FSArray             AttributeFS
children         FSArray             ElementAnnotation
name             String
parent           ElementAnnotation
qualifiedName    String
uri              String
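
In UIMA terms, that table would translate to a type system description along 
these lines.  A sketch using the UIMA SDK's programmatic API (in practice it 
would live in a descriptor XML file; feature descriptions left empty):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.resource.metadata.TypeDescription;
    import org.apache.uima.resource.metadata.TypeSystemDescription;

    public class ElementAnnotationTypes {
        // Build the ElementAnnotation type description programmatically.
        public static TypeSystemDescription create() {
            TypeSystemDescription tsd = UIMAFramework
                .getResourceSpecifierFactory().createTypeSystemDescription();
            TypeDescription t = tsd.addType("ElementAnnotation", "",
                "uima.tcas.Annotation");
            t.addFeature("attributes", "", "uima.cas.FSArray",
                "AttributeFS", false);
            t.addFeature("children", "", "uima.cas.FSArray",
                "ElementAnnotation", false);
            t.addFeature("name", "", "uima.cas.String");
            t.addFeature("parent", "", "ElementAnnotation");
            t.addFeature("qualifiedName", "", "uima.cas.String");
            t.addFeature("uri", "", "uima.cas.String");
            return tsd;
        }
    }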

Then use a SAX-based HTML parser like Neko to parse the HTML and generate these 
annotations.  This would work for XML too.  Or really anything that has 
document structure can be translated to this.
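
Hand-waving past the attributes and the parent/children wiring, and assuming 
a JCas cover class generated for the ElementAnnotation type above, a single 
SAX pass can build both the plain-text Sofa and the structure annotations:

    import java.io.StringReader;
    import java.util.Stack;
    import org.apache.uima.jcas.JCas;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    // One pass over the HTML builds the de-tagged text and an
    // ElementAnnotation (with plain-text offsets) per element.
    public class StructureParser extends DefaultHandler {
        private final JCas jcas;
        private final StringBuilder text = new StringBuilder();
        private final Stack<Integer> starts = new Stack<Integer>();
        private final Stack<String> names = new Stack<String>();

        public StructureParser(JCas jcas) { this.jcas = jcas; }

        public void startElement(String uri, String local, String qName,
                Attributes atts) {
            starts.push(text.length());  // element begins at current offset
            names.push(qName);
        }

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String local, String qName) {
            ElementAnnotation a = new ElementAnnotation(jcas);
            a.setBegin(starts.pop());
            a.setEnd(text.length());
            a.setName(names.pop());
            a.addToIndexes();  // parent/children/attributes omitted here
        }

        public void parse(String html) throws Exception {
            XMLReader parser = new org.cyberneko.html.parsers.SAXParser();
            parser.setContentHandler(this);
            parser.parse(new InputSource(new StringReader(html)));
            jcas.setDocumentText(text.toString());
        }
    }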

Now I can regenerate the HTML from these annotations for annotators that want 
HTML.
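
Regeneration is then a recursive walk of the element tree, interleaving each 
element's child elements with the plain text that falls between them.  A 
sketch (escape() is a hypothetical entity-escaping helper; attributes are 
again ignored):

    import org.apache.uima.jcas.cas.FSArray;

    public class HtmlRenderer {
        // Each element's begin/end offsets delimit its text; child
        // elements are interleaved with the text between them.
        public static String render(ElementAnnotation e, String text) {
            StringBuilder out = new StringBuilder();
            out.append('<').append(e.getName()).append('>');
            int pos = e.getBegin();
            FSArray kids = e.getChildren();
            for (int i = 0; kids != null && i < kids.size(); i++) {
                ElementAnnotation child = (ElementAnnotation) kids.get(i);
                out.append(escape(text.substring(pos, child.getBegin())));
                out.append(render(child, text));
                pos = child.getEnd();
            }
            out.append(escape(text.substring(pos, e.getEnd())));
            out.append("</").append(e.getName()).append('>');
            return out.toString();
        }

        // Minimal escaping, just enough to keep the sketch self-contained.
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;");
        }
    }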

Here's the problem: the offsets coming back from such an annotator would be 
against the HTML, and I need to translate the offset values so they work 
against the plain text in the CAS.

I can't think of a way to do that--does anyone have any ideas?

If I can't do that, then I will have to create a platform that uses HTML text 
in the CAS when all the annotators understand HTML, but "degenerates" to plain 
text when off-the-shelf annotators are used.  This complicates both my platform 
and my annotators, of course, and lowers the quality of extraction in 
plain-text mode.

Longer term, I'd like to modify the NLP components I'm using to understand 
these ElementAnnotations directly, bypassing the HTML but getting the same 
benefits.  They would then return offsets against the plain text, and I 
wouldn't have the translation problem.

What I'd like to see happen in the community is an agreement on something 
like the ElementAnnotation above as a standard type definition, with expected 
(required?) instances in UIMA.  Existing plain text annotators would continue 
to work, while more up-to-date annotators would use the ElementAnnotations 
along with the plain text.

So, are there any annotator suppliers out there who'd like to work on Type 
System standards with me?  The lack of them seems to be breaking the promise of 
UIMA as an integration platform.  The lowest common denominator data 
representations in UIMA seem a bit too low from my point of view.  UIMA's 
mechanism is great--now we need some policy.


Greg Holmberg


> 
> -Marshall
> 
> [EMAIL PROTECTED] wrote:
> > I'm trying to decide what to use as my primary format in the CAS, plain
> > text or HTML.
> >
> > I realize that any content (for example, HTML bytes in some encoding) can
> > be stored in the CAS using a ByteArray in JCas.setSofaDataArray() and
> > setting the MIME type to indicate what it is.  However, only annotators
> > that knew about that view and how to handle those bytes would be usable
> > in an aggregate analysis engine.
> >
> > But I'm not building a closed system where I know all the annotators; I'm
> > building a generic platform that can run an arbitrary AAE with annotators
> > from a variety of unknown sources.  Perhaps annotators from GATE,
> > OpenNLP, downloaded from CMU, or bought from a vendor.  In which case,
> > these annotators would not know about my HTML view, and would fail to
> > find anything to process.
> >
> > It appears that the only thing an annotator can count on in the CAS is
> > the String returned from JCas.getDocumentText().  I think this is
> > intended to hold plain text, not HTML text.  I'm guessing that plain text
> > is what annotators from GATE, OpenNLP, etc. assume they will find there.
> > If I were to setDocumentText() with some HTML, they probably wouldn't
> > like it.
> >
> > But HTML has so much useful information for NLP processing.  For example,
> > suppose I have two cells adjacent in a row of a table, the first
> > containing "1997" and the second "Honda Accord", and I want to run named
> > entity extraction on the document.  With the HTML boundaries, I would see
> > they are in different cells, and produce two entities, YEAR "1997" and
> > VEHICLE "Honda Accord".  However, if I parse the HTML and convert it to
> > plain text, then I might extract a single entity, VEHICLE "1997 Honda
> > Accord".  These are very different results.
> >
> > How can one make use of the HTML information and still use off-the-shelf
> > annotators?
> >
> > My annotators can handle both plain text and HTML, but do better with
> > HTML.  If I put HTML in the CAS, then it appears that I will only be able
> > to use my annotators and no others in the world.  I think this defeats
> > the purpose of using UIMA in the first place.
> >
> > Am I missing something?  Can I have my cake and eat it too?  (arbitrary
> > annotators AND quality extraction)
> >
> >
> > Greg Holmberg
> >
> 
