Re: plain text or HTML in the CAS?

Adam Lally Thu, 26 Apr 2007 11:47:11 -0700

On 4/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

That leaves me with a mix of HTML and plain text annotators, annotating
different artifacts.  The problem with that is I can't compare the annotations.
A plain text annotator and an HTML annotator may have annotated the same
logical text (same word, for example), but I have no way of determining that.
So that means I can't answer questions that require both annotations.


I have heard of people doing HTML detagging and meanwhile recording
offset mappings that will allow them to determine which offsets in the
detagged text correspond to which offsets in the original HTML.  After
analyzing the detagged text, this offset mapping can then be used, for
example, to highlight annotated spans in the HTML.  I don't have the
code for this, though.  It sounds like it could be a very useful
utility for UIMA users.
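
To make the idea concrete, here is a minimal sketch of detagging with an
offset mapping. It is not code from UIMA or from this thread; the function
names (detag_with_offsets, to_html_span) are hypothetical, and the regex-based
tag stripping is a deliberate simplification that ignores comments, scripts,
character entities, and '>' inside attribute values. It records, for every
character kept in the detagged text, its offset in the original HTML, so a
span found by a plain text annotator can be translated back.

```python
import re

# Naive tag pattern (an assumption for illustration): anything between '<' and '>'.
TAG = re.compile(r"<[^>]*>")


def detag_with_offsets(html):
    """Strip tags and return (text, offsets).

    offsets[i] is the index in `html` of the character text[i].
    """
    chars = []
    offsets = []
    pos = 0
    for match in TAG.finditer(html):
        # Keep everything between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = match.end()
    # Keep whatever follows the last tag.
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_html_span(offsets, begin, end):
    """Map a [begin, end) span in the detagged text to a [begin, end) span in the HTML."""
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span out of range")
    return offsets[begin], offsets[end - 1] + 1


html = "<p>Hello <b>world</b>!</p>"
text, offsets = detag_with_offsets(html)

# Pretend a plain text annotator marked the word "world" in the detagged text.
begin = text.index("world")
end = begin + len("world")
h_begin, h_end = to_html_span(offsets, begin, end)
print(text)                  # detagged text
print(html[h_begin:h_end])   # the same word, located in the original HTML
```

A span that crosses a tag boundary in the detagged text maps to an HTML span
that includes the intervening tags, so a highlighter working on the HTML would
still need to decide how to split such a span around markup.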

If you had that kind of information produced by your HTML detagger,
would it address your requirements?

-Adam
