On 4/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> That leaves me with a mix of HTML and plain text annotators, annotating different artifacts. The problem with that is I can't compare the annotations. A plain text annotator and an HTML annotator may have annotated the same logical text (same word, for example), but I have no way of determining that. So that means I can't answer questions that require both annotations.
I have heard of people doing HTML detagging while recording offset mappings, so that they can determine which offsets in the detagged text correspond to which offsets in the original HTML. After analyzing the detagged text, this offset mapping can then be used, for example, to highlight annotated spans in the HTML. I don't have the code for this, though. It sounds like it could be a very useful utility for UIMA users. If you had that kind of information produced by your HTML detagger, would it address your requirements?

-Adam
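
P.S. To make the offset-mapping idea concrete, here is a minimal sketch in Python. It is not existing UIMA code and not what the people I mentioned actually use; the names detag and to_html_span are made up for illustration. It assumes well-formed, simple tags only: it does not decode entities, and it does not handle comments, script/style content, or a ">" inside an attribute value.

```python
def detag(html):
    """Strip <...> tags from html.

    Returns (text, offsets), where offsets[i] is the index in the
    original html of the character text[i].
    Assumption: every '<' starts a tag and the next '>' ends it.
    """
    text_chars = []
    offsets = []
    in_tag = False
    for i, ch in enumerate(html):
        if in_tag:
            if ch == ">":
                in_tag = False
        elif ch == "<":
            in_tag = True
        else:
            text_chars.append(ch)
            offsets.append(i)
    return "".join(text_chars), offsets


def to_html_span(offsets, begin, end):
    """Map a [begin, end) span in the detagged text to a [begin, end) span in the HTML."""
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span must be non-empty and lie inside the detagged text")
    # Use the last covered character rather than offsets[end], so that
    # tags following the span are not pulled into it.
    return offsets[begin], offsets[end - 1] + 1


html = "<p>Hello <b>world</b>!</p>"
text, offsets = detag(html)

# Pretend a plain text annotator marked the word "world" in the detagged text.
begin = text.index("world")
end = begin + len("world")

h_begin, h_end = to_html_span(offsets, begin, end)
print(text)                 # Hello world!
print(html[h_begin:h_end])  # world
```

With both annotators' spans expressed as offsets into the original HTML, annotations on the same word can be compared directly. Note that a span which crosses a tag boundary in the detagged text maps to an HTML span that contains the tag, so highlighting code would still need to deal with that case.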
