Hi Dave--

The Tika MarkupAnnotator does this.

http://uima.apache.org/sandbox.html#tika.annotator

Greg Holmberg

Hi there,

I would like to create a pipeline that starts with HTML markup. I need
to strip this to plain text, so it can be processed by different
annotators, like POS, chunking, entity detection, etc. However, I would
also like to keep track of which regions correspond to the original
HTML tags, like links, paragraphs, em, etc. Basically, I would like a
final annotator that takes advantage of structural annotations (from
the HTML) and semantic annotations (from the other components), all at
once.

So, I can imagine starting off with a component that strips the HTML
markup and adds annotations to keep track of the tags I am interested
in. Does such a component exist already? It seems like something a lot
of people would want.
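
To make the idea concrete, here is a minimal sketch of such a component in
plain Python. It is not UIMA code and not how the Tika MarkupAnnotator is
implemented; the names MarkupStripper and strip_markup and the default tag
list are hypothetical. It strips the tags and records, for each tracked
element, the begin and end offsets of its content in the resulting plain
text.

```python
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Strip tags and remember where each tracked element lands in the plain text."""

    def __init__(self, tracked=("a", "p", "em")):
        super().__init__(convert_charrefs=True)
        self.tracked = set(tracked)  # tag names we want spans for (assumed list)
        self.chunks = []             # pieces of plain text seen so far
        self.length = 0              # current offset into the plain text
        self.open_tags = []          # stack of (tag, attrs, begin offset)
        self.spans = []              # finished (tag, begin, end, attrs) records

    def handle_starttag(self, tag, attrs):
        if tag in self.tracked:
            self.open_tags.append((tag, dict(attrs), self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened element with this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                name, attrs, begin = self.open_tags.pop(i)
                self.spans.append((name, begin, self.length, attrs))
                break

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def strip_markup(html):
    """Return (plain_text, spans); spans are sorted outermost-first."""
    parser = MarkupStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    spans = sorted(parser.spans, key=lambda s: (s[1], -s[2]))
    return "".join(parser.chunks), spans
```

For the input `<p>See <a href="http://example.org">this <em>link</em></a>.</p>`
this yields the text `See this link.` with spans p (0, 14), a (4, 13) and
em (9, 13). Elements that are never closed produce no span in this sketch;
a real component would have to decide how to handle them.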

If I do need to create it from scratch, what kind of component is it?
It's not just a straight annotator, because it needs to change the
Sofa: it needs to replace the markup with plain text.

Or should I have it create a new view of the document, so we maintain
a markup view and a plain text view of the document? This seems weird,
considering I will never care about the markup view again. Also, how
would I make sure the other annotators (which I won't be coding
myself) operate on the plain text view of the document rather than the
markup view?
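
The two-view arrangement can be modelled in a few lines of plain Python.
This is only a conceptual sketch, not the UIMA API: Document, run_pipeline
and capitalized_words are hypothetical names. The point it illustrates is
that the pipeline, not the annotator, decides which view a component sees,
so an annotator written by someone else needs no changes. As far as I know,
UIMA provides this kind of routing through Sofa mapping in the aggregate
descriptor.

```python
class Document:
    """Holds several named views (texts) of the same document."""

    def __init__(self):
        self.views = {}        # view name -> text
        self.annotations = {}  # view name -> list of (type, begin, end)

    def add_view(self, name, text):
        self.views[name] = text
        self.annotations[name] = []


def run_pipeline(doc, steps):
    """Run each (annotator, view_name) step against the view it is mapped to."""
    for annotator, view_name in steps:
        text = doc.views[view_name]
        doc.annotations[view_name].extend(annotator(text))


def capitalized_words(text):
    """Stand-in for a third-party annotator that only ever sees one text."""
    spans, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        pos = begin + len(word)
        if word[0].isupper():
            spans.append(("Capitalized", begin, pos))
    return spans
```

A pipeline would then add both views to the document and list the
third-party annotator with the plain text view, for example
`run_pipeline(doc, [(capitalized_words, "plain")])`, leaving the markup
view untouched.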

Thanks, Dave
