Hi Dave,

The Tika MarkupAnnotator does this.
http://uima.apache.org/sandbox.html#tika.annotator

Greg Holmberg

Hi there,

I would like to create a pipeline that starts with HTML markup. I need to strip this to plain text, so it can be processed by different annotators, like POS, chunking, entity detection, etc. However I would also like to keep track of which regions correspond to the original html tags, like links, paragraphs, em, etc. Basically I would like a final annotator that takes advantage of structural annotations (from html) and semantic annotations (from the other components), all at once.

So, I can imagine starting off with a component that strips the html markup and adds annotations to keep track of the tags I am interested in. Does such a component exist already? It seems like something a lot of people would want.

If I do need to create it from scratch, what kind of component is it? It's not just a straight annotator, because it needs to change the SOFA: it needs to replace the markup with plain text. Or should I have it create a new view of the document, so we maintain a markup view and a plain text view of the document? This seems weird, considering I will never care about the markup view again.

Also, how would I make sure the other annotators (which I won't be coding myself) operate on the plain text view of the document rather than the markup view?

Thanks,
Dave
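
For anyone reading this thread later, the core idea in the question (strip the markup, but remember where each tag's content ends up in the plain text) can be sketched in plain Java without any UIMA or Tika dependency. This is only an illustration under stated assumptions, not how the Tika MarkupAnnotator is implemented: the names MarkupStripper, Span and Result are made up for this example, and the regex-based tag matching ignores entities, comments, script content and attribute values containing ">". In a UIMA pipeline, the resulting text would presumably be set as the document text of a separate plain-text view, and each span would become an annotation on that view.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of UIMA or Tika.
class MarkupStripper {

    /** A region of the plain text that was enclosed by a tag in the markup. */
    static final class Span {
        final String tag;
        final int begin; // inclusive offset into the plain text
        final int end;   // exclusive offset into the plain text

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return tag + "[" + begin + "," + end + ")";
        }
    }

    /** The stripped text plus the tag regions expressed in plain-text offsets. */
    static final class Result {
        final String text;
        final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static Result strip(String html) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<Span> open = new ArrayDeque<>(); // stack of tags not yet closed
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            // Copy the character data that precedes this tag.
            text.append(html, last, m.start());
            last = m.end();
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = m.group().endsWith("/>");
            if (closing) {
                // Look for the nearest open tag with the same name.
                boolean found = false;
                for (Span o : open) {
                    if (o.tag.equals(name)) {
                        found = true;
                        break;
                    }
                }
                if (found) {
                    // Drop unclosed tags above it (e.g. <br>), then record the span.
                    Span o;
                    do {
                        o = open.pop();
                    } while (!o.tag.equals(name));
                    spans.add(new Span(name, o.begin, text.length()));
                }
                // A stray closing tag with no matching opener is ignored.
            } else if (!selfClosing) {
                // Remember where this element's content starts in the plain text.
                open.push(new Span(name, text.length(), -1));
            }
        }
        text.append(html, last, html.length());
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = strip("<p>Hello <em>world</em>, see <a href=\"x.html\">this link</a>.</p>");
        System.out.println(r.text);
        for (Span s : r.spans) {
            System.out.println(s + " -> " + r.text.substring(s.begin, s.end));
        }
    }
}
```

Because every span is stored in plain-text offsets, downstream components only ever need to see the stripped text, and a final component can line the structural spans up against whatever the POS, chunking or entity annotators produced over the same offsets.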
