Hi David,

The XML2CAS component of the uima-connectors project [1,2] does that too, in a similar way to the Tika MarkupAnnotator. You can also specify the input and output views.

The major differences are:

* XML2CAS works only with XML, but it lets you specify which XML tags you want to turn into annotations in your CAS. The created annotations also have a finer type structure (for example, annotations are created for both XML elements and attributes, and they are all interconnected).
* MarkupAnnotator can handle HTML if you add the TagSoup parser jar [3] to the classpath.
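
To make the general idea concrete (strip the markup, but keep stand-off annotations for the tags you care about), here is a minimal sketch in Python using only the standard library. It is not how XML2CAS or MarkupAnnotator are implemented; the names `TagSpanStripper`, `strip_markup` and `keep` are hypothetical and only serve the illustration. In UIMA terms, the returned text would roughly play the role of the plain-text view, and the spans the role of the structural annotations on it.

```python
from html.parser import HTMLParser


class TagSpanStripper(HTMLParser):
    """Collect plain text and (tag, begin, end) offsets into that text."""

    def __init__(self, keep=("a", "p", "em")):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)   # tags to turn into annotations
        self.chunks = []        # plain-text pieces, in document order
        self.length = 0         # length of the plain text collected so far
        self.open_tags = []     # stack of (tag, begin offset)
        self.spans = []         # finished (tag, begin, end) annotations

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, begin = self.open_tags.pop(i)
                self.spans.append((tag, begin, self.length))
                break

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def strip_markup(html, keep=("a", "p", "em")):
    """Return (plain_text, spans) for the given HTML string."""
    parser = TagSpanStripper(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.spans


text, spans = strip_markup('<p>See <a href="x">the <em>docs</em></a> now.</p>')
print(text)   # See the docs now.
print(spans)  # [('em', 8, 12), ('a', 4, 12), ('p', 0, 17)]
```

The offsets refer to the stripped text, so downstream components that only see the plain text can still be lined up with the original structure afterwards.
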
Best

[1] http://code.google.com/p/uima-common/downloads/detail?name=uima-common-v120111.jar
[2] http://code.google.com/p/uima-connectors/downloads/detail?name=uima-connectors-v111205.jar
[3] http://ccil.org/~cowan/XML/tagsoup/

On Tue, Jun 19, 2012 at 5:49 AM, Greg Holmberg <[email protected]> wrote:
> Hi Dave--
>
> The Tika MarkupAnnotator does this.
>
> http://uima.apache.org/sandbox.html#tika.annotator
>
> Greg Holmberg
>
>
>> Hi there,
>>
>> I would like to create a pipeline that starts with HTML markup. I need
>> to strip this to plain text, so it can be processed by different
>> annotators, like POS, chunking, entity detection, etc. However I would
>> also like to keep track of which regions correspond to the original
>> html tags, like links, paragraphs, em, etc. Basically I would like a
>> final annotator that takes advantage of structural annotations (from
>> html) and semantic annotations (from the other components), all at
>> once.
>>
>> So, I can imagine starting off with a component that strips the html
>> markup and adds annotations to keep track of the tags I am interested
>> in. Does such a component exist already? It seems like something a lot
>> of people would want.
>>
>> If I do need to create it from scratch, what kind of component is it?
>> It's not just a straight annotator, because it needs to change the
>> SOFA: it needs to replace the markup with plain text.
>>
>> Or should I have it create a new view of the document, so we maintain
>> a markup view and a plain text view of the document? This seems weird,
>> considering I will never care about the markup view again. Also, how
>> would I make sure the other annotators (which I won't be coding
>> myself) operate on the plain text view of the document rather than the
>> markup view?
>>
>> Thanks,
>> Dave

--
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
