Thanks. This looks interesting and seems like it could fit our use case. We will take a closer look at it.
> On 18 Feb 2015, at 21:58, Peter Klügl <[email protected]> wrote:
>
> Hi,
>
> you might want to take a look at two analysis engines of UIMA Ruta:
> HtmlAnnotator and HtmlConverter [1]
>
> The former creates annotations for HTML elements and therefore also for
> XML tags. The latter creates a new view with only the plain text and adds
> existing annotations while adapting their offsets to the new document.
>
> Best,
>
> Peter
>
> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>
> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>> We are starting to use the UIMA framework for NL processing of article text,
>> which is usually stored with metadata in some XML format. We need to extract
>> text elements to be processed by various NL analysis engines that only work
>> with pure text, but we also need to keep track of the formatting information
>> related to the processed text. It is in general also valuable for us to be
>> able to track every annotation back to the original XML to maintain
>> provenance. Before embarking on this, I would like to validate our approach
>> with more experienced users, since this is the first application we are
>> building with UIMA.
>>
>> In the first step we would annotate every important element of the XML,
>> including formatting elements in the body. We maintain some DOM-like
>> relationships between the body text and formatting annotations so that text
>> formatting can be reproduced later with NLP annotations in some article
>> viewer.
>>
>> Next, in another AE, we would produce a pure text view of the text
>> annotations in the XML view that need to be NL analysed. In this new text
>> view we would annotate the different text elements with references back to
>> their counterparts in the original XML view so that we can trace back
>> positions in the original XML and the formatting relations. This of course
>> will require mapping NLP annotation offsets in the text view back to the XML
>> view, but the information should then be there to make this possible.
>>
>> This approach requires somewhat more handcrafted bookkeeping than we
>> initially hoped would be necessary. We haven't been able to find any
>> examples of how this is usually done, and the UIMA docs are vague regarding
>> managing this kind of relationship across views. We would therefore really
>> like to know if there is a simpler and better approach.
>>
>> Any feedback is greatly appreciated. Thanks.
>
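
For illustration only, the offset bookkeeping discussed in the quoted messages can be sketched outside of UIMA in a few lines of Python. This is not the UIMA or Ruta API. The function names strip_tags_with_offsets and to_xml_span are hypothetical, and the sketch assumes well-formed markup with no entities, comments, or CDATA sections. It only shows the idea of recording, for every character of the plain text view, its position in the XML view, so that a span found by an NLP step can be traced back.

```python
import re

# Assumption: a tag is anything between '<' and '>'; entities, comments
# and CDATA are not handled in this sketch.
TAG = re.compile(r"<[^>]+>")


def strip_tags_with_offsets(xml):
    """Return (plain_text, offsets); offsets[i] is the index in xml of plain_text[i]."""
    chars = []
    offsets = []
    pos = 0
    for match in TAG.finditer(xml):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(xml[i])
            offsets.append(i)
        pos = match.end()
    # Copy any text after the last tag.
    for i in range(pos, len(xml)):
        chars.append(xml[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_xml_span(offsets, begin, end):
    """Map a non-empty half-open span [begin, end) of the plain text to the XML."""
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span must be non-empty and inside the plain text")
    return offsets[begin], offsets[end - 1] + 1


xml = "<p>Aspirin <b>reduces</b> fever.</p>"
text, offsets = strip_tags_with_offsets(xml)
begin = text.index("reduces")
xml_begin, xml_end = to_xml_span(offsets, begin, begin + len("reduces"))
print(text)
print(xml[xml_begin:xml_end])
```

A span that crosses a formatting boundary maps to an XML range that includes the intervening tags, which is usually what is wanted for provenance. Inside UIMA, the HtmlConverter mentioned above performs the corresponding offset adaptation between views, so a hand-written map like this would only be needed if that component does not fit.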
