Re: Approach for keeping track of formatting associated with text views

Mario Gazzo Wed, 18 Feb 2015 14:05:40 -0800

Thanks. This looks interesting and seems like it could fit our use case. We 
will take a closer look at it.


> On 18 Feb 2015, at 21:58 , Peter Klügl <[email protected]> wrote:
> 
> Hi,
> 
> you might want to take a look at two analysis engines of UIMA Ruta: 
> HtmlAnnotator and HtmlConverter [1]
> 
> The former creates annotations for HTML elements and therefore also for 
> XML tags. The latter creates a new view containing only the plain text and 
> adds the existing annotations, adapting their offsets to the new document.
> 
> Best,
> 
> Peter
> 
> [1] 
> http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
> 
> On 18 Feb 2015, at 21:46, Mario Gazzo wrote:
>> We are starting to use the UIMA framework for NL processing of article text, 
>> which is usually stored with metadata in some XML format. We need to extract 
>> text elements to be processed by various NL analysis engines that only work 
>> with pure text but we also need to keep track of the formatting information 
>> related to the processed text. It is in general also valuable for us to be 
>> able to track every annotation back to the original XML to maintain 
>> provenance. Before embarking on this I would like to validate our approach with 
>> more experienced users since this is the first application we are building 
>> with UIMA.
>> 
>> In the first step we would annotate every important element of the XML 
>> including formatting elements in the body. We maintain some DOM-like 
>> relationships between the body text and formatting annotations so that text 
>> formatting can be reproduced later with NLP annotations in some article 
>> viewer.
>> 
>> Next we would in another AE produce a pure text view of the text annotations 
>> in the XML view that need to be NL analysed. In this new text view we would 
>> annotate the different text elements with references back to their 
>> counterpart in the original XML view so that we can trace back positions in 
>> the original XML and the formatting relations. This of course will require 
>> mapping NLP annotation offsets in the text view back to the XML view but the 
>> information should then be there to make this possible.
>> 
>> This approach requires somewhat more handcrafted bookkeeping than we 
>> initially hoped would be necessary. We haven’t been able to find any 
>> examples of how this is usually done and the UIMA docs are vague regarding 
>> managing these kinds of relationships across views. We would therefore really 
>> like to know if there is a simpler and better approach.
>> 
>> Any feedback is greatly appreciated. Thanks.
>
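
The offset bookkeeping discussed in this thread (produce a plain-text view, 
then map NLP annotation offsets back to the original XML) can be sketched in 
plain Java. This is only an illustration of the idea, not UIMA or Ruta API: 
the class PlainTextView and its methods fromXml and toXmlSpan are hypothetical 
names, and the tag stripping is deliberately naive (no entities, comments, or 
CDATA). Within UIMA, the HtmlConverter mentioned above is described as doing 
this kind of offset adaptation when it builds the new view.

```java
import java.util.Arrays;

/** Hypothetical helper: plain text extracted from markup, plus a map back to source offsets. */
final class PlainTextView {
    /** The markup with everything between '<' and '>' removed. */
    final String text;
    /** source[i] is the offset in the original XML of text.charAt(i). */
    private final int[] source;

    private PlainTextView(String text, int[] source) {
        this.text = text;
        this.source = source;
    }

    /** Strips tags naively and records where each kept character came from. */
    static PlainTextView fromXml(String xml) {
        StringBuilder sb = new StringBuilder();
        int[] map = new int[xml.length()];
        boolean inTag = false;
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[sb.length()] = i; // remember the XML offset of this character
                sb.append(c);
            }
        }
        return new PlainTextView(sb.toString(), Arrays.copyOf(map, sb.length()));
    }

    /** Maps a [begin, end) span in the plain text to a [begin, end) span in the XML. */
    int[] toXmlSpan(int begin, int end) {
        if (begin < 0 || end > text.length() || begin >= end) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        // The end is exclusive, so take the last covered character and add one.
        return new int[] { source[begin], source[end - 1] + 1 };
    }

    public static void main(String[] args) {
        String xml = "<p>Hello <b>world</b>!</p>";
        PlainTextView view = fromXml(xml);
        int[] span = view.toXmlSpan(6, 11); // "world" in the plain text
        System.out.println(xml.substring(span[0], span[1])); // prints "world"
    }
}
```

A span that crosses a formatting boundary in the plain text maps to an XML 
span that includes the intervening tags, which is one way the formatting 
relations could be recovered for a viewer.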
