Re: Approach for keeping track of formatting associated with text views

Mario Gazzo Tue, 10 Mar 2015 05:55:36 -0700

Thanks, I can of course open an issue for this.

I have been playing with a modified version of the HTMLConverter, which is why 
my reply is delayed. I disabled the ‘inBody’ flag inside the 
HTMLConverterVisitor to get an idea of what the effects might be. It pretty 
much did what I thought I wanted, except that there are no clear sentence 
boundaries between many of the metadata strings. Most of them are not really 
meaningful for NL processing, but there are a few we would want to analyse, and 
their sentence separation is now gone. I have been looking at some of the 
conversion and line break options to get around this, but I haven't found a 
good approach yet. I really only want to introduce some sentence separation 
like “. ” between the contents of different tags outside the body.
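
To make the idea concrete, here is a minimal sketch in plain Python of the separator logic I have in mind. It is not Ruta or UIMA code: the function name `join_with_sentence_breaks` and its parameters are hypothetical, and it only assumes that the text content of each tag outside the body is available as a string. It joins the pieces with “. ” unless the previous piece already ends a sentence, and records where each piece lands in the joined text so annotations could still be placed on it.

```python
from typing import List, Tuple


def join_with_sentence_breaks(
    pieces: List[str], sep: str = ". "
) -> Tuple[str, List[Tuple[int, int]]]:
    """Join text pieces with a sentence separator.

    Returns the joined text and the (begin, end) span of every
    non-empty piece inside that joined text.
    """
    parts: List[str] = []
    spans: List[Tuple[int, int]] = []
    pos = 0
    for piece in pieces:
        text = piece.strip()
        if not text:
            continue  # ignore tags without text content
        if parts:
            # parts[-1] is always the previous piece here, never a separator.
            ends_sentence = parts[-1].endswith((".", "!", "?"))
            glue = " " if ends_sentence else sep
            parts.append(glue)
            pos += len(glue)
        spans.append((pos, pos + len(text)))
        parts.append(text)
        pos += len(text)
    return "".join(parts), spans


# Hypothetical metadata strings taken from tags outside the body.
text, spans = join_with_sentence_breaks(["A study of X", "Journal of Y.", "2015"])
print(text)
print([text[b:e] for b, e in spans])
```

The recorded spans refer to the joined text only; mapping them back to the original XML offsets would need the source position of each piece as well.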


I am not sure I understand your offset question. Would you mind elaborating 
on it? Our documents are in XML with a single body element containing HTML.


> On 07 Mar 2015, at 17:33, Peter Klügl wrote:
> 
> Hi,
> 
> there is no way yet to customize this behavior. The HtmlConverter only 
> retains annotations with a length > 0, since annotations with length == 0 are 
> rather problematic and should be avoided.
> 
> I can add a configuration parameter for keeping these annotations if you want 
> (best open an issue for it). What should be the offsets of the annotations 
> for elements in the head of the HTML document? 0, those of the first token, or 
> those of the document annotation?
> 
> Best,
> 
> Peter
> 
> 
> Am 06.03.2015 um 14:00 schrieb Mario Gazzo:
>> We conducted some experiments with both the HtmlAnnotator and the 
>> HtmlConverter but we ran into an issue with the converter. It appears to 
>> only convert tag annotations that surround or are inside the body tag. 
>> Metadata elements like citations are ignored. The only way to get around 
>> this seems to be by forking and modifying the codebase, which I would like to 
>> avoid. Both modules seem otherwise very useful to us but I am looking for a 
>> better approach to solve this issue. Is there some way to customise this 
>> behaviour without code modifications?
>> 
>> Your input is appreciated, thanks.
>> 
>> 
>>> On 18 Feb 2015, at 23:03, Mario Gazzo wrote:
>>> 
>>> Thanks. Looks interesting, seems that it could fit our use case. We will 
>>> have a closer look at it.
>>> 
>>>> On 18 Feb 2015, at 21:58, Peter Klügl wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> you might want to take a look at two analysis engines of UIMA Ruta: 
>>>> HtmlAnnotator and HtmlConverter [1]
>>>> 
>>>> The former one creates annotations for HTML elements and therefore also for 
>>>> XML tags. The latter one creates a new view with only the plain text and 
>>>> adds existing annotations while adapting their offsets to the new document.
>>>> 
>>>> Best,
>>>> 
>>>> Peter
>>>> 
>>>> [1] 
>>>> http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>> 
>>>> Am 18.02.2015 um 21:46 schrieb Mario Gazzo:
>>>>> We are starting to use the UIMA framework for NL processing of article text, 
>>>>> which is usually stored with metadata in some XML format. We need to 
>>>>> extract text elements to be processed by various NL analysis engines that 
>>>>> only work with pure text but we also need to keep track of the formatting 
>>>>> information related to the processed text. It is in general also valuable 
>>>>> for us to be able to track every annotation back to the original XML to 
>>>>> maintain provenance. Before embarking on this I would like to validate our 
>>>>> approach with more experienced users since this is the first application 
>>>>> we are building with UIMA.
>>>>> 
>>>>> In the first step we would annotate every important element of the XML 
>>>>> including formatting elements in the body. We maintain some DOM-like 
>>>>> relationships between the body text and formatting annotations so that 
>>>>> text formatting can be reproduced later with NLP annotations in some 
>>>>> article viewer.
>>>>> 
>>>>> Next, in another AE, we would produce a pure text view of the text 
>>>>> annotations in the XML view that need NL analysis. In this new text 
>>>>> view we would annotate the different text elements with references back 
>>>>> to their counterparts in the original XML view so that we can trace back 
>>>>> positions in the original XML and the formatting relations. This will of 
>>>>> course require mapping NLP annotation offsets in the text view back 
>>>>> to the XML view, but the information should then be there to make this 
>>>>> possible.
>>>>> 
>>>>> This approach requires somewhat more handcrafted bookkeeping than we 
>>>>> initially hoped would be necessary. We haven’t been able to find any 
>>>>> examples of how this is usually done and the UIMA docs are vague 
>>>>> regarding managing these kinds of relationships across views. We would 
>>>>> therefore really like to know if there is a simpler and better approach.
>>>>> 
>>>>> Any feedback is greatly appreciated. Thanks.
>
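
The offset bookkeeping discussed in this thread (a plain-text view whose annotations can be traced back to the original XML) can be illustrated with a small sketch. This is plain Python rather than UIMA code, and it is not how the HtmlConverter is implemented. The helper names `strip_tags_with_map` and `to_xml_span` are hypothetical, tags are found with a simple regular expression instead of a real XML parser, and character entities such as `&amp;` are left undecoded, so treat it only as an illustration of the mapping.

```python
import bisect
import re
from typing import List, Tuple

# Simplifying assumption: anything between '<' and '>' is a tag.
TAG = re.compile(r"<[^>]*>")

# One segment: (begin in plain text, begin in XML, length).
Segment = Tuple[int, int, int]


def strip_tags_with_map(xml: str) -> Tuple[str, List[Segment]]:
    """Remove tags and remember where every text run came from."""
    parts: List[str] = []
    segments: List[Segment] = []
    text_pos = 0
    xml_pos = 0
    for match in TAG.finditer(xml):
        if match.start() > xml_pos:
            chunk = xml[xml_pos:match.start()]
            segments.append((text_pos, xml_pos, len(chunk)))
            parts.append(chunk)
            text_pos += len(chunk)
        xml_pos = match.end()
    if xml_pos < len(xml):
        chunk = xml[xml_pos:]
        segments.append((text_pos, xml_pos, len(chunk)))
        parts.append(chunk)
    return "".join(parts), segments


def _char_to_xml(segments: List[Segment], text_offset: int) -> int:
    """Map the position of one plain-text character to its XML position."""
    starts = [seg[0] for seg in segments]
    i = bisect.bisect_right(starts, text_offset) - 1
    if i < 0:
        raise ValueError("offset lies before the first text segment")
    text_begin, xml_begin, length = segments[i]
    if text_offset >= text_begin + length:
        raise ValueError("offset lies beyond the plain text")
    return xml_begin + (text_offset - text_begin)


def to_xml_span(segments: List[Segment], begin: int, end: int) -> Tuple[int, int]:
    """Map a non-empty plain-text span [begin, end) to an XML span."""
    if end <= begin:
        raise ValueError("zero-length spans have no unique XML position")
    # Map the last covered character, then step one past it, so that an
    # end offset on a segment boundary stays in front of the following tag.
    return _char_to_xml(segments, begin), _char_to_xml(segments, end - 1) + 1


xml = "<p>Gene <b>BRCA1</b> is studied.</p>"
text, segments = strip_tags_with_map(xml)
b, e = to_xml_span(segments, 5, 10)  # span of "BRCA1" in the plain text
print(text, (b, e), xml[b:e])
```

Zero-length spans are rejected because they have no unique position in the XML, which is the same difficulty raised above for annotations of length 0.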
