Re: Approach for keeping track of formatting associated with text views

Mario Gazzo Sat, 14 Mar 2015 06:51:06 -0700

No problem. You can contact me anytime in case you have additional questions.


> On 14 Mar 2015, at 14:34 , Peter Klügl <[email protected]> wrote:
> 
> Hi,
> 
> 
> 
> thanks for the issue and sorry for the delayed response. I have not yet found 
> the time to look into it, but I will in the next few days.
> 
> Best,
> 
> Peter
> 
> On 13.03.2015 at 23:51, Mario Gazzo wrote:
>> The issue has now been created:
>> 
>> https://issues.apache.org/jira/browse/UIMA-4286
>> 
>> 
>>> On 11 Mar 2015, at 14:47 , Mario Gazzo <[email protected]> wrote:
>>> 
>>> Thanks, I understand the choices now. I would also probably prefer to use 
>>> the document annotation if no text content is associated with the tag. 
>>> However, ideally I would prefer that tag annotations get the offsets of 
>>> content that is within their scope but otherwise get offsets of content 
>>> within their closest shared ancestor element. Ultimately this could end up 
>>> being the document annotation. E.g.
>>> 
>>> <journal-meta>
>>>    <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
>>>    <journal-title>Environmental Health Perspectives</journal-title>
>>>    <issn pub-type="ppub">0091-6765</issn>
>>>    <publisher>
>>>        <publisher-name>National Institute of Environmental Health Sciences</publisher-name>
>>>    </publisher>
>>> </journal-meta>
>>> 
>>> I would here expect journal-meta to have the offsets of all content within 
>>> its scope, which in the converted view of my experiments gets combined to 
>>> the following “Environ Health PerspectEnvironmental Health 
>>> Perspectives0091-6765National Institute of Environmental Health Sciences”. 
>>> This works as expected when I just disable the “inBody”-flag of the 
>>> HtmlConverterVisitor except that there is no clear separation between the 
>>> content elements any longer, which is why I would like to have a sentence 
>>> separator like “. ” between them so that I instead get: “Environ Health 
>>> Perspect. Environmental Health Perspectives. 0091-6765. National Institute 
>>> of Environmental Health Sciences.”. The dot separators should then of 
>>> course not be included in the converter’s offsets since they are not part of 
>>> the original text.
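The separator behaviour described above can be illustrated with a small sketch. It is plain Python using only `xml.etree.ElementTree`, not the HtmlConverter or any UIMA API; the names `flatten`, `SEPARATOR` and `SAMPLE` are made up for this illustration. It assumes whitespace inside an element is collapsed and ignores tail text (text that follows a closing child tag). Each leaf span covers only original text, while the ". " separators fall between spans.

```python
import xml.etree.ElementTree as ET

SEPARATOR = ". "  # hypothetical separator; never part of a leaf element's span

SAMPLE = """<journal-meta>
   <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
   <journal-title>Environmental Health Perspectives</journal-title>
   <issn pub-type="ppub">0091-6765</issn>
   <publisher>
       <publisher-name>National Institute of Environmental Health Sciences</publisher-name>
   </publisher>
</journal-meta>"""


def flatten(root):
    """Return (plain_text, spans); spans holds (tag, begin, end) per element."""
    parts = []   # text chunks and separators, joined at the end
    spans = []   # offsets into the joined plain text
    pos = 0      # current length of the joined plain text

    def visit(elem):
        nonlocal pos
        begin = end = None
        text = " ".join((elem.text or "").split())  # collapse whitespace
        if text:
            if parts:  # separate this chunk from the previous one
                parts.append(SEPARATOR)
                pos += len(SEPARATOR)
            begin = pos
            parts.append(text)
            pos += len(text)
            end = pos
        for child in elem:
            child_span = visit(child)
            if child_span is not None:
                if begin is None:
                    begin = child_span[0]
                end = child_span[1]
        if begin is None:
            return None  # no content in scope, so no span is recorded
        spans.append((elem.tag, begin, end))
        return begin, end

    visit(root)
    return "".join(parts), spans


text, spans = flatten(ET.fromstring(SAMPLE))
for tag, begin, end in spans:
    print(tag, begin, end, repr(text[begin:end]))
```

A parent such as journal-meta necessarily spans the separators between its children, but no span starts or ends on one.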
>>> 
>>> Additionally there might be a case where a meta tag doesn’t have any 
>>> content within its scope but it contains attribute values:
>>> 
>>> <Parent>
>>>     <Child1 attribute="someValue" />
>>>     <Child2>Some content.</Child2>
>>> </Parent>
>>> 
>>> In this case I would prefer that Child1 has the same offsets as Child2 
>>> since the tag is most closely related to that content. In case there is no 
>>> content within the scope of its parent then I would find the first ancestor 
>>> that contains content within its scope and use that offset although this 
>>> choice is questionable. I don’t have a good example of this case, though, so I 
>>> presume such cases are rare in practice.
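The fallback rule for tags without content could look roughly like the sketch below. It is again plain Python rather than anything in UIMA Ruta; `content_spans` and `resolve_span` are hypothetical helpers. An element with no text in its scope borrows the span of the closest ancestor that does have content, which in the Parent/Child1/Child2 example is the same span as Child2.

```python
import xml.etree.ElementTree as ET


def content_spans(root):
    """Map each element that has text in its scope to (begin, end)."""
    spans = {}
    pos = 0  # length of the concatenated text seen so far

    def visit(elem):
        nonlocal pos
        begin = pos
        pos += len((elem.text or "").strip())
        for child in elem:
            visit(child)
        if pos > begin:  # only elements with content get a span
            spans[elem] = (begin, pos)

    visit(root)
    return spans


def resolve_span(elem, spans, parents):
    """Span of elem, or of the closest ancestor with content in its scope."""
    node = elem
    while node is not None:
        if node in spans:
            return spans[node]
        node = parents.get(node)  # the root has no parent, which ends the walk
    return None  # the document has no text content at all


root = ET.fromstring(
    '<Parent><Child1 attribute="someValue" /><Child2>Some content.</Child2></Parent>'
)
parents = {child: parent for parent in root.iter() for child in parent}
spans = content_spans(root)
print(resolve_span(root[0], spans, parents))  # Child1 borrows its parent's span
```

Walking up to the root means the last resort is the span of the whole document, which matches the document-annotation option discussed earlier in the thread.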
>>> 
>>> That said, the latter is more complicated to implement, so I would be happy 
>>> if I could just easily turn off the “inBody”-test in the 
>>> HtmlConverterVisitor and have some way to add content separation between 
>>> tags outside body without resorting to code modifications.
>>> 
>>> Hope this feedback was helpful.
>>> 
>>> Your time is much appreciated, thanks.
>>> 
>>> 
>>>> On 09 Mar 2015, at 16:56 , Jens Grivolla <[email protected]> wrote:
>>>> 
>>>> Hi Peter, while I don't think I will be using the HtmlConverter right away,
>>>> I would vote for using the length of the document annotation for
>>>> annotations that relate to the whole document (such as metadata).  That
>>>> makes them show up nicely in the CasEditor/Viewer and you could maintain it
>>>> in all segments when you split a CAS (e.g. with something based on the
>>>> SimpleTextSegmenter example).
>>>> 
>>>> -- Jens
>>>> 
>>>> On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> there is no way yet to customize this behavior. The HtmlConverter only
>>>>> retains annotations with a length > 0, since annotations with length == 0 are
>>>>> rather problematic and should be avoided.
>>>>> 
>>>>> I can add a configuration parameter for keeping these annotations if you
>>>>> want (it would be best to open an issue for it). What should the offsets of the
>>>>> annotations for elements in the head of the HTML document be? 0, those of the
>>>>> first token, or those of the document annotation?
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Peter
>>>>> 
>>>>> 
>>>>> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>>>>> 
>>>>>> We conducted some experiments with both the HtmlAnnotator and the
>>>>>> HtmlConverter, but we ran into an issue with the converter. It appears to
>>>>>> only convert tag annotations that surround or are inside the body tag.
>>>>>> Metadata elements like citations are ignored. The only way to get around
>>>>>> this seems to be forking and modifying the codebase, which I would like to
>>>>>> avoid. Both modules otherwise seem very useful to us, but I am looking for a
>>>>>> better approach to solve this issue. Is there some way to customise this
>>>>>> behaviour without code modifications?
>>>>>> 
>>>>>> Your input is appreciated, thanks.
>>>>>> 
>>>>>> 
>>>>>> On 18 Feb 2015, at 23:03 , Mario Gazzo <[email protected]> wrote:
>>>>>>> Thanks. Looks interesting, seems that it could fit our use case. We will
>>>>>>> have a closer look at it.
>>>>>>> 
>>>>>>> On 18 Feb 2015, at 21:58 , Peter Klügl <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> you might want to take a look at two analysis engines of UIMA Ruta:
>>>>>>>> HtmlAnnotator and HtmlConverter [1]
>>>>>>>> 
>>>>>>>> The former creates annotations for HTML elements and therefore also
>>>>>>>> for XML tags. The latter creates a new view with only the plain text
>>>>>>>> and adds the existing annotations while adapting their offsets to the new
>>>>>>>> document.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> 
>>>>>>>> Peter
>>>>>>>> 
>>>>>>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>>>>>> 
>>>>>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>>>>> 
>>>>>>>>> We are starting to use the UIMA framework for NL processing of article
>>>>>>>>> text, which is usually stored with metadata in some XML format. We need to
>>>>>>>>> extract text elements to be processed by various NL analysis engines that
>>>>>>>>> only work with pure text, but we also need to keep track of the formatting
>>>>>>>>> information related to the processed text. It is in general also valuable
>>>>>>>>> for us to be able to track every annotation back to the original XML to
>>>>>>>>> maintain provenance. Before embarking on this, I would like to validate our
>>>>>>>>> approach with more experienced users, since this is the first application we
>>>>>>>>> are building with UIMA.
>>>>>>>>> 
>>>>>>>>> In the first step we would annotate every important element of the XML
>>>>>>>>> including formatting elements in the body. We maintain some DOM-like
>>>>>>>>> relationships between the body text and formatting annotations so that text
>>>>>>>>> formatting can be reproduced later with NLP annotations in some article
>>>>>>>>> viewer.
>>>>>>>>> 
>>>>>>>>> Next we would in another AE produce a pure text view of the text
>>>>>>>>> annotations in the XML view that need to be NL analysed. In this new text
>>>>>>>>> view we would annotate the different text elements with references back to
>>>>>>>>> their counterpart in the original XML view so that we can trace back
>>>>>>>>> positions in the original XML and the formatting relations. This of course
>>>>>>>>> will require mapping NLP annotation offsets in the text view back to the
>>>>>>>>> XML view but the information should then be there to make this possible.
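One way to keep the offset mapping between the two views is to record, for every character of the plain text, where it came from in the XML. The sketch below is plain Python and does not use UIMA views; `strip_tags` and `to_source_span` are names invented here. It assumes simple markup without entities, CDATA sections, comments, or ">" inside attribute values.

```python
def strip_tags(source):
    """Return (plain_text, offset_map).

    offset_map[i] is the index in source of the character plain_text[i].
    """
    chars, offset_map, in_tag = [], [], False
    for i, ch in enumerate(source):
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            chars.append(ch)
            offset_map.append(i)
    return "".join(chars), offset_map


def to_source_span(begin, end, offset_map):
    """Map a non-empty [begin, end) span in the plain text to the source."""
    return offset_map[begin], offset_map[end - 1] + 1


source = "<p>Some <b>bold</b> text.</p>"
plain, offset_map = strip_tags(source)
begin = plain.index("bold")
src_begin, src_end = to_source_span(begin, begin + len("bold"), offset_map)
print(plain, "|", source[src_begin:src_end])
```

A span that crosses tag boundaries maps to a source range that includes the markup in between, which is usually what is needed to trace an annotation back to the original XML.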
>>>>>>>>> 
>>>>>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>>>>>> initially hoped would be necessary. We haven’t been able to find any
>>>>>>>>> examples of how this is usually done, and the UIMA docs are vague regarding
>>>>>>>>> managing these kinds of relationships across views. We would therefore really
>>>>>>>>> like to know if there is a simpler and better approach.
>>>>>>>>> 
>>>>>>>>> Any feedback is greatly appreciated. Thanks.
>>>>>>>>> 
>> 
>
