No problem. You can contact me anytime in case you have additional questions.
> On 14 Mar 2015, at 14:34, Peter Klügl <[email protected]> wrote:
>
> Hi,
>
> thanks for the issue and sorry for the delayed response. I have not yet
> found the time to look into it, but I will in the next few days.
>
> Best,
>
> Peter
>
> On 13.03.2015 at 23:51, Mario Gazzo wrote:
>> The issue has now been created:
>>
>> https://issues.apache.org/jira/browse/UIMA-4286
>>
>>> On 11 Mar 2015, at 14:47, Mario Gazzo <[email protected]> wrote:
>>>
>>> Thanks, I understand the choices now. I would also probably prefer to use
>>> the document annotation if no text content is associated with the tag.
>>> However, ideally I would prefer that tag annotations get the offsets of
>>> content that is within their scope but otherwise get offsets of content
>>> within their closest shared ancestor element. Ultimately this could end up
>>> being the document annotation. E.g.
>>>
>>> <journal-meta>
>>> <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
>>> <journal-title>Environmental Health Perspectives</journal-title>
>>> <issn pub-type="ppub">0091-6765</issn>
>>> <publisher>
>>> <publisher-name>National Institute of Environmental Health
>>> Sciences</publisher-name>
>>> </publisher>
>>> </journal-meta>
>>>
>>> I would here expect journal-meta to have the offsets of all content within
>>> its scope, which in the converted view of my experiments gets combined to
>>> the following: “Environ Health PerspectEnvironmental Health
>>> Perspectives0091-6765National Institute of Environmental Health Sciences”.
>>> This works as expected when I just disable the “inBody”-flag of the
>>> HtmlConverterVisitor except that there is no clear separation between the
>>> content elements any longer, which is why I would like to have a sentence
>>> separator like “. ” between them so that I instead get: “Environ Health
>>> Perspect. Environmental Health Perspectives. 0091-6765.
>>> National Institute of Environmental Health Sciences.” The dot separators
>>> should then of course not be included in the converter's offsets since
>>> they are not part of the original text.
>>>
>>> Additionally there might be a case where a meta tag doesn’t have any
>>> content within its scope but it contains attribute values:
>>>
>>> <Parent>
>>> <Child1 attribute="someValue" />
>>> <Child2>Some content.</Child2>
>>> </Parent>
>>>
>>> In this case I would prefer that Child1 has the same offsets as Child2
>>> since the tag is most closely related to that content. In case there is no
>>> content within the scope of its parent then I would find the first ancestor
>>> that contains content within its scope and use that offset although this
>>> choice is questionable. I don’t have a good example of this case though, so
>>> I presume such cases are rare in reality.
>>>
>>> That said, the latter is more complicated to implement, so I would be happy
>>> if I could just easily turn off the “inBody”-test in the
>>> HtmlConverterVisitor and have some way to add content separation between
>>> tags outside body without resorting to code modifications.
>>>
>>> Hope this feedback was helpful.
>>>
>>> Your time is much appreciated, thanks.
>>>
>>>> On 09 Mar 2015, at 16:56, Jens Grivolla <[email protected]> wrote:
>>>>
>>>> Hi Peter, while I don't think I will be using the HtmlConverter right away,
>>>> I would vote for using the length of the document annotation for
>>>> annotations that relate to the whole document (such as metadata). That
>>>> makes them show up nicely in the CasEditor/Viewer and you could maintain it
>>>> in all segments when you split a CAS (e.g. with something based on the
>>>> SimpleTextSegmenter example).
>>>>
>>>> -- Jens
>>>>
>>>> On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> there is no way yet to customize this behavior.
>>>>> The HtmlConverter only retains annotations of a length > 0 since
>>>>> annotations with length == 0 are rather problematic and should be avoided.
>>>>>
>>>>> I can add a configuration parameter for keeping these annotations if you
>>>>> want (best to open an issue for it). What should be the offsets of the
>>>>> annotations for elements in the head of the HTML document? 0, those of the
>>>>> first token or those of the document annotation?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>>>>>
>>>>>> We conducted some experiments with both the HtmlAnnotator and the
>>>>>> HtmlConverter but we ran into an issue with the converter. It appears to
>>>>>> only convert tag annotations that surround or are inside the body tag.
>>>>>> Metadata elements like citations are ignored. The only way to get around
>>>>>> this seems to be by forking and modifying the codebase, which I would
>>>>>> like to avoid. Both modules seem otherwise very useful to us but I am
>>>>>> looking for a better approach to solve this issue. Is there some way to
>>>>>> customise this behaviour without code modifications?
>>>>>>
>>>>>> Your input is appreciated, thanks.
>>>>>>
>>>>>> On 18 Feb 2015, at 23:03, Mario Gazzo <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks. Looks interesting, seems that it could fit our use case. We will
>>>>>>> have a closer look at it.
>>>>>>>
>>>>>>> On 18 Feb 2015, at 21:58, Peter Klügl <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> you might want to take a look at two analysis engines of UIMA Ruta:
>>>>>>>> HtmlAnnotator and HtmlConverter [1]
>>>>>>>>
>>>>>>>> The former creates annotations for HTML elements and therefore also
>>>>>>>> for XML tags. The latter creates a new view with only the plain text
>>>>>>>> and adds existing annotations while adapting their offsets to the new
>>>>>>>> document.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>>>>>>
>>>>>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>>>>>
>>>>>>>>> We are starting to use the UIMA framework for NL processing of article
>>>>>>>>> text, which is usually stored with metadata in some XML format. We need
>>>>>>>>> to extract text elements to be processed by various NL analysis engines
>>>>>>>>> that only work with pure text but we also need to keep track of the
>>>>>>>>> formatting information related to the processed text. It is in general
>>>>>>>>> also valuable for us to be able to track every annotation back to the
>>>>>>>>> original XML to maintain provenance. Before embarking on this I would
>>>>>>>>> like to validate our approach with more experienced users since this is
>>>>>>>>> the first application we are building with UIMA.
>>>>>>>>>
>>>>>>>>> In the first step we would annotate every important element of the XML
>>>>>>>>> including formatting elements in the body. We maintain some DOM-like
>>>>>>>>> relationships between the body text and formatting annotations so that
>>>>>>>>> text formatting can be reproduced later with NLP annotations in some
>>>>>>>>> article viewer.
>>>>>>>>>
>>>>>>>>> Next we would in another AE produce a pure text view of the text
>>>>>>>>> annotations in the XML view that need to be NL analysed. In this new
>>>>>>>>> text view we would annotate the different text elements with references
>>>>>>>>> back to their counterpart in the original XML view so that we can trace
>>>>>>>>> back positions in the original XML and the formatting relations. This of
>>>>>>>>> course will require mapping NLP annotation offsets in the text view back
>>>>>>>>> to the XML view but the information should then be there to make this
>>>>>>>>> possible.
>>>>>>>>>
>>>>>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>>>>>> initially hoped would be necessary. We haven’t been able to find any
>>>>>>>>> examples of how this is usually done and the UIMA docs are vague
>>>>>>>>> regarding managing this kind of relationship across views. We would
>>>>>>>>> therefore really like to know if there is a simpler and better approach.
>>>>>>>>>
>>>>>>>>> Any feedback is greatly appreciated. Thanks.
>>>>>>>>>
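
For anyone reading this thread later: the offset behaviour requested in the 11 Mar message above can be illustrated with a small standalone sketch. This is not the HtmlConverter implementation and it does not use any UIMA or Ruta API; it only uses Python's standard xml.etree.ElementTree. The names `convert` and `EXAMPLE` and the default ". " separator are assumptions made for illustration. Each piece of text gets a span that excludes the separator placed before it, an enclosing element spans from its first to its last piece of text, and an element without any text falls back to its closest ancestor that has text (ultimately the whole document). Note that this sketch joins the pieces with the separator and does not append a trailing dot.

```python
import xml.etree.ElementTree as ET

# The journal-meta fragment quoted in the thread.
EXAMPLE = """<journal-meta>
<journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
<journal-title>Environmental Health Perspectives</journal-title>
<issn pub-type="ppub">0091-6765</issn>
<publisher>
<publisher-name>National Institute of Environmental Health
Sciences</publisher-name>
</publisher>
</journal-meta>"""


def convert(xml_text, separator=". "):
    """Return (plain_text, spans); spans holds (tag, begin, end) in document order."""
    root = ET.fromstring(xml_text)
    chunks = []      # whitespace-normalised text pieces, in document order
    spans = {}       # element -> [begin, end], or None if it holds no text
    parent_of = {}   # element -> parent element (None for the root)
    cursor = 0       # length of the plain text produced so far

    def emit(raw, open_elements):
        nonlocal cursor
        text = " ".join((raw or "").split())
        if not text:
            return
        if chunks:
            cursor += len(separator)  # separator precedes the chunk, outside its span
        begin, end = cursor, cursor + len(text)
        chunks.append(text)
        cursor = end
        for element in open_elements:  # grow the span of every enclosing element
            if spans[element] is None:
                spans[element] = [begin, end]
            else:
                spans[element][1] = end

    def walk(element, ancestors):
        spans[element] = None
        parent_of[element] = ancestors[-1] if ancestors else None
        open_elements = ancestors + [element]
        emit(element.text, open_elements)
        for child in element:
            walk(child, open_elements)
            emit(child.tail, open_elements)  # tail text belongs to `element`

    walk(root, [])
    plain = separator.join(chunks)

    resolved = []
    for element in spans:
        node = element
        while node is not None and spans[node] is None:
            node = parent_of[node]  # fall back to the closest ancestor with text
        begin, end = spans[node] if node is not None else (0, len(plain))
        resolved.append((element.tag, begin, end))
    return plain, resolved


if __name__ == "__main__":
    text, element_spans = convert(EXAMPLE)
    print(text)
    for tag, begin, end in element_spans:
        print(tag, begin, end, repr(text[begin:end]))
```

With the Parent/Child1/Child2 example from the same message, Child1 has no text of its own and therefore inherits the span of Parent, which coincides with the span of Child2 because "Some content." is the only text in scope. A real converter would additionally need to keep the mapping back to the offsets in the original XML view, which this sketch leaves out.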
