Thanks, I can of course open an issue for this. I have been playing with a modified version of the HTMLConverter, which is why my reply is delayed. I disabled the ‘inBody’ flag inside the HTMLConverterVisitor to get an idea of what the effects might be. It pretty much did what I thought I wanted, except that there are no clear sentence boundaries between many of the metadata strings. Most of them are not really meaningful for NL processing, but there are a few we would want to analyse, and for those the sentence separation is now gone. I have been looking at some of the conversion and line break options to get around this, but I haven't found a good approach yet. I really only want to introduce some sentence separation like “. ” between the content of different tags outside the body.
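To make concrete what I mean by sentence separation, here is a minimal sketch in plain Python. It is not UIMA or Ruta code and does not reflect how the HtmlConverter works internally; the function name `join_with_boundaries`, the `SEPARATOR` constant, and the sample strings are all made up for illustration. It joins metadata strings with a boundary marker and records where each string ends up, so annotations could still be anchored with a length > 0.

```python
from typing import List, Tuple

SEPARATOR = ". "  # hypothetical boundary marker placed between metadata strings


def join_with_boundaries(pieces: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Concatenate metadata strings with a sentence boundary between them.

    Returns the joined text and, for each non-empty piece, its (begin, end)
    offsets in the joined text.
    """
    out: List[str] = []
    spans: List[Tuple[int, int]] = []
    pos = 0
    for piece in pieces:
        text = piece.strip()
        if not text:
            continue  # skip empty elements so no zero-length span is produced
        if out:
            # out[-1] is always the previous piece here; avoid doubling
            # punctuation if it already ends a sentence.
            sep = " " if out[-1][-1] in ".!?" else SEPARATOR
            out.append(sep)
            pos += len(sep)
        out.append(text)
        spans.append((pos, pos + len(text)))
        pos += len(text)
    return "".join(out), spans


# Invented sample metadata: a title, an empty element, a journal, an author.
text, spans = join_with_boundaries(["A Title", "", "Some Journal.", "J. Doe"])
print(text)
print(spans)
```

The separator is only added between pieces, never inside them, so the recorded spans cover exactly the original strings and a sentence splitter sees a boundary after each one.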
I am not sure I understand your offset question. Would you mind elaborating on it? Our documents are in XML with a single body element containing HTML.

> On 07 Mar 2015, at 17:33, Peter Klügl <[email protected]> wrote:
>
> Hi,
>
> there is no way yet to customize this behavior. The HtmlConverter only
> retains annotations of a length > 0, since annotations with length == 0 are
> rather problematic and should be avoided.
>
> I can add a configuration parameter for keeping these annotations if you want
> (best open an issue for it). What should be the offsets of the annotations
> for elements in the head of the html document? 0, those of the first token, or
> those of the document annotation?
>
> Best,
>
> Peter
>
> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>> We conducted some experiments with both the HtmlAnnotator and the
>> HtmlConverter, but we ran into an issue with the converter. It appears to
>> only convert tag annotations that surround or are inside the body tag.
>> Metadata elements like citations are ignored. The only way to get around
>> this seems to be by forking and modifying the codebase, which I would like
>> to avoid. Both modules seem otherwise very useful to us, but I am looking
>> for a better approach to solve this issue. Is there some way to customise
>> this behaviour without code modifications?
>>
>> Your input is appreciated, thanks.
>>
>>> On 18 Feb 2015, at 23:03, Mario Gazzo <[email protected]> wrote:
>>>
>>> Thanks. Looks interesting, seems that it could fit our use case. We will
>>> have a closer look at it.
>>>
>>>> On 18 Feb 2015, at 21:58, Peter Klügl <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> you might want to take a look at two analysis engines of UIMA Ruta:
>>>> HtmlAnnotator and HtmlConverter [1]
>>>>
>>>> The former one creates annotations for HTML elements and therefore also
>>>> for XML tags. The latter one creates a new view with only the plain text
>>>> and adds existing annotations while adapting their offsets to the new
>>>> document.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> [1]
>>>> http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>>
>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>> We are starting to use the UIMA framework for NL processing of article
>>>>> text, which is usually stored with metadata in some XML format. We need
>>>>> to extract text elements to be processed by various NL analysis engines
>>>>> that only work with pure text, but we also need to keep track of the
>>>>> formatting information related to the processed text. It is in general
>>>>> also valuable for us to be able to track every annotation back to the
>>>>> original XML to maintain provenance. Before embarking on this I would
>>>>> like to validate our approach with more experienced users, since this is
>>>>> the first application we are building with UIMA.
>>>>>
>>>>> In the first step we would annotate every important element of the XML,
>>>>> including formatting elements in the body. We maintain some DOM-like
>>>>> relationships between the body text and formatting annotations so that
>>>>> text formatting can be reproduced later with NLP annotations in some
>>>>> article viewer.
>>>>>
>>>>> Next we would, in another AE, produce a pure text view of the text
>>>>> annotations in the XML view that need to be NL analysed. In this new
>>>>> text view we would annotate the different text elements with references
>>>>> back to their counterparts in the original XML view so that we can trace
>>>>> back positions in the original XML and the formatting relations. This of
>>>>> course will require mapping NLP annotation offsets in the text view back
>>>>> to the XML view, but the information should then be there to make this
>>>>> possible.
>>>>>
>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>> initially hoped would be necessary. We haven’t been able to find any
>>>>> examples of how this is usually done, and the UIMA docs are vague
>>>>> regarding managing this kind of relationship across views. We would
>>>>> therefore really like to know if there is a simpler and better approach.
>>>>>
>>>>> Any feedback is greatly appreciated. Thanks.
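
For reference, the offset bookkeeping described in the original question (mapping positions in the plain text view back to the XML view) can be sketched roughly as follows. This is plain Python rather than UIMA code, and every name in it (`strip_tags`, `to_xml_offset`, `to_xml_span`, the naive `TAG` pattern, the sample markup) is hypothetical; a real implementation would use a proper parser and CAS views.

```python
import re
from bisect import bisect_right
from typing import List, Tuple

TAG = re.compile(r"<[^>]+>")  # naive tag matcher, for illustration only


def strip_tags(xml: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the plain text plus a table of (text_offset, xml_offset) pairs,
    one pair per run of characters copied over unchanged."""
    parts: List[str] = []
    table: List[Tuple[int, int]] = []
    text_pos = 0
    xml_pos = 0
    for m in TAG.finditer(xml):
        run = xml[xml_pos:m.start()]
        if run:
            table.append((text_pos, xml_pos))
            parts.append(run)
            text_pos += len(run)
        xml_pos = m.end()
    tail = xml[xml_pos:]
    if tail:
        table.append((text_pos, xml_pos))
        parts.append(tail)
    return "".join(parts), table


def to_xml_offset(table: List[Tuple[int, int]], text_offset: int) -> int:
    """Map the offset of one character in the plain text back to the XML."""
    if not table or text_offset < 0:
        raise ValueError("offset is outside the converted text")
    starts = [t for t, _ in table]
    i = bisect_right(starts, text_offset) - 1  # run containing the character
    t, x = table[i]
    return x + (text_offset - t)


def to_xml_span(table: List[Tuple[int, int]], begin: int, end: int) -> Tuple[int, int]:
    """Map a non-empty [begin, end) span; the end is mapped via its last
    character so it does not jump past tags that follow the span."""
    if end <= begin:
        raise ValueError("span must have length > 0")
    return to_xml_offset(table, begin), to_xml_offset(table, end - 1) + 1


xml = "<p>Hello <b>world</b>!</p>"
text, table = strip_tags(xml)
begin, end = to_xml_span(table, 6, 11)  # the span of "world" in the plain text
print(text, table, xml[begin:end])
```

A span that crosses a tag boundary maps to an XML range that contains markup, so the mapped range is a position reference rather than well-formed XML on its own.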
