Hi,

the HtmlConverter was built to create an annotated document containing the plain text of the HTML or XML source. It is intended to remove all elements that would not be visible to someone looking at the rendered HTML, e.g., in a browser. It therefore removes a lot of text from the document and updates the offsets of the existing annotations. Text in the head of the HTML document, for example, is removed completely, so the annotations in those areas no longer have meaningful values for their begin and end offsets (features).

If we want to keep these annotations, the question arises which offsets they should get. Normally, I would assume that their begin and end are set to 0. However, this can be problematic when one wants to apply Ruta rules on the resulting CAS. (This is probably not really related to the task you want to solve, but since this is a component of the Ruta project, I have to take care that no problems arise.) Another option is to assign the offsets of the document annotation. We could also drop the offsets and use feature structures instead of annotations, but this is not intended by the corresponding type system.
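
To make the offset question concrete, here is a minimal sketch in plain Java. It is not the HtmlConverter code: the class OffsetRemapSketch, the EmptySpanPolicy enum and the keep array are hypothetical names used only for illustration. It shows how a span collapses to length 0 when all of its text is removed, and the two fallback choices mentioned above (0/0 or the span of the whole document).

```java
import java.util.Arrays;

/** Hypothetical sketch; this is not the HtmlConverter implementation. */
final class OffsetRemapSketch {

    /** Assumed policies for a span whose covered text was removed entirely. */
    enum EmptySpanPolicy { ZERO, DOCUMENT }

    /**
     * Returns a map where map[i] is the number of kept characters before
     * source position i; map[keep.length] is the length of the plain text.
     */
    static int[] buildOffsetMap(boolean[] keep) {
        int[] map = new int[keep.length + 1];
        int kept = 0;
        for (int i = 0; i < keep.length; i++) {
            map[i] = kept;
            if (keep[i]) {
                kept++;
            }
        }
        map[keep.length] = kept;
        return map;
    }

    /** Maps a source span [begin, end) to the plain text, applying the policy if it collapses. */
    static int[] remap(int begin, int end, int[] map, EmptySpanPolicy policy) {
        int newBegin = map[begin];
        int newEnd = map[end];
        if (newBegin < newEnd) {
            return new int[] {newBegin, newEnd};
        }
        // All covered text was removed: pick the fallback offsets explicitly.
        int textLength = map[map.length - 1];
        return policy == EmptySpanPolicy.ZERO
                ? new int[] {0, 0}
                : new int[] {0, textLength};
    }

    public static void main(String[] args) {
        String source = "<t>A</t><b>Hi</b>";
        boolean[] keep = new boolean[source.length()];
        keep[11] = true; // 'H'
        keep[12] = true; // 'i'
        int[] map = buildOffsetMap(keep);
        // The <b> element keeps its visible text.
        System.out.println(Arrays.toString(remap(8, 17, map, EmptySpanPolicy.ZERO)));     // [0, 2]
        // The "A" inside <t> is removed, so the span collapses.
        System.out.println(Arrays.toString(remap(3, 4, map, EmptySpanPolicy.ZERO)));      // [0, 0]
        System.out.println(Arrays.toString(remap(3, 4, map, EmptySpanPolicy.DOCUMENT)));  // [0, 2]
    }
}
```

With the ZERO policy the kept annotation has length 0, which is exactly the case that can cause trouble for rules; with the DOCUMENT policy it covers the whole plain text instead.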

I am not sure that I fully understood your use case, e.g., the part about the sentences. Could you provide a minimal example of the input and the desired output?

Best,

Peter

On 10.03.2015 at 13:54, Mario Gazzo wrote:
Thanks, I can of course open an issue for this.

I have been playing with a modified version of the HtmlConverter, which is why
my reply is delayed. I disabled the ‘inBody’ flag inside the
HTMLConverterVisitor to get an idea of what the effects might be. It pretty
much did what I thought I wanted, except that there are no clear sentence
boundaries between many of the metadata strings. Most of them are not really
meaningful for NL processing, but there are a few we would want to analyse, and
the sentence separation is now gone. I have been looking at some of the
conversion and line break options to get around this, but I haven't found a
good approach yet. I really only want to introduce some sentence separation
like “. “ between different tag content outside the body.
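
A minimal sketch of the separator idea in plain Java, independent of UIMA. The class SeparatorJoinSketch and its join method are hypothetical names, and the rule for when a fragment already ends a sentence is an assumption. It joins the text of several elements with ". " and records where each fragment ends up, so that annotations could later be placed on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of joining metadata fragments with a sentence separator. */
final class SeparatorJoinSketch {

    static final String SEPARATOR = ". ";

    /**
     * Joins the non-empty fragments and appends the [begin, end) offsets of
     * each kept fragment in the joined text to spansOut.
     */
    static String join(List<String> fragments, List<int[]> spansOut) {
        StringBuilder text = new StringBuilder();
        for (String fragment : fragments) {
            String trimmed = fragment.trim();
            if (trimmed.isEmpty()) {
                continue; // whitespace-only element content adds nothing
            }
            if (text.length() > 0) {
                char last = text.charAt(text.length() - 1);
                // Assumption: '.', '!' and '?' already mark a sentence end.
                boolean endsSentence = last == '.' || last == '!' || last == '?';
                text.append(endsSentence ? " " : SEPARATOR);
            }
            int begin = text.length();
            text.append(trimmed);
            spansOut.add(new int[] {begin, text.length()});
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<int[]> spans = new ArrayList<>();
        String joined = join(Arrays.asList("A title", " ", "An abstract.", " Keywords"), spans);
        System.out.println(joined); // A title. An abstract. Keywords
        for (int[] span : spans) {
            System.out.println(span[0] + "-" + span[1] + ": " + joined.substring(span[0], span[1]));
        }
    }
}
```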

I am not sure I understand your offset question. Would you mind elaborating on
it? Our documents are in XML with a single body element containing HTML.


On 07 Mar 2015, at 17:33, Peter Klügl <[email protected]> wrote:

Hi,

there is no way yet to customize this behavior. The HtmlConverter only retains
annotations with a length > 0, since annotations with length == 0 are rather
problematic and should be avoided.

I can add a configuration parameter for keeping these annotations if you want
(best open an issue for it). What should the offsets of the annotations for
elements in the head of the HTML document be? 0, those of the first token, or
those of the document annotation?

Best,

Peter


On 06.03.2015 at 14:00, Mario Gazzo wrote:
We conducted some experiments with both the HtmlAnnotator and the HtmlConverter 
but we ran into an issue with the converter. It appears to only convert tag 
annotations that surround or are inside the body tag. Metadata elements like 
citations are ignored. The only way to get around this seems to be by forking
and modifying the codebase, which I would like to avoid. Both modules seem otherwise
very useful to us but I am looking for a better approach to solve this issue. 
Is there some way to customise this behaviour without code modifications?

Your input is appreciated, thanks.


On 18 Feb 2015, at 23:03, Mario Gazzo <[email protected]> wrote:

Thanks. Looks interesting, seems that it could fit our use case. We will have a 
closer look at it.

On 18 Feb 2015, at 21:58, Peter Klügl <[email protected]> wrote:

Hi,

you might want to take a look at two analysis engines of UIMA Ruta: 
HtmlAnnotator and HtmlConverter [1]

The former creates annotations for HTML elements and therefore also for XML
tags. The latter creates a new view with only the plain text and adds the
existing annotations while adapting their offsets to the new document.

Best,

Peter

[1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html

On 18.02.2015 at 21:46, Mario Gazzo wrote:
We are starting to use the UIMA framework for NL processing of article text,
which is usually stored with metadata in some XML format. We need to extract
text elements to be processed by various NL analysis engines that only work
with pure text, but we also need to keep track of the formatting information
related to the processed text. In general it is also valuable for us to be able
to track every annotation back to the original XML to maintain provenance.
Before embarking on this, I would like to validate our approach with more
experienced users, since this is the first application we are building with UIMA.

In the first step we would annotate every important element of the XML 
including formatting elements in the body. We maintain some DOM-like 
relationships between the body text and formatting annotations so that text 
formatting can be reproduced later with NLP annotations in some article viewer.

Next, in another AE, we would produce a pure text view of the text annotations
in the XML view that need to be NL analysed. In this new text view we would
annotate the different text elements with references back to their counterpart 
in the original XML view so that we can trace back positions in the original 
XML and the formatting relations. This of course will require mapping NLP 
annotation offsets in the text view back to the XML view but the information 
should then be there to make this possible.
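
The back-mapping described above can be sketched in plain Java, independent of UIMA. This is only an illustration under stated assumptions: the class BackMapSketch and its methods are hypothetical names, text is copied character by character from the XML, and entities such as &amp;amp; are not decoded. While the text view is built, the sketch remembers the XML position of every copied character, so any span in the text view can be traced back to the XML.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a text view that remembers where each character came from. */
final class BackMapSketch {

    private final StringBuilder text = new StringBuilder();
    // sourceIndex.get(i) is the XML offset of character i of the text view.
    private final List<Integer> sourceIndex = new ArrayList<>();

    /** Copies xml[begin, end) into the text view and records the origin of each character. */
    void appendFromSource(String xml, int begin, int end) {
        for (int i = begin; i < end; i++) {
            text.append(xml.charAt(i));
            sourceIndex.add(i);
        }
    }

    String getText() {
        return text.toString();
    }

    /** Maps a non-empty text-view span [begin, end) to the XML span that covers it. */
    int[] toSource(int begin, int end) {
        if (begin < 0 || end > sourceIndex.size() || begin >= end) {
            throw new IllegalArgumentException("invalid span " + begin + "-" + end);
        }
        return new int[] {sourceIndex.get(begin), sourceIndex.get(end - 1) + 1};
    }

    public static void main(String[] args) {
        String xml = "<p>Hello <b>world</b></p>";
        BackMapSketch view = new BackMapSketch();
        view.appendFromSource(xml, 3, 9);   // "Hello "
        view.appendFromSource(xml, 12, 17); // "world"
        System.out.println(view.getText()); // Hello world
        int[] span = view.toSource(6, 11);  // "world" in the text view
        System.out.println(xml.substring(span[0], span[1])); // world
    }
}
```

A text span that crosses a removed tag maps to an XML span that includes the tag, e.g. text offsets 3-8 ("lo wo") map to XML offsets 6-14 ("lo <b>wo").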

This approach requires somewhat more handcrafted bookkeeping than we initially
hoped would be necessary. We haven’t been able to find any examples of how this
is usually done, and the UIMA docs are vague about managing these kinds of
relationships across views. We would therefore really like to know if there is
a simpler and better approach.

Any feedback is greatly appreciated. Thanks.

