Hi,
there is currently no way to customize this behavior. The HtmlConverter only
retains annotations with a length > 0, since annotations with length == 0 are
rather problematic and should be avoided.
I can add a configuration parameter for keeping these annotations if you
want (it would be best to open an issue for it). What should the offsets of
the annotations for elements in the head of the HTML document be? 0, those of
the first token, or those of the document annotation?
Best,
Peter
On 06.03.2015 at 14:00, Mario Gazzo wrote:
We conducted some experiments with both the HtmlAnnotator and the HtmlConverter
but we ran into an issue with the converter. It appears to only convert tag
annotations that surround or are inside the body tag. Metadata elements like
citations are ignored. The only way to get around this seems to be by forking
and modifying the codebase, which I would like to avoid. Both modules seem otherwise
very useful to us but I am looking for a better approach to solve this issue.
Is there some way to customise this behaviour without code modifications?
Your input is appreciated, thanks.
On 18 Feb 2015, at 23:03 , Mario Gazzo <[email protected]> wrote:
Thanks. Looks interesting, seems that it could fit our use case. We will have a
closer look at it.
On 18 Feb 2015, at 21:58 , Peter Klügl <[email protected]> wrote:
Hi,
you might want to take a look at two analysis engines of UIMA Ruta:
HtmlAnnotator and HtmlConverter [1]
The former creates annotations for HTML elements and therefore also for XML
tags. The latter creates a new view with only the plain text and adds the
existing annotations while adapting their offsets to the new document.
Best,
Peter
[1]
http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
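The offset adaptation described above can be illustrated with a small, language-neutral sketch. It is written in plain Python and does not use the UIMA or Ruta API; the function names `strip_tags` and `convert` and the tuple layout of the annotations are assumptions made for this example, not the HtmlConverter implementation. It also shows why annotations on elements without text, such as those in the head, end up with length 0 and are dropped.

```python
import re

# Very rough tag matcher, good enough for a sketch (no comments, CDATA, etc.).
TAG = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Return (plain_text, offset_map).

    offset_map[i] is the plain-text offset that corresponds to markup
    offset i. It has len(markup) + 1 entries so end offsets can be mapped.
    """
    plain = []
    offset_map = []
    pos = 0
    for match in TAG.finditer(markup):
        # Characters before the tag are copied one to one.
        for ch in markup[pos:match.start()]:
            offset_map.append(len(plain))
            plain.append(ch)
        # Every character of the tag collapses onto the current text offset.
        offset_map.extend([len(plain)] * (match.end() - match.start()))
        pos = match.end()
    for ch in markup[pos:]:
        offset_map.append(len(plain))
        plain.append(ch)
    offset_map.append(len(plain))
    return "".join(plain), offset_map


def convert(annotations, offset_map):
    """Map (name, begin, end) annotations into the plain-text view.

    Annotations that collapse to length 0 are dropped, mirroring the
    behavior discussed in this thread.
    """
    converted = []
    for name, begin, end in annotations:
        new_begin, new_end = offset_map[begin], offset_map[end]
        if new_end > new_begin:
            converted.append((name, new_begin, new_end))
    return converted
```

For the input `<head><meta/></head><body><b>Hi</b> there</body>`, an annotation covering the `meta` element maps to an empty span and is discarded, while the ones covering `b` and `body` keep their text.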
On 18.02.2015 at 21:46, Mario Gazzo wrote:
We are starting to use the UIMA framework for NL processing of article text, which
is usually stored with metadata in some XML format. We need to extract text
elements to be processed by various NL analysis engines that only work with
pure text but we also need to keep track of the formatting information related
to the processed text. It is in general also valuable for us to be able to
track every annotation back to the original XML to maintain provenance. Before
embarking on this I would like to validate our approach with more experienced
users, since this is the first application we are building with UIMA.
In the first step we would annotate every important element of the XML
including formatting elements in the body. We maintain some DOM-like
relationships between the body text and formatting annotations so that text
formatting can be reproduced later with NLP annotations in some article viewer.
Next, in another AE, we would produce a pure text view of the text annotations
in the XML view that need to be NL analysed. In this new text view we would
annotate the different text elements with references back to their counterpart
in the original XML view so that we can trace back positions in the original
XML and the formatting relations. This of course will require mapping NLP
annotation offsets in the text view back to the XML view but the information
should then be there to make this possible.
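The bookkeeping for mapping text view offsets back to the XML view can be sketched as follows. This is plain Python rather than UIMA code, and the names `build_text_view` and `to_xml_offset`, the segment tuples, and the newline separator are all assumptions made for illustration. In a real pipeline the segments would more likely be stored as annotations in the text view that reference their counterparts in the XML view.

```python
from bisect import bisect_right


def build_text_view(xml, spans, separator="\n"):
    """Concatenate the given XML spans into a plain-text view.

    spans is an ordered list of (xml_begin, xml_end) pairs for the text
    elements to analyse. Returns (text, segments), where each segment is
    (text_begin, xml_begin, length). The separator should be non-empty so
    that segment boundaries stay unambiguous.
    """
    parts = []
    segments = []
    text_pos = 0
    for xml_begin, xml_end in spans:
        if parts:
            text_pos += len(separator)
        length = xml_end - xml_begin
        segments.append((text_pos, xml_begin, length))
        parts.append(xml[xml_begin:xml_end])
        text_pos += length
    return separator.join(parts), segments


def to_xml_offset(segments, text_offset):
    """Map an offset in the text view back to an offset in the XML."""
    starts = [segment[0] for segment in segments]
    # Last segment whose text_begin is <= text_offset.
    index = bisect_right(starts, text_offset) - 1
    if index < 0:
        raise ValueError("offset lies before the first segment")
    text_begin, xml_begin, length = segments[index]
    if text_offset > text_begin + length:
        raise ValueError("offset lies outside every segment")
    return xml_begin + (text_offset - text_begin)
```

An NLP annotation found at `(begin, end)` in the text view is then traced back by calling `to_xml_offset` on both offsets. This only works for annotations that stay inside one segment; an annotation that spans several segments would need to be split or handled separately.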
This approach requires somewhat more handcrafted bookkeeping than we initially
hoped would be necessary. We haven’t been able to find any examples of how this
is usually done and the UIMA docs are vague regarding managing these kinds of
relationships across views. We would therefore really like to know if there is
a simpler and better approach.
Any feedback is greatly appreciated. Thanks.