There's a variation on Greg's #2, which may or may not apply
to Roman's issue.  Use an HTML parser to extract plain text from
the source documents, but keep an offset mapping from the extracted
text back into the original HTML.  Process the text, then use the
mapping to carry the results back to the HTML original.  This can
be used to highlight entities in the original HTML, which I
suspect is what Roman is after.
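
As a minimal sketch of the mapping idea (the class is hypothetical,
and its naive tag-stripping scan stands in for a real HTML parser;
among other things it ignores character entities like &amp;):

    import java.util.ArrayList;
    import java.util.List;

    public class OffsetMappingExtractor {
        public final StringBuilder text = new StringBuilder();
        // textToHtml.get(i) = offset in the HTML of extracted character i
        public final List<Integer> textToHtml = new ArrayList<Integer>();

        public void extract(String html) {
            boolean inTag = false;
            for (int i = 0; i < html.length(); i++) {
                char c = html.charAt(i);
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    text.append(c);        // keep the character ...
                    textToHtml.add(i);     // ... and where it came from
                }
            }
        }

        // Map a non-empty [begin, end) span in the extracted text
        // back into the HTML source.
        public int[] toHtmlSpan(int begin, int end) {
            return new int[] { textToHtml.get(begin),
                               textToHtml.get(end - 1) + 1 };
        }
    }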

Doing this is not trivial.  What we have done is use an HTML
parser that creates a DOM tree of the document.  We then walk the
DOM tree and extract the text we're interested in, keeping a
record of which text is associated with which node.  After UIMA
processing, we insert the annotations of interest directly into
the DOM tree as new nodes.  Again, this is not trivial, as entities
may overlap the boundaries of the original nodes.  You can then
render the modified DOM tree as HTML for display.
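
For the simple, non-overlapping case, where an entity falls entirely
within a single DOM text node, the insertion step can look roughly
like this (standard org.w3c.dom API; the class name is made up, and
the overlap handling, which is the genuinely hard part, is omitted):

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Text;

    public class DomHighlighter {
        // Wrap characters [begin, end) of textNode in <span class="entity">.
        // Assumes the entity lies entirely within this one text node.
        public static void highlight(Document doc, Text textNode,
                                     int begin, int end) {
            Text entity = textNode.splitText(begin);  // textNode keeps [0, begin)
            entity.splitText(end - begin);            // entity keeps [begin, end)
            Element span = doc.createElement("span");
            span.setAttribute("class", "entity");
            entity.getParentNode().replaceChild(span, entity);
            span.appendChild(entity);                 // span now wraps the entity
        }
    }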

--Thilo

[EMAIL PROTECTED] wrote:
Roman--


I think you basically have two choices.

1. Process the HTML as is.  Store it in the CAS and use only annotators that 
understand HTML.  Most don't, but if you own all the annotators in your 
application, it's possible.

Regarding your snippets, keep in mind that HTML tags are not good boundaries 
for tokens, sentences, and paragraphs.  For example, an HTML tag can occur in 
the middle of a word to indicate a change in font.  On the other hand, some 
tags do represent boundaries that aren't indicated with punctuation; for 
example, table cells should be paragraph boundaries, even if the text within 
isn't terminated with a '.' or '?'.

By the way, the ICU Unicode library has a pretty good language-specific 
tokenizer for plain text.  See http://icu-project.org
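
For example, with ICU4J on the classpath, language-aware word breaking
looks like this:

    import com.ibm.icu.text.BreakIterator;
    import java.util.Locale;

    public class IcuTokenize {
        public static void main(String[] args) {
            BreakIterator wb = BreakIterator.getWordInstance(Locale.ENGLISH);
            String s = "Plain text for ICU to break into words.";
            wb.setText(s);
            int start = wb.first();
            for (int end = wb.next(); end != BreakIterator.DONE;
                 start = end, end = wb.next()) {
                String token = s.substring(start, end).trim();
                if (token.length() > 0) {
                    System.out.println(token);  // one token per line
                }
            }
        }
    }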


2. Parse the HTML into plain text, store that in the CAS, and proceed as usual.  
This is the standard case, so all annotators understand plain text.

In #2, you may still want to retain the HTML information, so you can turn it into CAS 
annotations.   I describe this here: 
http://cwiki.apache.org/UIMA/uima-sandbox-components.html under "Document 
Model".

For parsing HTML, I've been using HTMLCleaner 
(http://htmlcleaner.sourceforge.net) to convert non-valid HTML to XML, and then 
the Woodstox StAX XML parser (http://woodstox.codehaus.org) to generate the 
structure annotations.
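
Condensed, that pipeline looks something like this (the HtmlCleaner serializer 
API has varied between versions, so treat the exact calls as an approximation; 
Woodstox is picked up automatically by XMLInputFactory when it's on the 
classpath):

    import java.io.StringReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.SimpleXmlSerializer;
    import org.htmlcleaner.TagNode;

    public class HtmlToXml {
        public static void main(String[] args) throws Exception {
            String html = "<p>Hello <b>world";  // deliberately invalid HTML
            HtmlCleaner cleaner = new HtmlCleaner();
            TagNode root = cleaner.clean(html);
            String xml = new SimpleXmlSerializer(
                    cleaner.getProperties()).getAsString(root);

            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    // This is where you'd create a structure annotation
                    // for the element at the current text offset.
                    System.out.println("element: " + r.getLocalName());
                }
            }
        }
    }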

Of course, your annotators have to understand these structure annotations to 
use them.  But even if they don't, you can at least regenerate the HTML from 
the structure annotations, and even convert the offsets of other annotations 
to point into the HTML.

Personally, I think this is a big hole in UIMA.  There should be a standard way 
to represent document structure (HTML, XML, Word, etc.) in UIMA.


Greg Holmberg


 -------------- Original message ----------------------
From: Roman Klinger <[EMAIL PROTECTED]>
Dear UIMA users,

I am interested in using HTML files in the UIMA pipeline in a way that lets me keep track of the named entities found in the files. In other words, I do not want to convert the HTML to text and process those files; instead, I want to use the original HTML tags, e.g. for a visualization enriched with the named entities that were found.

My plan is to use an HTML parser to find the text snippets of interest in the HTML files, but I am not sure how to integrate this into UIMA. Has anyone implemented something like this already? If so, how?

Thanks in advance,
Roman

--
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: [EMAIL PROTECTED]

