Roman--

I think you basically have two choices.

1. Process the HTML as is.  Store it in the CAS and use only annotators that 
understand HTML.  Most don't, but if you own all the annotators in your 
application, then it's possible.

Regarding your snippets, keep in mind that HTML tags are not reliable 
boundaries for tokens, sentences, and paragraphs.  For example, an HTML tag can 
occur in the middle of a word to indicate a change in font.  On the other hand, 
some tags do represent boundaries that aren't indicated with punctuation.  For 
example, table cells should be paragraph boundaries, even if the text within 
isn't terminated with a '.' or '?'.
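
To make the two cases concrete, here is a rough sketch in plain Java.  It is 
only an illustration: the class name, the regex, and the list of tags treated 
as paragraph-level are my own assumptions, and a regex is not a robust way to 
read real HTML (no entities, comments, or scripts are handled).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBoundaries {
    // Assumption: an illustrative, incomplete set of tags that end a paragraph.
    private static final Set<String> BLOCK_TAGS =
            Set.of("p", "div", "td", "th", "tr", "li", "br");

    // Matches an opening or closing tag and captures its name.
    private static final Pattern TAG =
            Pattern.compile("<\\s*/?\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    /** Splits HTML into paragraph strings, ignoring inline tags entirely. */
    static List<String> paragraphs(String html) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            current.append(html, last, m.start());
            last = m.end();
            if (BLOCK_TAGS.contains(m.group(1).toLowerCase())) {
                flush(current, result);   // block tag: paragraph boundary
            }
            // Inline tag (b, i, font, ...): no boundary, so the text on
            // both sides of the tag is joined into one word.
        }
        current.append(html.substring(last));
        flush(current, result);
        return result;
    }

    // Adds the accumulated text as a paragraph unless it is blank.
    private static void flush(StringBuilder current, List<String> result) {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            result.add(text);
        }
        current.setLength(0);
    }
}
```

With input such as "<td>un<b>believ</b>able</td><td>second cell</td>", the 
bold tag inside the word does not split it, while each table cell becomes its 
own paragraph even though neither ends with punctuation.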

By the way, the ICU Unicode library has a pretty good locale-sensitive 
tokenizer for plain text (its BreakIterator classes find word and sentence 
boundaries).  See http://icu-project.org
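
As a small sketch of boundary-based tokenizing, the following uses the JDK's 
own java.text.BreakIterator so that it runs without extra jars.  ICU4J's 
com.ibm.icu.text.BreakIterator has a similar API, but its rules differ, so 
results are not guaranteed to match.  The class name and the "keep segments 
with a letter or digit" filter are my own assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordTokens {
    /** Returns the word-like segments of text, using locale-specific rules. */
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // The iterator also reports whitespace and punctuation segments;
            // keep only segments containing at least one letter or digit.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

In an annotator you would keep the start and end offsets rather than the 
substrings, since those offsets are what a token annotation needs.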


2. Parse the HTML into plain text, store that in the CAS, and proceed as usual.  
This is the standard approach, since essentially all annotators understand 
plain text.

With #2, you may still want to retain the HTML information by turning it into 
CAS annotations.  I describe this here: 
http://cwiki.apache.org/UIMA/uima-sandbox-components.html under "Document 
Model".

For parsing HTML, I've been using HTMLCleaner 
(http://htmlcleaner.sourceforge.net) to convert invalid HTML into well-formed 
XML, and then the Woodstox StAX XML parser (http://woodstox.codehaus.org) to 
generate the structure annotations.
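
The second step could look roughly like the sketch below.  It assumes the 
input is already well-formed XML (for example, the output of a cleaner), and 
it uses the standard javax.xml.stream API from the JDK, which Woodstox also 
implements.  The StructureExtractor and Element classes are hypothetical 
stand-ins: in a real pipeline the text would go into the CAS and each Element 
would become an annotation of a type defined in your type system.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StructureExtractor {
    /** Hypothetical structure annotation: element name plus text offsets. */
    static final class Element {
        final String name;
        final int begin;
        int end;
        Element(String name, int begin) { this.name = name; this.begin = begin; }
    }

    final StringBuilder text = new StringBuilder();    // the plain text
    final List<Element> elements = new ArrayList<>();  // in document order

    /** Collects plain text and element offsets from well-formed XML. */
    static StructureExtractor parse(String xml) {
        StructureExtractor out = new StructureExtractor();
        Deque<Element> open = new ArrayDeque<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // The element begins at the current end of the text.
                    Element e = new Element(reader.getLocalName(), out.text.length());
                    out.elements.add(e);
                    open.push(e);
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    out.text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    // The innermost open element ends at the current offset.
                    open.pop().end = out.text.length();
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return out;
    }
}
```

Attributes are ignored here to keep the sketch short.  If you want to 
regenerate the original markup later, you would also have to store them on 
the structure annotations.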

Of course, your annotators have to understand these structure annotations to 
use them.  But even if they don't, you can at least regenerate the HTML from 
the structure annotations, and even convert the offsets of other annotations to 
point into the HTML.  
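
Here is a minimal sketch of that regeneration and offset conversion, under 
some assumptions of mine: the Span class is a hypothetical stand-in for a 
structure annotation, spans arrive in document order and are properly nested, 
attributes are not stored, and an empty element nested inside another empty 
element comes out as a sibling because the offsets alone cannot tell the two 
cases apart.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class MarkupRegenerator {
    /** Hypothetical structure annotation: name plus [begin, end) offsets. */
    static final class Span {
        final String name;
        final int begin;
        final int end;
        Span(String name, int begin, int end) {
            this.name = name; this.begin = begin; this.end = end;
        }
    }

    final String markup;
    final int[] beginMap; // plain-text offset -> markup offset, for a begin
    final int[] endMap;   // plain-text offset -> markup offset, for an end

    private MarkupRegenerator(String markup, int[] beginMap, int[] endMap) {
        this.markup = markup; this.beginMap = beginMap; this.endMap = endMap;
    }

    /** Rebuilds markup from plain text and nested, document-ordered spans. */
    static MarkupRegenerator regenerate(String text, List<Span> spans) {
        StringBuilder out = new StringBuilder();
        int[] beginMap = new int[text.length() + 1];
        int[] endMap = new int[text.length() + 1];
        Deque<Span> open = new ArrayDeque<>();
        int next = 0;
        for (int pos = 0; pos <= text.length(); pos++) {
            endMap[pos] = out.length();       // before any closing tags
            while (!open.isEmpty() && open.peek().end == pos) {
                out.append("</").append(open.pop().name).append('>');
            }
            while (next < spans.size() && spans.get(next).begin == pos) {
                Span s = spans.get(next++);
                out.append('<').append(s.name).append('>');
                if (s.end == pos) {           // empty element: close at once
                    out.append("</").append(s.name).append('>');
                } else {
                    open.push(s);
                }
            }
            beginMap[pos] = out.length();     // after any opening tags
            if (pos < text.length()) {
                out.append(escape(text.charAt(pos)));
            }
        }
        return new MarkupRegenerator(out.toString(), beginMap, endMap);
    }

    // Escapes the characters that would otherwise be read as markup.
    private static String escape(char c) {
        switch (c) {
            case '&': return "&amp;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            default:  return String.valueOf(c);
        }
    }
}
```

For a named entity annotated at [begin, end) in the plain text, the matching 
region of the regenerated markup is [beginMap[begin], endMap[end]).  If an 
annotation crosses an element boundary, that region will contain unbalanced 
tags, so a visualization has to split it at those boundaries.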

Personally, I think this is a big hole in UIMA.  There should be a standard way 
to represent document structure (HTML, XML, Word, etc.) in UIMA.


Greg Holmberg


 -------------- Original message ----------------------
From: Roman Klinger <[EMAIL PROTECTED]>
> Dear UIMA users,
> 
> I am interested in using html files in the UIMA pipeline in a way that I 
> can keep track of found named entities in the files. In other words: I 
> do not want to convert the html to text and process these files but use 
> the original html tags e.g. for visualization enriched with found named 
> entities.
> 
> My plan is to use an html parser to find the text snippets of interest 
> in html files but I am not sure about the integration in UIMA. Did 
> anyone implement something like that already? In which way?
> 
> Thanks in advance,
> Roman
> 
> -- 
> Roman Klinger
> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> Schloss Birlinghoven
> D-53754 Sankt Augustin
> Tel.: +49-2241-14-2360
> Fax.: +49-2241-14-4-2360
> email: [EMAIL PROTECTED]
> 
