Re: plain text or HTML in the CAS?

[EMAIL PROTECTED] Thu, 26 Apr 2007 14:48:28 -0700

 -------------- Original message ----------------------
From: "Adam Lally" <[EMAIL PROTECTED]>
> On 4/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > That leaves me with a mix of HTML and plain text annotators, annotating
> > different artifacts.  The problem with that is I can't compare the
> > annotations.  A plain text annotator and an HTML annotator may have
> > annotated the same logical text (same word, for example), but I have no
> > way of determining that.  So that means I can't answer questions that
> > require both annotations.
> >
> 
> I have heard of people doing HTML detagging and meanwhile recording
> offset mappings that will allow them to determine which offsets in the
> detagged text correspond to which offsets in the original HTML.  After
> analyzing the detagged text, this offset mapping can then be used, for
> example, to highlight annotated spans in the HTML.  I don't have the
> code for this, though.  It sounds like it could be a very useful
> utility for UIMA users.
> 
> If you had that kind of information produced by your HTML detagger,
> would it address your requirements?
> 
> -Adam


That would work.
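
A minimal sketch of the offset-mapping idea Adam describes, for illustration only. 
The class name OffsetMappingDetagger and its methods are hypothetical, not part of 
UIMA or any parser mentioned here. It assumes very simple markup: it does not 
handle entities, comments, script blocks, or '>' inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: strips tags and records where each kept character came from. */
public class OffsetMappingDetagger {
    private final StringBuilder text = new StringBuilder();
    // sourceOffsets.get(i) is the HTML offset of character i of the detagged text.
    private final List<Integer> sourceOffsets = new ArrayList<>();

    public OffsetMappingDetagger(String html) {
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;                 // start of a tag: stop copying
            } else if (c == '>' && inTag) {
                inTag = false;                // end of a tag: resume copying
            } else if (!inTag) {
                text.append(c);               // keep the character...
                sourceOffsets.add(i);         // ...and remember its HTML offset
            }
        }
    }

    /** The detagged text that a plain text annotator would see. */
    public String getText() {
        return text.toString();
    }

    /** Maps a [begin, end) span in the detagged text to a [begin, end) span in the HTML. */
    public int[] toHtmlSpan(int begin, int end) {
        if (begin < 0 || end > sourceOffsets.size() || begin >= end) {
            throw new IllegalArgumentException("bad span: " + begin + ", " + end);
        }
        return new int[] { sourceOffsets.get(begin), sourceOffsets.get(end - 1) + 1 };
    }
}
```

With a mapping like this, an annotation made on the detagged text can be translated 
to offsets in the original HTML and compared with the HTML annotators' output.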

I'm using the Neko parser 
(http://people.apache.org/~andyc/neko/doc/html/index.html), which plugs into 
SAX and provides data via the SAX ContentHandler callbacks: 

http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ContentHandler.html 

I don't see a way with Neko or SAX to get position information while parsing.  
The text comes in via the characters() callback, and doesn't include the 
position within the HTML.
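
To make the limitation concrete, here is a small sketch of a SAX handler that 
collects text from characters(). It uses the JDK's built-in SAX parser on 
well-formed markup rather than Neko, and the class name CharactersCallbackDemo 
is made up for the example. The start and length arguments index into the 
parser's buffer, not into the source document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: gathers character data delivered by SAX callbacks. */
public class CharactersCallbackDemo extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // 'start' and 'length' describe a slice of the buffer 'ch';
        // they say nothing about where the text sat in the input markup.
        text.append(ch, start, length);
    }

    /** Parses well-formed markup and returns the concatenated character data. */
    public static String extractText(String xhtml) throws Exception {
        CharactersCallbackDemo handler = new CharactersCallbackDemo();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }
}
```

The handler ends up with the plain text, but nothing in the callback says which 
source offsets that text came from.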


The Jericho HTML Parser looks like a possibility, since it tracks the positions 
of the nodes.  The author compares Jericho to other parsers here:

http://jerichohtml.sourceforge.net/doc/index.html


Some other HTML parsers: 

http://htmlparser.sourceforge.net
http://jtidy.sourceforge.net
http://mercury.ccil.org/~cowan/XML/tagsoup
http://hotsax.sourceforge.net
http://html.xamjwg.org/cobra.jsp
http://htmlcleaner.sourceforge.net 
http://sourceforge.net/projects/mozillaparser 


Greg
