-------------- Original message ----------------------
From: "Adam Lally" <[EMAIL PROTECTED]>
> On 4/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > That leaves me with a mix of HTML and plain text annotators, annotating
> > different artifacts. The problem with that is I can't compare the
> > annotations. A plain text annotator and an HTML annotator may have
> > annotated the same logical text (same word, for example), but I have no
> > way of determining that. So that means I can't answer questions that
> > require both annotations.
>
> I have heard of people doing HTML detagging and meanwhile recording
> offset mappings that will allow them to determine which offsets in the
> detagged text correspond to which offsets in the original HTML. After
> analyzing the detagged text, this offset mapping can then be used, for
> example, to highlight annotated spans in the HTML. I don't have the
> code for this, though. It sounds like it could be a very useful
> utility for UIMA users.
>
> If you had that kind of information produced by your HTML detagger,
> would it address your requirements?
>
> -Adam

That would work. I'm using the Neko parser
(http://people.apache.org/~andyc/neko/doc/html/index.html), which plugs into
SAX and provides data via the SAX ContentHandler callbacks:
http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ContentHandler.html

I don't see a way with Neko or SAX to get position information while parsing.
The text comes in via the characters() callback, and that callback doesn't
include the text's position within the HTML.

The Jericho parser looks like a possibility, since it tracks the positions of
the nodes. The author compares Jericho to other parsers here:
http://jerichohtml.sourceforge.net/doc/index.html

Some other HTML parsers:
http://htmlparser.sourceforge.net
http://jtidy.sourceforge.net
http://mercury.ccil.org/~cowan/XML/tagsoup
http://hotsax.sourceforge.net
http://html.xamjwg.org/cobra.jsp
http://htmlcleaner.sourceforge.net
http://sourceforge.net/projects/mozillaparser

Greg
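
For illustration only, here is a minimal sketch of the offset-mapping idea
discussed above: strip the tags and, for every character kept, remember where
it sat in the original HTML. The class and method names
(OffsetMappingDetagger, detag, toSourceSpan) are hypothetical and not part of
UIMA, Neko, or Jericho. The tag scanner is deliberately naive: it assumes
every '<' opens a tag and the next '>' closes it, so it does not handle
entities, comments, scripts, or attribute values containing '>'. A real
detagger would get the positions from a parser that tracks them.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not an API of any library named in this thread.
final class OffsetMappingDetagger {

    /** Detagged text plus, for each of its characters, the offset of that character in the HTML. */
    static final class Result {
        final String text;
        final int[] sourceOffsets; // sourceOffsets[i] = index in the HTML of text.charAt(i)

        Result(String text, int[] sourceOffsets) {
            this.text = text;
            this.sourceOffsets = sourceOffsets;
        }

        /** Maps a non-empty span [begin, end) in the detagged text to a span [begin, end) in the HTML. */
        int[] toSourceSpan(int begin, int end) {
            if (begin < 0 || end > text.length() || begin >= end) {
                throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
            }
            // The end is exclusive, so map the last covered character and add one.
            return new int[] { sourceOffsets[begin], sourceOffsets[end - 1] + 1 };
        }
    }

    /** Naive detagger: drops everything between '<' and the next '>' (simplifying assumption). */
    static Result detag(String html) {
        StringBuilder text = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
            } else {
                offsets[text.length()] = i; // record where this kept character came from
                text.append(c);
            }
        }
        return new Result(text.toString(), Arrays.copyOf(offsets, text.length()));
    }
}
```

With something like this, an annotation produced on the detagged text (say,
begin 6 and end 11 for "world" in "Hello world") can be translated back through
toSourceSpan(6, 11) to the matching offsets in the HTML, which is what would
let plain text and HTML annotations be compared or highlighted in the original
page.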
