Re: HTMLParser

2006-07-15 Thread Charles Bell

The following little program should do the job for you. /* HTMLTextStripper.java * July 15, 2006 */ import java.io.*; import org.xml.sax.*; import org.xml.sax.helpers.*; import javax.xml.parsers.*; /** HTMLTextStripper * @author Charles Bell * @version July 15, 2006 */ public class HTM
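
The posted program is cut off in this archive snippet, so the following is not Charles Bell's code. It is only a minimal sketch of the general approach his imports suggest: run the markup through a SAX parser and collect the character data. The class name SaxTextStripperSketch and the method strip are hypothetical, and the sketch assumes the input is well-formed XHTML; plain SAX will reject typical tag-soup HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: strips tags from well-formed XHTML using SAX. */
final class SaxTextStripperSketch {

    /** Returns the character data of the document with all tags removed. */
    static String strip(String xhtml) throws Exception {
        final StringBuilder text = new StringBuilder();

        // Only text nodes are collected; element start/end events are ignored.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(strip(sample)); // prints "Hello world"
    }
}
```

For malformed or incomplete HTML, a tolerant parser such as the ones discussed elsewhere in this thread is the safer choice; this sketch throws a SAXException on the first well-formedness error.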

RE: HTMLParser

2006-07-14 Thread Ross Rankin

OK, I got it fixed and thought I would respond back so it was in the archive for the next poor soul... Here's the code I used: StringBean sb = new StringBean(); String htmlSource = record.get("column14").toString().trim(); Parser parser = new Parser(new Lexer(htmlSource));

Re: HTMLParser

2006-07-13 Thread Yonik Seeley

I've never used HTMLParser, but if you have malformed, incomplete, or optional HTML that would otherwise choke an HTML parser, you could use Solr's HTMLStripping: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e It's pretty stand-alone, s