The following little program should do the job for you:
/* HTMLTextStripper.java
* July 15, 2006
*/
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
/** HTMLTextStripper
* @author Charles Bell
* @version July 15, 2006
*/
public class HTM
OK, I got it fixed and thought I would respond back so it is in the archive for
the next poor soul...
Here's the code I used:
StringBean sb = new StringBean ();
String htmlSource = record.get("column14").toString().trim();
Parser parser = new Parser(new Lexer(htmlSource));
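For anyone who wants to try the same idea without adding a library, here is a minimal sketch that pulls the text out of an HTML string using only the parser that ships with the JDK (javax.swing.text.html). This is not the HTMLParser code above and not how StringBean works internally; the class name HtmlTextSketch and the method stripTags are made up for illustration, and whitespace handling may differ from what HTMLParser gives you.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper (not part of HTMLParser): extracts text using the JDK's built-in HTML parser. */
final class HtmlTextSketch {

    /** Returns the text content of the given HTML, with text runs joined by single spaces. */
    static String stripTags(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The callback is invoked once per run of text found between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument (ignoreCharSet = true) tells the parser not to
            // fail on a charset declaration inside the document.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }
}
```

You would call it as HtmlTextSketch.stripTags(htmlSource) with the same htmlSource string as in the snippet above. The JDK parser is fairly forgiving, but I have not tested it against badly broken markup, so treat it as a starting point only.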
I've never used HTMLParser, but if you have malformed, incomplete, or
optional HTML that would otherwise choke an HTML parser, you could use
Solr's HTMLStripping:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e
It's pretty stand-alone, s