If I use the BodyContentHandler, it's easy to send the text I get back to a
language detector:
    // A write limit of -1 disables the character limit on the extracted text
    ContentHandler handler = new BodyContentHandler(-1);
    parser.parse(stream, handler, metadata, parseContext);
    String str = handler.toString();

    // Load the language models once, then score the plain text
    LanguageDetector detector = new OptimaizeLangDetector();
    detector.loadModels();
    log.info("Language: " + detector.detectAll(str));
However, if I use ToXMLContentHandler(), the detector obviously has problems
identifying the language because of all the XML markup and metadata. Is there
an easy way to get just the body text of the XHTML output?
I played around with javax.xml.xpath, et al., but I'm not sure that the document
that comes back is a valid XML document.
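
For reference, here is a minimal sketch of the XPath approach using only the JDK's XML classes. It assumes the handler output is well-formed XHTML in the http://www.w3.org/1999/xhtml namespace, which is exactly the part in doubt. The class name XhtmlBodyText and the method bodyText are hypothetical names for illustration, not part of Tika. Matching on local-name() sidesteps the namespace, which is a common reason a plain //body expression returns nothing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Tika.
class XhtmlBodyText {

    /** Returns the concatenated text of the first body element in an XHTML string. */
    static String bodyText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Throws SAXException if the input is not well-formed XML
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // local-name() matches <body> regardless of its namespace
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("string(//*[local-name()='body'])", doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input, shaped like typical XHTML output
        String sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><meta name=\"Content-Type\" content=\"text/plain\"/>"
            + "<title>sample</title></head>"
            + "<body><p>Hello world</p></body></html>";
        System.out.println(bodyText(sample));
    }
}
```

The returned string could then be passed to detector.detectAll(...) in place of the BodyContentHandler text. If the parse step throws, that would confirm the output is not well-formed XML.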