If I use BodyContentHandler, it's easy to pass the text I get back to a 
language detector:

// -1 disables the write limit, so the full body text is kept
ContentHandler handler = new BodyContentHandler(-1);
parser.parse(stream, handler, metadata, parseContext);

// Plain text of the document body, without markup
String str = handler.toString();

LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();

log.info("Language: " + detector.detectAll(str));

However, if I use ToXMLContentHandler, the detector obviously has trouble 
because of all the XML markup and metadata in the output.  Is there an easy 
way to get just the body text of the XHTML output?

I played around with javax.xml.xpath et al., but I'm not sure that the 
document that comes back is a valid XML document.
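
For reference, a minimal sketch of the javax.xml.xpath route using only the JDK. It assumes the string produced by ToXMLContentHandler is well-formed XML, which is exactly the open question above, so treat it as an illustration rather than a confirmed answer. XhtmlBodyText and bodyText are hypothetical names, not part of Tika. The XPath matches on local-name() because the XHTML elements are expected to sit in the http://www.w3.org/1999/xhtml namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not a Tika class): pulls the text of <body> out of an XHTML string.
final class XhtmlBodyText {

    // Assumption: xhtml is well-formed XML; a parse error is thrown otherwise.
    static String bodyText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // local-name() sidesteps having to register the XHTML namespace prefix
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node body = (Node) xpath.evaluate(
                "//*[local-name()='body']", doc, XPathConstants.NODE);

        // getTextContent() concatenates all descendant text nodes, dropping the tags
        return body == null ? "" : body.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample input, not real Tika output
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello world</p></body></html>";
        System.out.println(bodyText(xhtml)); // prints "Hello world"
    }
}
```

The returned string could then be handed to detector.detectAll(...) in the same way as the BodyContentHandler text.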

