Cool, didn’t know about that.  But that doesn’t seem to be able to return the 
text that it got.  Can I do both?

From: Tim Allison <[email protected]>
Sent: Wednesday, January 13, 2021 2:34 PM
To: [email protected]
Subject: Re: Getting language of parsed text

Try the LanguageHandler()?

On Wed, Jan 13, 2021 at 2:21 PM Peter Kronenberg <[email protected]> wrote:
If I use the BodyContentHandler, it’s easy to send the text I get back to a 
language detector:

ContentHandler handler = new BodyContentHandler(-1);
parser.parse(stream, handler, metadata, parseContext);
String str = handler.toString();

LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();
log.info("Language: " + detector.detectAll(str));

However, if I use ToXMLContentHandler(), the detector obviously has problems 
identifying the language because of all the XML markup.  Is there an easy way 
to get just the body text out of the XHTML output?

I played around with javax.xml.xpath, et al., but I’m not sure that the 
document that comes back is a valid XML document.
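
For reference, a minimal sketch of one way to pull the body text out of the XHTML string using only the JDK's XML APIs. It assumes the string produced by ToXMLContentHandler is well-formed XML with an html root element and a body child; if it is not, the parse call will throw. The class name XhtmlBodyText and the method bodyText are hypothetical helpers for illustration, not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika.
class XhtmlBodyText {

    // Returns the concatenated text of the <body> element, with all tags dropped.
    // Assumes the input is well-formed XML; parse() throws otherwise.
    static String bodyText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() matches elements regardless of the XHTML default namespace,
        // so no NamespaceContext has to be registered.
        return xpath.evaluate(
                "string(/*[local-name()='html']/*[local-name()='body'])", doc);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the string returned by ToXMLContentHandler.toString().
        String xhtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Bonjour le monde</p></body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

The returned string could then be passed to detector.detectAll(...) in the same way as the BodyContentHandler text above. Note that string() concatenates text nodes without inserting separators, so adjacent paragraphs run together unless the XHTML already contains whitespace between them.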

