Cool, didn’t know about that. But that doesn’t seem to be able to return the text that it got. Can I do both?
From: Tim Allison <[email protected]>
Sent: Wednesday, January 13, 2021 2:34 PM
To: [email protected]
Subject: Re: Getting language of parsed text

Try the LanguageHandler()?

On Wed, Jan 13, 2021 at 2:21 PM Peter Kronenberg <[email protected]> wrote:

If I use the BodyContentHandler, it’s easy to send the text I get back to a language detector:

    ContentHandler handler = new BodyContentHandler(-1);
    parser.parse(stream, handler, metadata, parseContext);
    String str = handler.toString();
    LanguageDetector detector = new OptimaizeLangDetector();
    detector.loadModels();
    log.info("Language: " + detector.detectAll(str));

However, if I use ToXMLContentHandler(), it obviously has problems detecting the language because of all the XML metadata. Is there an easy way to get the body of the XHTML output? I played around with javax.xml.xpath, et al., but I’m not sure that the document that comes back is a valid XML document.
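
For the question about pulling the body text out of the XHTML string, here is a minimal sketch using only the JDK's XML classes. It assumes the string produced by ToXMLContentHandler is well-formed XML in the XHTML namespace, which is exactly the point in doubt above, so treat it as something to try rather than a confirmed answer. The class name XhtmlBodyText, the method bodyText, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika.
class XhtmlBodyText {

    // Returns the text content of the <body> element with whitespace collapsed.
    // Throws IllegalStateException if the input is not well-formed XML.
    public static String bodyText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // local-name() matches <body> without registering the XHTML namespace.
            String text = xpath.evaluate("string(//*[local-name()='body'])", doc);
            return text.replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract body text", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the string from ToXMLContentHandler.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello world.</p>\n<p>Bonjour le monde.</p>\n</body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

The returned string could then be handed to detector.detectAll(...) the same way as the BodyContentHandler output. Within Tika itself, org.apache.tika.sax.TeeContentHandler forwards the same parse events to several handlers, so it may be another way to get both the XHTML and the plain text from one parse; that route is not shown here.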
