Try the LanguageHandler()?
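
If you do need the XHTML as well, another option is to strip the markup yourself before calling the detector. A rough sketch using only the JDK's SAX parser, assuming the ToXMLContentHandler output is well-formed XML (worth checking on your own documents). The class and method names here are made up for illustration and are not part of Tika:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class.
final class XhtmlBodyText {

    // Returns the character data found inside <body>, skipping tags and <head>.
    static String bodyText(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Namespace-aware so that localName is "body" even with the XHTML namespace.
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();

        StringBuilder text = new StringBuilder();
        parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
            // 0 while outside <body>; otherwise the nesting depth counted from <body>.
            private int bodyDepth = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (bodyDepth > 0 || "body".equals(localName)) {
                    bodyDepth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (bodyDepth > 0) {
                    bodyDepth--;
                    // Keep words in adjacent elements apart. This also splits words
                    // broken by inline tags, which is usually fine for detection.
                    text.append(' ');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (bodyDepth > 0) {
                    text.append(ch, start, length);
                }
            }
        });
        // Collapse runs of whitespace left over from the markup.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

You would then pass the result of XhtmlBodyText.bodyText(xml) to detector.detectAll(...) in place of the raw XHTML string.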

On Wed, Jan 13, 2021 at 2:21 PM Peter Kronenberg <[email protected]>
wrote:

> If I use the BodyContentHandler, it’s easy to send the text I get back to
> a language detector:
>
> ContentHandler handler = new BodyContentHandler(-1);
> parser.parse(stream, handler, metadata, parseContext);
> String str = handler.toString();
>
> LanguageDetector detector = new OptimaizeLangDetector();
> detector.loadModels();
> log.info("Language: " + detector.detectAll(str));
>
> However, if I use ToXMLContentHandler(), it obviously has problems detecting 
> the language because of all the XML metadata.  Is there an easy way to get 
> the body of the XHTML output?
>
> I played around with javax.xml.xpath et al., but I’m not sure that the
> document that comes back is a valid XML document.
