Hello everyone!
I'm using Tika 1.25 to detect the language of a long text that I read from a PDF (using PDFBox 2.0.22):

    LanguageDetector detector = new OptimaizeLangDetector();
    detector.loadModels();
    List<LanguageResult> languages = detector.detectAll(text);

The text is about 400 pages long. Most of it is in English, with a couple of pages in French, a few paragraphs in Greek, and a couple of Arabic and German sentences. I know that language detection needs a reasonably long text sample to work, so I'm fine with the short Arabic/German sentences not being detected.

If I run the code above with just a short sample in French or Greek, the detector finds the right language. But if I use the whole text as input, the result is:

    en (0.9999969) = English with a 99.99969% probability

It doesn't list the other languages.

If I give the detector a mixed sample, it only detects both languages if they make up about the same amount of text. If one part, e.g. in French, is 5 lines of text (~65 words) and the second, e.g. in Greek, is 7 lines of text (~80 words), the result is:

    el (0.99999815) = Greek

With 55 words in French and 45 words in Greek, the result is:

    fr (0.5714264)
    el (0.4285709)

I also tried the alternative way:

    detector.setMixedLanguages(true);
    detector.addText(text);
    List<LanguageResult> languages = detector.detectAll();

This also lists only a single language, both for the full text and for my first French-Greek sample.

How do I get the other languages (in my case French and Greek) in the result too?
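To make the goal more concrete, here is a rough sketch of the kind of result I'm after, assuming the only way to get it is to split the text into chunks and run detection on each chunk separately. Everything in it is hypothetical: the class name `ChunkedLanguageTally`, the helper names, splitting on blank lines, and counting characters per language are just my assumptions, and the detector is passed in as a plain `Function<String, String>` so the sketch runs without Tika.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: detect the language per chunk, then sum up
// how many characters were attributed to each language.
class ChunkedLanguageTally {

    // Split on blank lines. Paragraph-sized chunks are an assumption;
    // per-page chunks from the PDF would work the same way.
    static List<String> splitIntoChunks(String text) {
        List<String> chunks = new ArrayList<>();
        for (String part : text.split("\\R\\s*\\R")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(trimmed);
            }
        }
        return chunks;
    }

    // 'detect' maps a chunk to a language code such as "en" or "fr".
    static Map<String, Integer> tally(List<String> chunks,
                                      Function<String, String> detect) {
        Map<String, Integer> charsPerLanguage = new LinkedHashMap<>();
        for (String chunk : chunks) {
            String language = detect.apply(chunk);
            charsPerLanguage.merge(language, chunk.length(), Integer::sum);
        }
        return charsPerLanguage;
    }

    public static void main(String[] args) {
        String text = "This is English.\n\nCeci est du texte.\n\nMore English here.";
        // Stand-in detector for the demo only; it is not a real language detector.
        Function<String, String> stub =
                chunk -> chunk.startsWith("Ceci") ? "fr" : "en";
        System.out.println(tally(splitIntoChunks(text), stub));
    }
}
```

With Tika I would presumably pass something like `chunk -> detector.detect(chunk).getLanguage()` instead of the stub, but I'd expect very short chunks to be detected unreliably, so I'm hoping there is a built-in way to do this.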
