WG: Detecting multiple languages in a long text

Julia Ruzicka Mon, 01 Feb 2021 05:40:05 -0800

Hello everyone!


I'm using Tika 1.25 to detect the language of a long text that I read from a
PDF (using PDFBox 2.0.22):

 

LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();
List<LanguageResult> languages = detector.detectAll(text);

 

The text is about 400 pages long. Most of it is in English, with a couple of
pages in French, a few paragraphs in Greek and a couple of Arabic and German
sentences.

I know that language detection needs a long-ish text sample to work, so I'm
fine with the short Arabic/German sentences not being detected. When I run
the code above with just a short sample in French or Greek, the detector
finds the right language, but if I use the whole text as input, the result
is:

en (0.9999969) = English with a 99.99969% probability

 

It doesn't list the other languages.

 

If I give the detector a mixed sample, it only detects both languages if
each makes up about the same amount of text.

If one part in e.g. French is 5 lines of text (~65 words) and the second in
e.g. Greek is 7 lines of text (~80 words), the result is:

el (0.99999815) = Greek

 

With 55 words in French and 45 words in Greek the result is:

fr (0.5714264)

el (0.4285709)

 

I also tried to do it the alternative way:

 

detector.setMixedLanguages(true);
detector.addText(text);
List<LanguageResult> languages = detector.detectAll();

 

This also lists only a single language, both for the full text and for my
first French-Greek text sample.

 

How do I get the other languages (in my case: French & Greek) as a result
too?
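
One workaround that might be worth trying, given that short single-language
samples are detected correctly, is to split the text into chunks and run the
detector on each chunk separately instead of on the whole document. The
following is only a minimal sketch under assumptions, not something taken
from the Tika documentation: the class ChunkedLanguageDetection, both method
names and the minChars threshold are hypothetical, chunks are formed by
splitting on blank lines, and the detector is passed in as a plain function
so the sketch does not depend on any Tika class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: per-chunk language detection instead of whole-text detection.
final class ChunkedLanguageDetection {

    // Splits the text on blank lines and merges consecutive paragraphs
    // until each chunk has at least minChars characters (assumed tuning knob).
    static List<String> splitIntoChunks(String text, int minChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\\R\\s*\\R")) {
            String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(trimmed);
            if (current.length() >= minChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep whatever is left over, even if it is shorter than minChars.
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    // Applies the given single-language detector to every chunk and counts
    // how many chunks were assigned to each language code.
    static Map<String, Integer> countLanguages(List<String> chunks,
                                               Function<String, String> detectOne) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String chunk : chunks) {
            counts.merge(detectOne.apply(chunk), 1, Integer::sum);
        }
        return counts;
    }
}
```

With the detector from the snippets above, the function argument could
presumably be chunk -> detector.detect(chunk).getLanguage(), so that every
chunk is scored on its own and a few French or Greek pages are no longer
outweighed by hundreds of English ones. The chunk size would need tuning:
too small and detection becomes unreliable, too large and a minority
language gets drowned out again.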
