Hi Julia,

Comments inline below…

— Ken

> On Feb 2, 2021, at 7:08 AM, Julia Ruzicka <[email protected]> wrote:
> 
> Hi Ken,
>  
> Yes, exactly. I don’t care about single sentences because those could give 
> wrong results, but if it’s, let’s say, more than maybe 5 sentences, so 
> basically something whose language can be detected without a problem, 
> then it should show up in the results. This could probably be done by 
> ignoring every language that’s less than … %, but for that to work I’d 
> actually need the full list first.
>  
> Splitting into pages won’t work because it already doesn’t output the second 
> language if it’s just one less sentence, so a page with 3 paragraphs in 
> English and 1 paragraph in e.g. Greek is always going to result in “en 
> (0.999xxxx)”.
> I don’t know how it calculates the probabilities but with that example I’d 
> expect the result:
> en (0.75)
> el (0.25)

First, a quick sketch of the algorithm…

Leaving aside setting a priori probabilities, initially it sets every language 
to the same probability. It then starts collecting ngrams (say, 4-character 
sequences) from the text. For each ngram, it does a lookup (in the various 
language models that have been loaded) to see if it “knows” about that ngram. 
If it does, then it calculates, for each language, the probability that the 
ngram would be found in text of that language versus the other languages. Each 
ngram probability is multiplied together with previous ngram probabilities 
(with regular normalization back to 100%) to determine each language’s 
probability.

If a language model doesn’t contain an ngram, then a probability of 0 isn’t 
used, as that would instantly (and forever) drop the probability of that 
language to 0. Instead, a small value (alpha) is used.
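
To make that concrete, here’s a toy version of the update loop. This is not 
the actual Optimaize code; the model lookup structure, the ngram length, and 
the alpha value are all made up for illustration:

import java.util.HashMap;
import java.util.Map;

// Toy version of the update loop, NOT the actual Optimaize code.
// "models" maps language -> (ngram -> relative frequency), standing in
// for the real loaded language models.
static Map<String, Double> toyDetect(String text,
        Map<String, Map<String, Double>> models) {
    final double ALPHA = 0.0005;  // made-up smoothing value for unknown ngrams
    final int N = 4;              // ngram length

    // Initially, every language gets the same probability.
    Map<String, Double> prob = new HashMap<>();
    for (String lang : models.keySet()) {
        prob.put(lang, 1.0 / models.size());
    }

    for (int i = 0; i + N <= text.length(); i++) {
        String ngram = text.substring(i, i + N);
        double total = 0.0;
        for (String lang : prob.keySet()) {
            // A model that doesn't contain the ngram contributes ALPHA
            // instead of 0, so the language is never eliminated outright.
            double p = models.get(lang).getOrDefault(ngram, ALPHA);
            double updated = prob.get(lang) * p;
            prob.put(lang, updated);
            total += updated;
        }
        // Normalize back to 100% after every ngram to avoid underflow.
        for (String lang : prob.keySet()) {
            prob.put(lang, prob.get(lang) / total);
        }
    }

    return prob;
}

Run that over a page of mostly-English text and the English probability races 
to ~1.0 within a handful of ngrams, which is the collapse described next.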

But the end result is that the probability of a language very quickly shifts to 
either ~100% (given a run of more than a few ngrams that only occur in that one 
language) or almost 0% (given ngrams that never/rarely occur in that language).

This is a known issue with the approach used by Tika’s default language 
detector. The best way to work around it currently is to segment the text (e.g. 
by paragraphs), and then combine the calculated probabilities for each piece. 
E.g. 75% of paragraphs are English, and 25% are French.
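
As a rough sketch, assuming a naive blank-line split and a made-up 50-character 
minimum (neither of which is anything built into Tika):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

static Map<String, Double> languageMix(String text) {
    LanguageDetector detector = new OptimaizeLangDetector().loadModels();

    // Naive segmentation on blank lines; swap in real paragraph
    // extraction (see the PDFBox question below) when you have it.
    String[] paragraphs = text.split("\\n\\s*\\n");

    Map<String, Integer> counts = new HashMap<>();
    int detected = 0;
    for (String paragraph : paragraphs) {
        if (paragraph.trim().length() < 50) {
            continue;  // arbitrary minimum; too short to detect reliably
        }
        List<LanguageResult> results = detector.detectAll(paragraph);
        if (results.isEmpty()) {
            continue;
        }
        LanguageResult best = results.get(0);
        if (best.isReasonablyCertain()) {
            counts.merge(best.getLanguage(), 1, Integer::sum);
            detected++;
        }
    }

    // Report each language as its fraction of the detected paragraphs,
    // e.g. en (0.75), el (0.25) for your 3-English/1-Greek page.
    Map<String, Double> mix = new HashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        mix.put(e.getKey(), (double) e.getValue() / detected);
    }
    return mix;
}

With your 400-page document, a chapter-sized run of French would then get 
counted as its own set of paragraphs instead of being drowned out by the 
English ngrams.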

> In my 400-page example the French part is about 4 pages (= one chapter), so 
> 1% of the full text, and I’m a bit confused why that isn’t mentioned in the 
> result.
>  
> Splitting into paragraphs would probably be done with PDFBox, right? Looking 
> at old questions on Stack Exchange it seems to be semi-easy to do (basically 
> a split on “\n”?) but there’s no guarantee that it’ll actually find all 
> paragraphs.

I’ll let someone more knowledgeable about PDFBox (Tilman???) respond. But even 
if it doesn’t correctly segment every paragraph, getting most of them right is 
going to be much better than detecting over the whole document at once.
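
For what it’s worth, one trick I’ve seen is to have PDFTextStripper emit an 
explicit paragraph-end marker and then split on it. A minimal sketch, assuming 
PDFBox 2.0.x; the NUL marker is an arbitrary choice, and paragraph detection 
is heuristic:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

static String[] extractParagraphs(File pdf) throws IOException {
    try (PDDocument doc = PDDocument.load(pdf)) {
        PDFTextStripper stripper = new PDFTextStripper();
        // Mark each detected paragraph end with a character that won't
        // occur in the extracted text, then split on it afterwards.
        stripper.setParagraphEnd("\u0000");
        String text = stripper.getText(doc);
        return text.split("\u0000");
    }
}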

> In the end I only actually care about the languages, the probabilities I’d 
> only use to see if it’s even worth mentioning a specific one if it should 
> return more than one for longer text samples.
>  
>  
> From: Ken Krugler <[email protected]>
> Sent: Monday, February 1, 2021 17:29
> To: [email protected]
> Subject: Re: Detecting multiple languages in a long text
>  
> Hi Julia,
>  
> So the goal is to have detection results show some non-zero probability for 
> the other languages, right?
>  
> In general doing this for long runs of text is almost impossible using 
> probabilistic models.
>  
> What you need to do is break the text up into some smaller units (by page or 
> even better by paragraph, for example) and then do detection separately on 
> each chunk of text.
>  
> Then based on those results, you can decide how you want to report actual 
> content…which isn’t straightforward.
>  
> E.g. what if only one paragraph (out of many) had a 10% chance of being 
> Greek, because it contained one sentence in Greek, but everything else was 
> English? Would you want to report the total document as English, or English 
> with some Greek, or something else?
>  
> Regards,
>  
> — Ken
>  
>  
>> On Feb 1, 2021, at 5:39 AM, Julia Ruzicka <[email protected]> wrote:
>>  
>> Hello everyone!
>>  
>> I’m using Tika 1.25 to detect the language of a long text that I read from a 
>> PDF (using PDFBox 2.0.22):
>>  
>> LanguageDetector detector = new OptimaizeLangDetector();
>> detector.loadModels();
>> List<LanguageResult> languages = detector.detectAll(text);
>>  
>> The text is about 400 pages and most of it is in English, with a couple of 
>> pages in French, a few paragraphs in Greek and a couple of Arabic and German 
>> sentences.
>> I know that language detection needs a long-ish text sample for the 
>> detection to work, so I'm fine with the short Arabic/German sentences not 
>> being detected. Running the code above with just a short sample in French or 
>> Greek, the detector finds the right language but if I use the whole text as 
>> input, the result is:
>> en (0.9999969) = English with a 99.99969% probability
>>  
>> It doesn’t list the other languages.
>>  
>> If I give the detector a mixed sample, it only detects both languages if 
>> they’re about the same amount of text.
>> If one part in e.g. French is 5 lines of text (~65 words) and the second in 
>> e.g. Greek is 7 lines of text (~80 words), the result is:
>> el (0.99999815) = Greek
>>  
>> With 55 words in French and 45 words in Greek the result is:
>> fr (0.5714264)
>> el (0.4285709)
>>  
>> I also tried to do it the alternative way:
>>  
>> detector.setMixedLanguages(true);
>> detector.addText(text);
>> List<LanguageResult> languages = detector.detectAll();
>>  
>> This also only lists a single language with the full text and my first 
>> French-Greek text sample.
>>  
>> How do I get the other languages (in my case: French & Greek) as a result 
>> too?
> 
>  
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
