Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

Merlijn B.W. Wajer Fri, 14 May 2021 05:47:39 -0700

Hi Ben,

On 13/05/2021 02:34, Ben Crowell wrote:
> 
> Only 68% of Greek words are correctly recognized as Greek, and even of 
> those, some are misread. Extremely common words like μοι, ὁς, and και are 
> not recognized, although they are mostly recognized when I OCR the text 
> with the language set only to Greek. So as far as I can tell, tesseract 
> just can't really do this kind of bilingual text with a non-Latin font. Of 
> course, there could be something I'm not understanding that would improve 
> things.
> 
> From descriptions I've read, it seems that tesseract's neural network is 
> designed to try to scan large blocks of text at once, not just individual 
> words. I suspect that this makes it unwilling to read Greek as Greek when 
> it's surrounded by English. This would help to explain why it reads ὁς 
> correctly when in Greek-only mode, but when in English+Greek mode, it reads 
> it as os, which isn't even a word in the English dictionary I'm using.
> 
> Training it on the book's Greek font may have done as much harm as good. It 
> gets words like Μουσα right, which it got wrong before, but it makes errors 
> on words like πολυτροπον and ανθρωπων, spelling them as πολυτροποιν and 
> ανιθρωπων.


One other avenue you could perhaps explore is to OCR the text in each
language separately and then, for each word, pick the result with the
highest confidence. I haven't tried this and do not know how feasible
it is.
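
A minimal sketch of that idea in Python, under several assumptions: each
language was run separately with Tesseract's TSV output (for example
"tesseract page.png out_grc -l grc tsv"), words from the two runs are
paired by bounding-box overlap, and the per-word confidences of the two
models are comparable, which may well not hold in practice. The names
parse_tsv, merge_by_confidence and min_iou are made up for illustration
and are not part of Tesseract.

```python
import csv
import io


def parse_tsv(tsv_text):
    """Parse Tesseract TSV output into a list of word dicts (level 5 rows)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; other levels are pages, blocks, lines.
        if row["level"] != "5" or not text:
            continue
        left, top = int(row["left"]), int(row["top"])
        words.append({
            "box": (left, top, left + int(row["width"]), top + int(row["height"])),
            "conf": float(row["conf"]),
            "text": text,
        })
    return words


def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def merge_by_confidence(primary, secondary, min_iou=0.5):
    """For each primary word, use the best-overlapping secondary word
    instead if it overlaps enough and has a higher confidence."""
    merged = []
    for word in primary:
        best = max(secondary, key=lambda o: iou(word["box"], o["box"]),
                   default=None)
        if (best is not None
                and iou(word["box"], best["box"]) >= min_iou
                and best["conf"] > word["conf"]):
            merged.append(best)
        else:
            merged.append(word)
    return merged
```

With the two TSV files read into strings, something like
" ".join(w["text"] for w in merge_by_confidence(parse_tsv(eng_tsv),
parse_tsv(grc_tsv))) would give the merged text in the order of the first
run. Words that only the second run found are dropped in this sketch, and
line breaks are not preserved.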

Also - I am not sure if it helps, but you might want to consider filing
a bug report on GitHub: https://github.com/tesseract-ocr/tesseract/issues

Cheers,
Merlijn

> On Monday, May 10, 2021 at 4:42:12 PM UTC-7 Ben Crowell wrote:
> 
>> Here is a version of the text that I typeset using xelatex, with the 
>> font DejaVu Serif. It has all the accents, which should make it a good 
>> typographical match to the data that tesseract was trained on to make the 
>> grc file.
>> [image: tex_output.png]
>> Here is the result:
>>
>> Ἔννεπε declare pot to me, Movoa Muse,
>>
>> ἄνδρα the man πολύτροπον of many fortunes,
>> oc who πλάγχθη wandered μάλα πολλὰ very
>> much, ἐπεὶ when émepoe he had destroyed
>> ἱερὸν πτολίεθρον the sacred city Τροίης of Troy:
>> ἴδε δε and saw ἄστεα towns Kai and ἔγνω
>> learnt voov the mood πολλῶν ἀνθρώπων of
>>
>> Now 73% of Greek words are recognized as Greek. So this is quite a bit 
>> better, but still fairly poor. It seems really odd to me that tesseract is 
>> not getting the common words μοι, ὃς, and καὶ. For comparison, it would be as 
>> if tesseract was OCRing an English text and not being able to read "me," 
>> "who," and "and."
>>
>> On Monday, May 10, 2021 at 3:20:47 PM UTC-7 Ben Crowell wrote:
>>
>>> I compiled tesseract from source, which gave me 
>>> version 5.0.0-alpha-20210401-102-g4374, and used the latest grc.traineddata 
>>> file. To get a measure of what's going on, I decided to count the number of 
>>> Greek words rendered as Greek in the first 7 lines of this text, which 
>>> contain 22 actual Greek words.
>>>
>>> tesseract 4.1.1, eng+grc -- 14% correct
>>>
>>> tesseract 5.0.0 on my machine, eng+grc -- 41% correct
>>>
>>> tesseract 5.0.0 on my machine, eng+ell -- 68% correct
>>>
>>> tesseract 5.0.0 on archive.org -- 55% correct
>>>
>>> Several things are similar in your results and mine. The incorrect 
>>> scanning of ἱερον when surrounded by English words no longer seems to occur 
>>> in 5.0.0. The word μοι is usually rendered incorrectly, but this may be 
>>> because there seems to be broken type that causes the descender on the mu 
>>> to be omitted. Μουσα is read incorrectly as Movca, which is probably 
>>> because this personification of the Muse isn't in the dictionary.
>>>
>>> One thing that I hadn't noticed previously is that the accentuation in 
>>> this text is weird. Although the 18th-century typesetter included the 
>>> breathing marks, which aren't used in modern Greek, they left out all the 
>>> acute, grave, and circumflex accents, which would usually have been 
>>> included in a modern typesetting of an ancient Greek text. So it may be 
>>> that the dictionary for grc is more appropriate, but the character 
>>> recognition for ell is better here. I think this can be tested by 
>>> typesetting the same 7 lines with and without accents.
>>>
>>> On Monday, May 10, 2021 at 7:34:34 AM UTC-7 Merlijn Wajer wrote:
>>>
>>>> Hi Ben, 
>>>>
>>>> On 10/05/2021 15:09, Ben Crowell wrote: 
>>>>> Hi Merlijn, 
>>>>>
>>>>> Thanks very much for your reply. It's encouraging that you were able to
>>>>> get somewhat better results. However, I'm not able to reproduce them.
>>>>> When I use -l eng+ell, the results are still very poor:
>>>>>
>>>>> 1. Evverre declare wot to me, Movca Muse, 
>>>>> avopa the man voAvtpotrov of many fortunes, 
>>>>> ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very 
>>>>> much, eves when ewepoev he had destroyed 
>>>>> i d city T { Troy: 
>>>>> lepov troAscOpor the sacred city Tons of Troy : 
>>>>> we Se and saw aorea towns «at and eyvo 
>>>>> learnt vooy the mood πολλων ανθρωπων οἳ 
>>>>>
>>>>> The text uses ancient Greek vocabulary and accentuation, so it actually
>>>>> makes sense to use grc, not ell.
>>>>
>>>> Ah, my bad. 
>>>>
>>>>>
>>>>> I didn't understand what you meant by "using the Archive.org Tesseract
>>>>> stack," but a web search on your name led me to archive-pdf-tools, which
>>>>> you're the author of. It's great to have help from someone who's clearly
>>>>> very expert. I just don't know how to diagnose what is different between
>>>>> your setup and mine. It looks like you did the whole first page rather
>>>>> than the piece I posted, so there may be a difference in how the image
>>>>> was prepared. I just zoomed in on the archive.org page, took a
>>>>> screenshot, cropped it, and changed it to grayscale. I'm running
>>>>> tesseract 4.1.1, which seems to be the latest official release. Are you
>>>>> running a version compiled from the latest source or something? My file
>>>>> /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata, which came from
>>>>> installing the Debian package tesseract-ocr-grc, is dated 2017, which
>>>>> seems old, and is 2.2 MB. The version at
>>>>> https://github.com/tesseract-ocr/tessdata is 7 MB and looks like it was
>>>>> changed around 2018. I could try just replacing the file with the newer
>>>>> version, but I have no idea whether that's a reasonable thing to do,
>>>>> since I don't know anything about how the software is designed.
>>>>
>>>> "using the Archive.org Tesseract stack" means that archive.org will 
>>>> automatically run Tesseract OCR on uploaded content and make those 
>>>> results available (so you can compare with your local results). Because 
>>>> this book predates the integration of Tesseract, I submitted the content 
>>>> for re-OCRing, using Tesseract, in an attempt to reproduce your results. 
>>>>
>>>> I'm rerunning the item now with Ancient Greek "grc" as opposed to Greek 
>>>> "ell". 
>>>>
>>>> The version being used is Tesseract "5.0.0-alpha-20201231" [1], and
>>>> the language packs are the latest ones from Git, I believe. It might
>>>> be worth giving the latest version a shot to see if it yields
>>>> better results. There is an Ubuntu PPA [2] with development
>>>> snapshots/versions. If the latest version still gives
>>>> unsatisfying results, it would be worth investigating why.
>>>>
>>>>
>>>> Hope this helps, 
>>>> Cheers, 
>>>> Merlijn 
>>>>
>>>> [1] https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-alpha-20201231
>>>> [2] http://ppa.launchpad.net/alex-p/tesseract-ocr-devel
>>>>
>>>
> 

