Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

Merlijn B.W. Wajer Mon, 10 May 2021 03:39:08 -0700

Hi Ben,

On 09/05/2021 21:33, Ben Crowell wrote:
> I'm trying to OCR a book that is written in interspersed Greek and English:
> 
> https://archive.org/details/odysseyofhomerco01gile/page/n5/mode/2up
> 
> Here is a sample of text from the first page:
> 
> [image: a.jpg]
> 
> I'm running tesseract 4.1.1 on linux, with the tesseract-ocr-grc package 
> installed. Here's the command I'm using to OCR this sample:
> 
> tesseract a.jpg temp -l eng+grc
> 
> Here is the result:
> 
> 1. Evverre declare wot to me, Movca Muse,
> ανδρα the man voAvtpotrov of many fortunes,
> os who wAayyx@n wandered μαλα πολλα very
> much, eves when ewepoev he had destroyed
> i d city T { Troy:
> lepov troAscOpor the sacred city Tons of Troy :
> we Se and saw aorea towns «at and eyvo
> learnt vooy the mood roAAwy avOpwror of
> 
> Basically it almost never recognizes Greek as Greek, and instead tries to 
> read it as English 95% of the time. Here is what I get if I just tell 
> tesseract to treat it as Greek:
> 
> 1. ἔννεπε ἀδοίατο μοι ἴο 1π0, ἴἥουσα δίαΞο,
> ανδρα {11 τπᾶπι πολύτροπον οἱ ΤΩΔ}Υ ἰοτέιπο5,
> ὃς ψ|ὸ πλαγχθὴ παπάθιοα μαλα πολλα νετῦ
> τω ποἢ, ἐπὲῦ ΠῸπ ἐπέρσεν ᾿ἰ6 πα ἀεβίτογοά
> ; ἀο Τ {Ττου:
> ἱερον πτολίεθρον [116 ΞΔογβα οἷἵγ Τροιης οἷ ΤτοΥ :
> ἰδὲ δε ἀμ 5 αστεῶ ἰο 8 καὶ ἃπά εγνω
> Ἰρατηῦ νοὸν {πῸ Ἰηοοὰ πολλων ανθρωπὼν οἵ
> 
> This seems odd to me. Although it still makes some errors, such as reading 
> Μουσα as ἴἥουσα on the first line, it now gets the common word ἱερον 
> (holy) correct, whereas in the original attempt, it rendered it as lepov, 
> which is not a word in either language. If it's capable of correctly 
> interpreting ἱερον, which is presumably in its dictionary, then I don't 
> understand why, when I use eng+grc, it doesn't get it right.
> 
> I tried cropping this sample so it was only the single word:
> 
> [image: aa.jpg]
> When I read this using -l eng+grc, it gets it right. So it seems as though 
> it's perfectly capable of both recognizing this word as Greek and properly 
> OCRing it, but somehow it's reluctant to do so when some of the surrounding 
> text is in English.
> 
> So in summary, although there are some errors that may have to do with 
> image quality or not being trained on this font, there is also some other 
> kind of problem where tesseract doesn't like to "switch gears" from one 
> language to the other.
> 
> Can anyone help with diagnosing and/or fixing this problem?
> 
> Could the issue have anything to do with the fact that the Latin letters 
> are upright, while the Greek ones are in a slanted/italic font? Does the 
> neural network have a preference for English because the English corpus it 
> was trained on was so huge compared to the Greek one?


I took the liberty of re-running OCR for that item using the Archive.org
Tesseract stack (with Greek also specified as a language), and this is the
result for the quoted paragraph. It's not perfect, but I think it is
better than what you are seeing:

> i SOMERS ODYSSEY. 
> 
> 
> BOOK I. 
> 
> 
> 1. Έννεπε declare µοι to me, Movca Muse, 
> avdpa the man πολυτροπον of many fortunes, 
> os who πλαγχθη wandered pada πολλα very 
> much, eves when επερσεν he had destroyed 
> ἱερον πτολιεθρον the sacred city Τροιης of Troy : 
> we δε and saw αστεα towns και and εγνω 
> learnt vooy the mood πολλων ανθρωπων of 
> many men, πολλα δε αλγεα but many sorrows 
> oye he indeed παθε suffered ὁν κατα θυµον in 
> his soul, apyvper'os whilst grasping ἦν τε Wyn? 
> both his own life και and νοστον the return erat. 
> pov of his companions. Adda but ουδε not even 
> ὡς thus ερρυσατο did he save έταρους his com- 
> panions, iewevos περ though bent upon it: 
> ολοντο yap for they perished σφετερησιν ατασ- 
> σθαλιῃσι by their own phrensies, νηπιοι fools, 
> οἱ who κατα ησθιον ate up βους the oxen 
> Heduovo of the Sun ὝὙπεριονος who rolls above 
> Us : autap but ὁ he αφειλετο took away Tors 
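
One rough way to compare this output with the eng+grc result quoted above
is to count how many tokens on a line came out in Greek script and how many
in Latin script. The following is only a sketch: the function names
script_of and script_counts are made up for illustration, and it classifies
characters by their Unicode names, which says nothing about whether the
Greek words are actually correct.

```python
import unicodedata
from collections import Counter


def script_of(token: str) -> str:
    """Classify a token as 'greek', 'latin', 'mixed' or 'other' by its letters."""
    scripts = set()
    # NFKC folds compatibility look-alikes into their base letters,
    # e.g. U+00B5 MICRO SIGN becomes U+03BC GREEK SMALL LETTER MU.
    for ch in unicodedata.normalize("NFKC", token):
        if not ch.isalpha():
            continue  # ignore digits, punctuation and combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK"):
            scripts.add("greek")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "other"  # token had no letters at all
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"


def script_counts(line: str) -> Counter:
    """Count whitespace-separated tokens of a line by script."""
    return Counter(script_of(tok) for tok in line.split())


if __name__ == "__main__":
    # The same source line as recognized in the two runs.
    print("eng+grc:", script_counts("lepov troAscOpor the sacred city Tons of Troy :"))
    print("eng+ell:", script_counts("ἱερον πτολιεθρον the sacred city Τροιης of Troy :"))
```

A line where the Greek count drops to zero although the page clearly has
Greek words on it (like the "lepov troAscOpor" line) is a sign that Greek
was read as Latin. Tokens reported as "mixed" are also worth a look.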

I wonder if the problem you were seeing was related to using the Ancient
Greek model (grc) as opposed to the modern Greek one (ell)? These are the
parameters that were used just now:

> ocr_parameters     -l eng+ell 
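
If you want to try the same language setting locally, something like the
following might do as a starting point. It is a sketch under assumptions:
it reuses the a.jpg sample name from your message, it assumes the ell
traineddata is installed alongside eng and grc, and it only reproduces the
-l parameter, not anything else the Archive.org stack may do before or
after calling Tesseract. The helper names are made up for illustration.

```python
import os
import shutil
import subprocess
from typing import Sequence


def tesseract_command(image: str, langs: Sequence[str]) -> list:
    """Build a tesseract CLI call for the given image and language list."""
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file; languages are joined with "+".
    return ["tesseract", image, "stdout", "-l", "+".join(langs)]


def run_ocr(image: str, langs: Sequence[str]) -> str:
    """Run tesseract and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image, langs),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    image = "a.jpg"  # sample file name taken from the quoted message
    if shutil.which("tesseract") is None or not os.path.exists(image):
        print("tesseract or", image, "not found; nothing to compare")
    else:
        # Compare the original setting with the one used above.
        for langs in (["eng", "grc"], ["eng", "ell"]):
            print("=== -l", "+".join(langs), "===")
            print(run_ocr(image, langs))
```

Running the two settings side by side on the same crop should show whether
the grc/ell difference alone accounts for the improvement.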

Hope this helps.

Cheers,
Merlijn
