Hi Ben,

On 09/05/2021 21:33, Ben Crowell wrote:
> I'm trying to OCR a book that is written in interspersed Greek and English:
>
> https://archive.org/details/odysseyofhomerco01gile/page/n5/mode/2up
>
> Here is a sample of text from the first page:
>
> [image: a.jpg]
>
> I'm running tesseract 4.1.1 on linux, with the tesseract-ocr-grc package
> installed. Here's the command I'm using to OCR this sample:
>
> tesseract a.jpg temp -l eng+grc
>
> Here is the result:
>
> 1. Evverre declare wot to me, Movca Muse,
> ανδρα the man voAvtpotrov of many fortunes,
> os who wAayyx@n wandered μαλα πολλα very
> much, eves when ewepoev he had destroyed
> i d city T { Troy:
> lepov troAscOpor the sacred city Tons of Troy :
> we Se and saw aorea towns «at and eyvo
> learnt vooy the mood roAAwy avOpwror of
>
> Basically it almost never recognizes Greek as Greek, and instead tries to
> read it as English 95% of the time. Here is what I get if I just tell
> tesseract to treat it as Greek:
>
> 1. ἔννεπε ἀδοίατο μοι ἴο 1π0, ἴἥουσα δίαΞο,
> ανδρα {11 τπᾶπι πολύτροπον οἱ ΤΩΔ}Υ ἰοτέιπο5,
> ὃς ψ|ὸ πλαγχθὴ παπάθιοα μαλα πολλα νετῦ
> τω ποἢ, ἐπὲῦ ΠῸπ ἐπέρσεν ᾿ἰ6 πα ἀεβίτογοά
> ; ἀο Τ {Ττου:
> ἱερον πτολίεθρον [116 ΞΔογβα οἷἵγ Τροιης οἷ ΤτοΥ :
> ἰδὲ δε ἀμ 5 αστεῶ ἰο 8 καὶ ἃπά εγνω
> Ἰρατηῦ νοὸν {πῸ Ἰηοοὰ πολλων ανθρωπὼν οἵ
>
> This seems odd to me. Although it still makes some errors, such as reading
> Μουσα as ἴἥουσα on the first line, it now gets the common word ἱερον
> (holy) correct, whereas in the original attempt, it rendered it as lepov,
> which is not a word in either language. If it's capable of correctly
> interpreting ἱερον, which is presumably in its dictionary, then I don't
> understand why, when I use eng+grc, it doesn't get it right.
>
> I tried cropping this sample so it was only the single word:
>
> [image: aa.jpg]
> When I read this using -l eng+grc, it gets it right. So it seems as though
> it's perfectly capable of both recognizing this word as Greek and properly
> OCRing it, but somehow it's reluctant to do so when some of the surrounding
> text is in English.
>
> So in summary, although there are some errors that may have to do with
> image quality or not being trained on this font, there is also some other
> kind of problem where tesseract doesn't like to "switch gears" from one
> language to the other.
>
> Can anyone help with diagnosing and/or fixing this problem?
>
> Could the issue have anything to do with the fact that the Latin letters
> are upright, while the Greek ones are in a slanted/italic font? Does the
> neural network have a preference for English because the English corpus it
> was trained on was so huge compared to the Greek one?

I took the liberty of re-running OCR for that item using the Archive.org
Tesseract stack, this time also specifying Greek as a language. This is the
result for the quoted paragraph; it's not perfect, but I think it is better
than what you are seeing:

> i SOMERS ODYSSEY.
>
>
> BOOK I.
>
>
> 1. Έννεπε declare µοι to me, Movca Muse,
> avdpa the man πολυτροπον of many fortunes,
> os who πλαγχθη wandered pada πολλα very
> much, eves when επερσεν he had destroyed
> ἱερον πτολιεθρον the sacred city Τροιης of Troy :
> we δε and saw αστεα towns και and εγνω
> learnt vooy the mood πολλων ανθρωπων of
> many men, πολλα δε αλγεα but many sorrows
> oye he indeed παθε suffered ὁν κατα θυµον in
> his soul, apyvper'os whilst grasping ἦν τε Wyn?
> both his own life και and νοστον the return erat.
> pov of his companions. Adda but ουδε not even
> ὡς thus ερρυσατο did he save έταρους his com-
> panions, iewevos περ though bent upon it:
> ολοντο yap for they perished σφετερησιν ατασ-
> σθαλιῃσι by their own phrensies, νηπιοι fools,
> οἱ who κατα ησθιον ate up βους the oxen
> Heduovo of the Sun ὝὙπεριονος who rolls above
> Us : autap but ὁ he αφειλετο took away Tors

I wonder if the problem you were seeing was related to using Ancient Greek
(grc) as opposed to Greek (ell)?

These are the parameters that were used just now:

> ocr_parameters -l eng+ell

Hope this helps.

Cheers,
Merlijn
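
To put a rough number on how often each run recognizes Greek as Greek, one option is to count the tokens in each OCR output by script. The snippet below is only a sketch under a few assumptions: the helper names `script_of` and `script_counts` are made up for illustration, and a token is classified purely by the Unicode names of its letters, which is a heuristic rather than anything Tesseract itself reports. The two sample strings are the same line of the page, copied from the eng+grc and eng+ell results quoted in this thread.

```python
import unicodedata


def script_of(token):
    """Classify one token as 'greek', 'latin', 'mixed', or 'other'.

    Hypothetical helper: it looks only at the Unicode character names
    of the alphabetic characters in the token.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK"):
            scripts.add("greek")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "other"  # no letters at all, e.g. "1." or ":"
    if len(scripts) > 1:
        return "mixed"  # letters from more than one script in one token
    return scripts.pop()


def script_counts(text):
    """Count whitespace-separated tokens in `text` by script."""
    counts = {"greek": 0, "latin": 0, "mixed": 0, "other": 0}
    for token in text.split():
        counts[script_of(token)] += 1
    return counts


# The same line of the page from the two runs quoted in this thread.
line_eng_grc = "lepov troAscOpor the sacred city Tons of Troy :"
line_eng_ell = "ἱερον πτολιεθρον the sacred city Τροιης of Troy :"

print("eng+grc:", script_counts(line_eng_grc))
print("eng+ell:", script_counts(line_eng_ell))
```

Running the same count over the full text output of each language setting gives a quick way to compare them. A high "mixed" count is also worth a look, since it points at tokens where Latin and Greek letters were combined within a single word.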

