Hi Ben,

On 13/05/2021 02:34, Ben Crowell wrote:
> Only 68% of Greek words are correctly recognized as Greek, and even of
> those, some are misread. Extremely common words like μοι, ὁς, and και are
> not recognized, although they are mostly recognized when I OCR the text
> with the language set only to Greek. So as far as I can tell, tesseract
> just can't really do this kind of bilingual text with a non-Latin font. Of
> course, there could be something I'm not understanding that would improve
> things.
>
> From descriptions I've read, it seems that tesseract's neural network is
> designed to try to scan large blocks of text at once, not just individual
> words. I suspect that this makes it unwilling to read Greek as Greek when
> it's surrounded by English. This would help to explain why it reads ὁς
> correctly when in Greek-only mode, but when in English+Greek mode, it reads
> it as os, which isn't even a word in the English dictionary I'm using.
>
> Training it on the book's Greek font may have done as much harm as good. It
> gets words like Μουσα right, which it got wrong before, but it makes errors
> on words like πολυτροπον and ανθρωπων, spelling them as πολυτροποιν and
> ανιθρωπων.

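For reference, the kind of figure quoted above (the share of Greek words that come out in Greek script) can be approximated with a few lines of Python. This is only a sketch and not necessarily how the numbers in this thread were produced: the function names are made up for illustration, and "Greek" is decided purely by Unicode character names, so an English word misread as Greek would also be counted.

```python
import unicodedata


def is_greek_word(token: str) -> bool:
    """Return True if every letter in the token is a Greek-script letter.

    Punctuation, digits and combining marks are ignored. Look-alike
    characters such as the micro sign (U+00B5) are NOT Greek by this test,
    so a token like "µαλα" that starts with a micro sign is rejected.
    """
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return False
    return all(unicodedata.name(ch, "").startswith("GREEK") for ch in letters)


def greek_recognition_rate(ocr_text: str, expected_greek_words: int) -> float:
    """Fraction of the expected Greek words that were emitted in Greek script.

    expected_greek_words is counted by hand from the source page
    (hypothetical parameter; the thread uses a hand count of 22).
    """
    if expected_greek_words <= 0:
        raise ValueError("expected_greek_words must be positive")
    found = sum(1 for token in ocr_text.split() if is_greek_word(token))
    return found / expected_greek_words
```

With a 22-word sample, 15 tokens in Greek script would work out to about 68%. Note that this only measures script detection, not whether each word is spelled correctly.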
One other avenue you could perhaps explore is to OCR the text in each
language separately, and somehow pick the words with the highest
confidence per word. I haven't tried this and do not know how feasible
it is.

Also - I am not sure if it helps, but you might want to consider filing
a bug report on GitHub:
https://github.com/tesseract-ocr/tesseract/issues

Cheers,
Merlijn

> On Monday, May 10, 2021 at 4:42:12 PM UTC-7 Ben Crowell wrote:
>
>> Here is a version of the text that I typeset using xelatex, with the
>> font DejaVu Serif. It has all the accents, which should make it a good
>> typographical match to the data that tesseract was trained on to make the
>> grc file.
>> [image: tex_output.png]
>> Here is the result:
>>
>> Ἔννεπε declare pot to me, Movoa Muse,
>>
>> ἄνδρα the man πολύτροπον of many fortunes,
>> oc who πλάγχθη wandered μάλα πολλὰ very
>> much, ἐπεὶ when émepoe he had destroyed
>> ἱερὸν πτολίεθρον the sacred city Τροίης of Troy:
>> ἴδε δε and saw ἄστεα towns Kai and ἔγνω
>> learnt voov the mood πολλῶν ἀνθρώπων of
>>
>> Now 73% of Greek words are recognized as Greek. So this is quite a bit
>> better, but still fairly poor. It seems really odd to me that tesseract is
>> not getting the common words μοι, ὃς, and καὶ. For comparison, it would be as
>> if tesseract was OCRing an English text and not being able to read "me,"
>> "who," and "and."
>> On Monday, May 10, 2021 at 3:20:47 PM UTC-7 Ben Crowell wrote:
>>
>>> I compiled tesseract from source, which gave me
>>> version 5.0.0-alpha-20210401-102-g4374, and used the latest grc.traineddata
>>> file. To get a measure of what's going on, I decided to count the number of
>>> Greek words rendered as Greek in the first 7 lines of this text, which
>>> contain 22 actual Greek words.
>>>
>>> tesseract 4.1.1, eng+grc -- 14% correct
>>>
>>> tesseract 5.0.0 on my machine, eng+grc -- 41% correct
>>>
>>> tesseract 5.0.0 on my machine, eng+ell -- 68% correct
>>>
>>> tesseract 5.0.0 on archive.org -- 55% correct
>>>
>>> Several things are similar in your results and mine. The incorrect
>>> scanning of ἱερον when surrounded by English words no longer seems to occur
>>> in 5.0.0. The word μοι is usually rendered incorrectly, but this may be
>>> because there seems to be broken type that causes the descender on the mu
>>> to be omitted. Μουσα is read incorrectly as Movca, which is probably
>>> because this personification of the Muse isn't in the dictionary.
>>>
>>> One thing that I hadn't noticed previously is that the accentuation in
>>> this text is weird. Although the 18th-century typesetter included the
>>> breathing marks, which aren't used in modern Greek, they left out all the
>>> acute, grave, and circumflex accents, which would usually have been
>>> included in a modern typesetting of an ancient Greek text. So it may be
>>> that the dictionary for grc is more appropriate, but the character
>>> recognition for ell is better here. I think this can be tested by
>>> typesetting the same 7 lines with and without accents.
>>> On Monday, May 10, 2021 at 7:34:34 AM UTC-7 Merlijn Wajer wrote:
>>>
>>>> Hi Ben,
>>>>
>>>> On 10/05/2021 15:09, Ben Crowell wrote:
>>>>> Hi Merlijn,
>>>>>
>>>>> Thanks very much for your reply. It's encouraging that you were able to get
>>>>> somewhat better results. However, I'm not able to reproduce them. When I
>>>>> use -l eng+ell, the results are still very poor:
>>>>>
>>>>> Evverre declare wot to me, Movca Muse,
>>>>> avopa the man voAvtpotrov of many fortunes,
>>>>> ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very
>>>>> much, eves when ewepoev he had destroyed
>>>>> i d city T { Troy:
>>>>> lepov troAscOpor the sacred city Tons of Troy :
>>>>> we Se and saw aorea towns «at and eyvo
>>>>> learnt vooy the mood πολλων ανθρωπων οἳ
>>>>>
>>>>> The text uses ancient Greek vocabulary and accentuation, so it actually
>>>>> makes sense to use grc, not ell.
>>>>
>>>> Ah, my bad.
>>>>
>>>>> I didn't understand what you meant by "using the Archive.org Tesseract
>>>>> stack," but a web search on your name led me to archive-pdf-tools, which
>>>>> you're the author of. It's great to have help from someone who's clearly
>>>>> very expert. I just don't know how to diagnose what is different between
>>>>> your setup and mine. It looks like you did the whole first page rather than
>>>>> the piece I posted, so there may be a difference in how the image was
>>>>> prepared. I just zoomed in on the archive.org page, took a screenshot,
>>>>> cropped it, and changed it to grayscale. I'm running tesseract 4.1.1, which
>>>>> seems to be the latest official release. Are you running a version compiled
>>>>> from the latest source or something? My file
>>>>> /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata, which came
>>>>> from installing the debian package tesseract-ocr-grc, is dated 2017, which
>>>>> seems old, and is 2.2 Mb. The version at
>>>>> https://github.com/tesseract-ocr/tessdata is 7 Mb and looks like it was
>>>>> changed around 2018. I could try just replacing the file with the newer
>>>>> version, but I have no idea whether that's a reasonable thing to do, since
>>>>> I don't know anything about how the software is designed.
>>>>
>>>> "using the Archive.org Tesseract stack" means that archive.org will
>>>> automatically run Tesseract OCR on uploaded content and make those
>>>> results available (so you can compare with your local results). Because
>>>> this book predates the integration of Tesseract, I submitted the content
>>>> for re-OCRing, using Tesseract, in an attempt to reproduce your results.
>>>>
>>>> I'm rerunning the item now with Ancient Greek "grc" as opposed to Greek
>>>> "ell".
>>>>
>>>> The version that is being used is Tesseract "5.0.0-alpha-20201231" [1],
>>>> the language packs are the latest ones from Git, I believe. Maybe it
>>>> would be worth giving the latest version a shot to see if it yields
>>>> better results. There is an Ubuntu PPA [2] with development
>>>> snapshots/versions. Then, if the latest version still gives
>>>> unsatisfying results, it would be worth trying to investigate why.
>>>>
>>>> Hope this helps,
>>>> Cheers,
>>>> Merlijn
>>>>
>>>> [1] https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-alpha-20201231
>>>>
>>>> [2] http://ppa.launchpad.net/alex-p/tesseract-ocr-devel

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7ed34596-d531-ae84-d514-5990a26cdb1c%40archive.org.
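P.S. The idea at the top of this message (OCR each language separately, then keep the more confident reading per word) might look roughly like the sketch below. It is untested against real Tesseract output and rests on several assumptions: the word boxes and confidences have already been parsed from two separate runs (for instance from Tesseract's TSV output, which carries left/top/width/height/conf/text columns), words are matched purely by bounding-box overlap, and the `Word` class, `merge_by_confidence` and the `min_overlap` threshold are names invented here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    conf: float   # per-word confidence reported by the OCR run (0-100)
    left: int
    top: int
    width: int
    height: int


def overlap_ratio(a: Word, b: Word) -> float:
    """Intersection area divided by the area of the smaller box."""
    ix = max(0, min(a.left + a.width, b.left + b.width) - max(a.left, b.left))
    iy = max(0, min(a.top + a.height, b.top + b.height) - max(a.top, b.top))
    smaller = min(a.width * a.height, b.width * b.height)
    return (ix * iy) / smaller if smaller > 0 else 0.0


def merge_by_confidence(primary: List[Word], secondary: List[Word],
                        min_overlap: float = 0.5) -> List[Word]:
    """For each word in primary, keep whichever overlapping reading is more confident.

    Words that appear only in secondary are dropped, so primary should be
    the run whose word segmentation is trusted more.
    """
    merged = []
    for word in primary:
        best = word
        for other in secondary:
            if overlap_ratio(word, other) >= min_overlap and other.conf > best.conf:
                best = other
        merged.append(best)
    return merged
```

One caveat: confidences from two different language models are not necessarily on a comparable scale, so a plain "higher number wins" rule may need a per-language offset or threshold in practice.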

