Hi all,

I finally got around to organising and releasing some (well, a lot of) ground truth files for Ancient Greek, the language I have been training for ages now. By "ground truth" I mean real page scans with the corresponding hand-typed correct text, which is essential for testing the accuracy of OCR results.
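
To make "testing the accuracy" concrete, here is a minimal sketch of a character accuracy measure between a ground truth page and an OCR result. It is not the algorithm the ISRI tools use, just a plain Levenshtein-based approximation, and the function names are made up for illustration. The NFC normalisation step is my own assumption: polytonic Greek can be encoded either precomposed or with combining accents, and without normalising, two visually identical strings would count as errors.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """(n - errors) / n, where n is the ground truth length in characters."""
    # Assumption: compare NFC-normalised text so that precomposed and
    # decomposed accents are treated as the same character.
    truth = unicodedata.normalize("NFC", ground_truth)
    ocr = unicodedata.normalize("NFC", ocr_output)
    if not truth:
        return 1.0 if not ocr else 0.0
    return (len(truth) - edit_distance(truth, ocr)) / len(truth)
```

Note that this can go negative if the OCR output is much longer than the ground truth, since every extra character counts as an error.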
I thought it might be helpful or interesting for others if I shared how I went about it. In my case the best source was an old (public domain) book that I had the hand-typed text for, and for which several different scans existed. I split the text into one file per page, and named each file the same as the page scan file for that page, but with a .txt file extension.

This book also had translations of the text in Latin, which I didn't want to preserve, so I selected only the Ancient Greek parts and stored their locations using the .uzn format. I did this using a little program I wrote a while ago that uses the Tesseract C-API to analyse the page layout of this type of book, select the relevant parts, and detect the language of each section, printing a uzn file describing them all. It is very specific to this type of book, but in case you're curious you can find it in migneuzn.c in the repository:
http://ancientgreekocr.org/mignetools.git

A while ago I forked a repository of the ISRI OCR evaluation tools to make them work easily with UTF-8, and included some helper scripts:
http://ancientgreekocr.org/ocr-evaluation-tools.git

Of particular relevance here is the 'tessaccsummary' script, which, when given a directory of images with corresponding ground truth text and a .traineddata file, will OCR each page and print its accuracy, followed by an average summary at the end. It is all quite basic, but very handy.

I decided to store the ground truth files in a git repository. In some ways git isn't an ideal way to store lots of binary files (like page scans), but the page scans are never likely to change, so the size won't get out of hand as it would if the binary files changed regularly. I think it's fine. That said, it is about 4.5GiB on disk.
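
The file layout described above (each page scan next to a same-named .txt file, plus a uzn file listing the zones to keep) can be sketched roughly as follows. This is only an illustration, not code from either repository. The function names and the set of image extensions are hypothetical, and the zone format is an assumption based on my understanding that each uzn line holds left, top, width, height and a label, separated by whitespace.

```python
from pathlib import Path

# Assumed set of scan formats; adjust to whatever the scans actually use.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def pair_ground_truth(directory):
    """Return (image, text) path pairs for scans with a same-named .txt file."""
    pairs = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = image.with_suffix(".txt")  # e.g. 0001.png -> 0001.txt
        if text.is_file():
            pairs.append((image, text))
    return pairs


def parse_uzn(contents):
    """Parse uzn text into (left, top, width, height, label) tuples.

    Assumption: one zone per line, four integers then an optional label.
    Lines with fewer than four fields are skipped.
    """
    zones = []
    for line in contents.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        left, top, width, height = (int(f) for f in fields[:4])
        label = fields[4] if len(fields) > 4 else ""
        zones.append((left, top, width, height, label))
    return zones
```

Scans without a matching .txt file are silently skipped, which is convenient when only part of a book has been typed up, but you may prefer to report them instead.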
The ground truth repository is at http://ancientgreekocr.org/grcground.git but as I say it's pretty massive, so please don't clone it unless you think you'll actually at least look at it, as the bandwidth will cost me :)

I think it would be really good if others interested in other languages collected and shared some ground truth files. The more rigorous testing we do of our OCR training files, the better our results will end up being. I am working with Latin OCR now, so will probably do something similar for that soon. Is anyone else interested?

Nick

