[tesseract-ocr] Ground truth files

Nick White Thu, 29 Oct 2015 11:24:08 -0700

Hi all,

I finally got around to organising and releasing some (well, a lot 
of) ground truth files for Ancient Greek, the language I have been 
training for ages now. By "ground truth" I mean real page scans with 
the corresponding (hand-typed) correct text, which are essential for 
testing the accuracy of OCR results.


I thought it might be helpful or interesting for others to share how 
I went about it.

In my case the best source was an old (public domain) book for which 
I had the hand-typed text, and for which several different scans 
existed. I then split the text into one file per page, and named each 
file the same as the page scan file for that page, but with a .txt 
file extension.
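
A minimal sketch of that naming step, assuming the text has already been split into one string per page and the scans are listed in the same order. The function name and the example file names are hypothetical; this is not the script actually used for the release.

```python
from pathlib import Path


def write_page_texts(page_texts, scan_paths, out_dir):
    """Write one ground truth .txt file per page, named after its scan."""
    if len(page_texts) != len(scan_paths):
        raise ValueError("need exactly one text per page scan")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for text, scan in zip(page_texts, scan_paths):
        # e.g. p001.png -> p001.txt (same base name, .txt extension)
        target = out_dir / Path(scan).with_suffix(".txt").name
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Called as write_page_texts(texts, ["p001.png", "p002.png"], "truth"), it would leave truth/p001.txt and truth/p002.txt next to whatever scans you pair them with.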

This book also had translations of the text in Latin, which I didn't 
want to preserve, so I selected only the Ancient Greek parts and 
stored their locations using the .uzn format. I did this using a 
little program I wrote a while ago that uses the Tesseract C API to 
analyse the page layout of this type of book, select the relevant 
parts, and detect the language of each section, printing a uzn file 
describing them all. It is very specific to this type of book, but 
in case you're curious you can find it in migneuzn.c in the 
repository: http://ancientgreekocr.org/mignetools.git
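
For anyone who hasn't met the format, here is a rough sketch of writing a uzn file by hand. It assumes the usual layout of one zone per line, "left top width height label", with coordinates in pixels. The function name and the "Greek" label are made up for illustration, and this is not how migneuzn.c itself works.

```python
def format_uzn(zones):
    """Return uzn text for an iterable of (left, top, width, height, label)."""
    # One zone per line: four integers followed by a free-form label.
    return "".join(
        f"{left} {top} {width} {height} {label}\n"
        for left, top, width, height, label in zones
    )


if __name__ == "__main__":
    # Two hypothetical Greek columns on a page; the Latin ones are simply left out.
    print(format_uzn([(120, 200, 800, 2400, "Greek"),
                      (1900, 200, 800, 2400, "Greek")]), end="")
```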

A while ago I forked a repository of the ISRI OCR evaluation tools 
to make them work easily with UTF-8, and included some helper 
scripts: http://ancientgreekocr.org/ocr-evaluation-tools.git
Of particular relevance here is the 'tessaccsummary' script, which, 
when given a directory of images with corresponding ground truth 
text and a .traineddata file, will OCR each page and print its 
accuracy, followed by an average summary at the end. It is all quite 
basic, but very handy.
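
To give an idea of what a per-page accuracy plus an average looks like, here is a small sketch. It is an assumption-laden stand-in, not the tessaccsummary script or the ISRI tools: it takes the OCR output as strings that you have already produced, and it measures character accuracy as (characters - edit distance) / characters, which may differ in detail from what the ISRI tools report. All function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr):
    """Fraction of ground truth characters left after counting errors."""
    if not truth:
        return 1.0 if not ocr else 0.0
    # Can go negative if the OCR output is much longer than the truth.
    return (len(truth) - edit_distance(truth, ocr)) / len(truth)


def summarise(pages):
    """Print accuracy per (name, truth, ocr) page and return the mean."""
    scores = [(name, char_accuracy(truth, ocr)) for name, truth, ocr in pages]
    for name, acc in scores:
        print(f"{name}\t{acc:.2%}")
    mean = sum(acc for _, acc in scores) / len(scores) if scores else 0.0
    print(f"average\t{mean:.2%}")
    return mean
```

Note that the average here is a plain mean over pages, so a short page counts as much as a long one; weighting by character count would be an equally reasonable choice.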

I decided to store the ground truth files in a git repository. In 
some ways that isn't an ideal way to store lots of binary files 
(like page scans), but the page scans are never likely to change, so 
the size won't get out of hand as it would if the binary files 
changed regularly, and I think it's fine. That said, it is about 
4.5GiB on disk. The ground truth repository is at 
http://ancientgreekocr.org/grcground.git but, as I say, it's pretty 
massive, so please don't clone it unless you think you'll actually 
at least look at it, as the bandwidth will cost me :)

I think it would be really good if others interested in other 
languages collected and shared some ground truth files. The more 
rigorous testing we do of our OCR training files, the better our 
results will end up being. I am working with Latin OCR now, so will 
probably do something similar for that soon. Is anyone else 
interested?

Nick
