Re: [tesseract-ocr] Ground truth files

Nick White Thu, 29 Oct 2015 11:30:13 -0700

Just a note: all the .git URLs listed below are git repositories, 
and there isn't a web interface to them on my server, so just clone 
them directly like this:


  git clone http://ancientgreekocr.org/mignetools.git

Nick

On Thu, Oct 29, 2015 at 06:23:21PM +0000, Nick White wrote:
> Hi all,
> 
> I finally got around to organising and releasing some (well, a lot 
> of) ground truth files for Ancient Greek, the language I have been 
> training for ages now. By "ground truth" I mean real page scans with 
> the corresponding (hand-typed) correct text, which are essential for 
> testing the accuracy of OCR results.
> 
> I thought it might be helpful or interesting for others to share how 
> I went about it.
> 
> In my case the best source was an old (public domain) book for 
> which I had the hand-typed text, and for which several different 
> scans existed. I then split the text into one file per page, and 
> named each file the same as the page scan file for that page, but 
> with a .txt file extension.
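
The naming convention above (one .txt per page, named after the scan) is easy to check mechanically. Here is a minimal Python sketch, not part of the original tools, that pairs each scan with its ground truth file and lists scans that have none. The set of image extensions and the function name are assumptions for illustration; the post does not say which image format the scans use.

```python
from pathlib import Path

# Assumed scan extensions; the original post does not name the image format.
IMAGE_EXTENSIONS = {".png", ".tif", ".tiff", ".jpg"}


def pair_scans_with_text(directory):
    """Return (pairs, missing) for a directory of page scans.

    pairs   -- list of (scan_path, text_path) where a same-named .txt exists
    missing -- list of scan paths that have no matching .txt file
    """
    pairs, missing = [], []
    for scan in sorted(Path(directory).iterdir()):
        if scan.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip the .txt files themselves and anything else
        truth = scan.with_suffix(".txt")
        if truth.is_file():
            pairs.append((scan, truth))
        else:
            missing.append(scan)
    return pairs, missing
```

Running this over a ground truth directory before testing should catch pages whose text file was misnamed or never created.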
> 
> This book also had translations of the text in Latin, which I didn't 
> want to preserve, so I selected only the Ancient Greek parts and 
> stored their locations using the .uzn format. I did this using a 
> little program I wrote a while ago that uses the Tesseract C-API to 
> analyse the page layout of this type of book, select the relevant 
> parts, and detect the language of each section, printing a .uzn file 
> describing them all. It is very specific to this type of book, but 
> in case you're curious you can find it in migneuzn.c in the 
> repository: http://ancientgreekocr.org/mignetools.git
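
For anyone who hasn't seen a .uzn file: as far as I know it is plain text with one zone per line, giving left, top, width and height in pixels followed by a label. Treat that layout as an assumption and check it against your Tesseract version. The Python sketch below only shows how such a file could be written. It is not how migneuzn.c works, and the function names, zone coordinates and labels are made up for illustration.

```python
def uzn_line(left, top, width, height, label):
    """Format one zone as a line: left top width height label (assumed layout)."""
    if width <= 0 or height <= 0:
        raise ValueError("zone must have positive width and height")
    # Write the label as a single token by replacing any whitespace.
    return f"{left} {top} {width} {height} {'_'.join(label.split())}"


def write_uzn(path, zones):
    """Write an iterable of (left, top, width, height, label) zones to path."""
    with open(path, "w", encoding="utf-8") as f:
        for zone in zones:
            f.write(uzn_line(*zone) + "\n")


# Hypothetical zones for one page: keep the Greek column, skip the Latin one.
example_zones = [(120, 300, 900, 2400, "Greek")]
```

Calling write_uzn("page0001.uzn", example_zones) would then produce a one-line file describing the single zone to keep.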
> 
> A while ago I forked a repository of the ISRI OCR evaluation tools 
> to make them work easily with UTF-8, and included some helper 
> scripts: http://ancientgreekocr.org/ocr-evaluation-tools.git
> Of particular relevance here is the 'tessaccsummary' script, which 
> when given a directory of images and corresponding ground truth text 
> and a .traineddata file will OCR each page and print the accuracy, 
> and an average summary at the end. It is all quite basic, but very 
> handy.
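
To give a rough idea of what "print the accuracy, and an average summary" means, here is a small Python sketch. It is not the tessaccsummary script and does not call the ISRI tools or Tesseract. It assumes character accuracy is defined as (characters - errors) / characters, with errors counted as the Levenshtein edit distance between the ground truth and the OCR output, and it assumes the summary is a plain mean of the per-page figures. All function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def char_accuracy(truth, ocr):
    """Percentage accuracy of ocr against truth; can be negative for very bad output."""
    if not truth:
        raise ValueError("ground truth is empty")
    return 100.0 * (len(truth) - edit_distance(truth, ocr)) / len(truth)


def summarise(pages):
    """Print per-page accuracy and the mean for (name, truth, ocr) tuples."""
    scores = []
    for name, truth, ocr in pages:
        score = char_accuracy(truth, ocr)
        scores.append(score)
        print(f"{name}: {score:.2f}%")
    if not scores:
        raise ValueError("no pages given")
    mean = sum(scores) / len(scores)
    print(f"average: {mean:.2f}%")
    return mean
```

The truth and ocr strings would come from the .txt ground truth files and from whatever text the OCR run produced for the matching scan.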
> 
> I decided to store the ground truth files in a git repository. In 
> some ways git isn't an ideal way to store lots of binary files 
> (like page scans), but the page scans are never likely to change, 
> so the size won't get out of hand as it would if the binary files 
> changed regularly. I think it's fine. That said, it is about 
> 4.5GiB on disk. The ground truth repository is at 
> http://ancientgreekocr.org/grcground.git but as I say it's pretty 
> massive, so please don't clone it unless you think you'll actually 
> at least look at it, as the bandwidth will cost me :)
> 
> I think it would be really good if others interested in other 
> languages collected and shared some ground truth files. The more 
> rigorous testing we do of our OCR training files, the better our 
> results will end up being. I am working with Latin OCR now, so will 
> probably do something similar for that soon. Is anyone else 
> interested?
> 
> Nick
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/tesseract-ocr/20151029182321.GB4904%40manta.lan.
> For more options, visit https://groups.google.com/d/optout.

