Just a note: all the .git URLs listed below are git repositories, and there is no web interface to them on my server, so clone them directly, like this:

    git clone http://ancientgreekocr.org/mignetools.git

Nick

On Thu, Oct 29, 2015 at 06:23:21PM +0000, Nick White wrote:
> Hi all,
>
> I recently finally got around to organising and releasing some
> (well, a lot of) ground truth files for the language I have been
> training for ages now, Ancient Greek. By "ground truth" I mean real
> page scans with the corresponding (hand-typed) correct text, which
> is essential to be able to test the accuracy of OCR results.
>
> I thought it might be helpful or interesting for others to share how
> I went about it.
>
> In my case the best source was an old (public domain) book that I
> had the hand-typed text for, for which several different scans of
> the book existed. I then split the text to one file per page, and
> named it the same as the page scan file for that page, but with a
> .txt file extension.
>
> This book also had translations of the text in Latin, which I didn't
> want to preserve, so I selected only the Ancient Greek parts and
> stored their locations using the .uzn format. I did this using a
> little program I wrote a while ago that uses the Tesseract C-API to
> analyse the page layout of this type of book, select the relevant
> parts, and detect the language of each section, printing an uzn file
> describing them all. It is very specific to this type of book, but
> in case you're curious you can find it in migneuzn.c in the
> repository: http://ancientgreekocr.org/mignetools.git
>
> A while ago I forked a repository of the ISRI OCR evaluation tools
> to make them work easily with UTF-8, and included some helper
> scripts: http://ancientgreekocr.org/ocr-evaluation-tools.git
> Of particular relevance here is the 'tessaccsummary' script, which
> when given a directory of images and corresponding ground truth text
> and a .traineddata file will OCR each page and print the accuracy,
> and an average summary at the end. It is all quite basic, but very
> handy.
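
For anyone wondering what a per-page accuracy figure of this kind typically means, here is a rough Python sketch. It is not the code from the ocr-evaluation-tools repository. It assumes character accuracy is defined as (ground truth characters minus edit distance) divided by ground truth characters, and the function names are made up for illustration. Normalising both texts to NFC first is an added assumption, meant to stop precomposed and decomposed Greek accents from being counted as errors.

```python
import unicodedata


def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: inserts, deletes and substitutions each cost 1."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # delete from truth
                            curr[j - 1] + 1,      # insert into truth
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: (n - errors) / n, with n the ground truth length.

    The result can drop below zero when the OCR output contains many
    spurious characters.
    """
    # Assumption: compare NFC-normalised text so accent encoding differences
    # are not counted as OCR errors.
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    if not truth:
        return 1.0 if not ocr else 0.0
    return (len(truth) - edit_distance(truth, ocr)) / len(truth)
```

Calling char_accuracy() with the text of one ground truth .txt file and the OCR output for the matching page scan gives a single page's figure; how the per-page figures are combined into a summary is left open here.
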
>
> I decided to store the ground truth files in a git repository; while
> in some ways it isn't an ideal way to store lots of binary files
> (like page scans), actually the page scans are never likely to
> change, so the size won't get out of hand as it would if the binary
> files changed regularly, so I think it's fine. That said it is about
> 4.5GiB on disk. The ground truth repository is at
> http://ancientgreekocr.org/grcground.git but as I say it's pretty
> massive, so please don't clone it unless you think you'll actually
> at least look at it, as the bandwidth will cost me :)
>
> I think it would be really good if others interested in other
> languages collected and shared some ground truth files. The more
> rigorous testing we do of our OCR training files, the better our
> results will end up being. I am working with Latin OCR now, so will
> probably do something similar for that soon. Is anyone else
> interested?
>
> Nick
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/20151029182321.GB4904%40manta.lan.
> For more options, visit https://groups.google.com/d/optout.
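
For anyone putting together a similar ground truth set, a quick sanity check is to confirm that every page scan has a .txt file of the same name, as described in the quoted message. The Python sketch below is only an illustration: the function name and the list of image extensions are assumptions, not anything taken from the repositories above.

```python
from pathlib import Path

# Assumption: page scans use one of these extensions; adjust to suit.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def unpaired_files(directory):
    """Return (scans lacking a .txt, .txt files lacking a scan) as sorted stems."""
    root = Path(directory)
    files = [p for p in root.iterdir() if p.is_file()]
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_SUFFIXES}
    texts = {p.stem for p in files if p.suffix.lower() == ".txt"}
    # Other files sharing a stem (for example .uzn files) are ignored.
    return sorted(images - texts), sorted(texts - images)
```

Both returned lists should be empty for a directory that is ready to be used for accuracy testing.
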

