I downloaded njwtesstools, unpacked it and ran make on it.
*Test 1*
I ran tesseract to train it up on a few fonts. The txt files produced
were full of blank characters. It seems to be important to separate the
tokens in each file name with a hyphen. Eventually I settled on using lower
case only in the font names everywhere. I chose just one font to test on.
Running mftraining produced a shapetable file, which is not mentioned in the
documentation, as well as epo.unicharset, pffmtable and inttemp; cntraining
produced normproto.
I took two shortish lists of Esperanto words I had and converted them to
epo.freq-dawg and epo.word-dawg. I renamed the files with the prefix epo.
and ran
*combine_tessdata epo.*
*sudo cp epo.traineddata /usr/share/tesseract-ocr/tessdata/*
I tested on a tif file of a page from a magazine called Monato. It
recognised most of the accented letters, but gave some errors. Not bad for
only one font. I need more fonts.
*Test 2*
I set out to use eight Freesans and Freeserif fonts but ran into a problem
with the tif files produced:
*$ tesseract epo.freeserif-regular.exp0.tif epo.freeserif-regular.exp0
nobatch box.train*
Tesseract Open Source OCR Engine v3.02 with Leptonica
Error in pixReadFromTiffStream: spp not in set {1,3,4}
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Unsupported image type.
I found a comment on the tesseract-ocr group saying that it is better to
use png files. (I thought that we had to use tif files, but that
restriction seems to have gone in V3.) I started again and generated png
files.
*./lazytrain ../../epo.calib.txt freeserif-regular
epo.freeserif-regular.exp0.png epo.freeserif-regular.exp0.box*
*tesseract epo.freeserif-regular.exp0.png epo.freeserif-regular.exp0
nobatch box.train*
Same for the other seven fonts.
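
A minimal sketch of how the two per-font steps above could be looped over all eight fonts. It only prints the commands (a dry run), so nothing is executed until the output is piped to sh. The FONTS variable and the train_cmds helper are names made up for this illustration; the lazytrain and tesseract arguments are copied from the commands above.

```shell
# Font names exactly as they appear in the file names (lower case, hyphens).
FONTS="freesans freesans-bold freesans-italic freesans-bold-italic freeserif-regular freeserif-bold freeserif-italic freeserif-bold-italic"

# Hypothetical helper: print the box-generation and box.train commands
# for one font, using the epo.<font>.exp0 naming scheme.
train_cmds() {
    base="epo.$1.exp0"
    echo "./lazytrain ../../epo.calib.txt $1 $base.png $base.box"
    echo "tesseract $base.png $base nobatch box.train"
}

# Dry run: list the commands for every font.
for font in $FONTS; do
    train_cmds "$font"
done
```

Appending `| sh` to the loop would run the printed commands in order, assuming lazytrain sits in the current directory as in the command above.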
Clustering:
*unicharset_extractor epo.freesans-bold.exp0.box
epo.freesans-bold-italic.exp0.box epo.freesans.exp0.box
epo.freesans-italic.exp0.box epo.freeserif-bold.exp0.box
epo.freeserif-bold-italic.exp0.box epo.freeserif-italic.exp0.box
epo.freeserif-regular.exp0.box*
font_properties file:
freesans 0 0 0 0 0
freesans-bold 0 1 0 0 0
freesans-italic 1 0 0 0 0
freesans-bold-italic 1 1 0 0 0
freeserif-regular 0 0 0 1 0
freeserif-bold 0 1 0 1 0
freeserif-italic 1 0 0 1 0
freeserif-bold-italic 1 1 0 1 0
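
The lines above follow the pattern <fontname> <italic> <bold> <fixed> <serif> <fraktur>. As a rough sketch, the same lines could be derived from the font names alone, assuming (as holds for these eight fonts) that "italic" and "bold" appear in the name and that every freeserif font is a serif font. The font_props helper is a made-up name for this illustration.

```shell
# Hypothetical helper: print one font_properties line for a font name.
# Columns: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
font_props() {
    name=$1
    italic=0; bold=0; serif=0
    case "$name" in *italic*) italic=1 ;; esac
    case "$name" in *bold*) bold=1 ;; esac
    case "$name" in freeserif*) serif=1 ;; esac
    # fixed and fraktur are 0 for all eight fonts used here.
    echo "$name $italic $bold 0 $serif 0"
}

# Print the whole file; redirect to font_properties to save it.
for font in freesans freesans-bold freesans-italic freesans-bold-italic \
            freeserif-regular freeserif-bold freeserif-italic freeserif-bold-italic; do
    font_props "$font"
done
```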
*mftraining -F font_properties -U unicharset -O epo.unicharset
epo.freesans-bold.exp0.tr epo.freesans-bold-italic.exp0.tr epo.freesans.exp0.tr
epo.freesans-italic.exp0.tr epo.freeserif-bold.exp0.tr
epo.freeserif-bold-italic.exp0.tr epo.freeserif-italic.exp0.tr
epo.freeserif-regular.exp0.tr*
This generated a new shapetable file.
*cntraining epo.freesans-bold.exp0.tr
epo.freesans-bold-italic.exp0.tr epo.freesans.exp0.tr
epo.freesans-italic.exp0.tr epo.freeserif-bold.exp0.tr
epo.freeserif-bold-italic.exp0.tr epo.freeserif-italic.exp0.tr
epo.freeserif-regular.exp0.tr*
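
Since mftraining and cntraining take the same eight .tr files, the list could be built once and reused. This is only a sketch: FONTS and TR_FILES are illustrative variable names, and the two commands are echoed rather than run, with the options copied from the mftraining call above.

```shell
FONTS="freesans freesans-bold freesans-italic freesans-bold-italic freeserif-regular freeserif-bold freeserif-italic freeserif-bold-italic"

# Build the space-separated list of .tr files from the font names.
TR_FILES=""
for font in $FONTS; do
    TR_FILES="$TR_FILES epo.$font.exp0.tr"
done
TR_FILES=${TR_FILES# }   # drop the leading space

# Dry run: show the two clustering commands with the shared file list.
echo mftraining -F font_properties -U unicharset -O epo.unicharset $TR_FILES
echo cntraining $TR_FILES
```

Removing the two `echo` words would run the commands for real.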
I renamed the files with the prefix epo. and ran *combine_tessdata epo.*
I copied epo.traineddata to the tessdata directory. This version has no
dictionary files.
*sudo cp epo.traineddata /usr/share/tesseract-ocr/tessdata/*
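
The renaming step can be sketched as a small loop. The file names are the ones mftraining and cntraining produced in Test 1 (epo.unicharset already carries the prefix because of the -O option); add_prefix is a made-up helper name, and files that do not exist are simply skipped.

```shell
# Hypothetical helper: add the epo. prefix to the training output files
# in the current directory, skipping any that are missing.
add_prefix() {
    for f in shapetable pffmtable inttemp normproto; do
        if [ -f "$f" ]; then
            mv "$f" "epo.$f"
        fi
    done
}

add_prefix
# Afterwards: combine_tessdata epo.
```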
Test: *tesseract ../monato.tif monato -l epo*
Results: 1.5% character errors. Most accented letters recognised. Frequent
errors: l → I, e → c, il → ü, li → h, o → O
What should I do next? Dictionaries? I have a list of nearly 500,000
Esperanto words. Is that too big? Ambigs?
Regards
Donaldo