Re: Tess v3 not recognising accented Esperanto characters.

Donaldo Mon, 01 Oct 2012 19:33:17 -0700

 

I downloaded njwtesstools, unpacked it and ran make on it.


*Test 1*

I ran tesseract to train it up on a few fonts. The txt files produced 
were full of blank characters. It seems to be important to separate the 
tokens in each file name with a hyphen. Eventually I settled on using lower 
case only in the font names everywhere. I chose just one font to test on.
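
A rough sketch of the naming rule as a check, assuming the pattern lang.fontname.expN.ext seen in the file names in this post. The function name valid_name is made up for illustration and is not part of tesseract or njwtesstools.

```shell
# Hypothetical helper: succeeds if a name looks like lang.fontname.expN.ext
# with a lower-case, hyphen-separated font name (the convention used here).
valid_name() {
    name=$1
    # Reject upper case, spaces, underscores and anything else unexpected.
    case "$name" in *[!a-z0-9.-]*) return 1 ;; esac
    # Require exactly four dot-separated fields.
    case "$name" in
        *.*.*.*.*) return 1 ;;
        *.*.*.*) ;;
        *) return 1 ;;
    esac
    lang=${name%%.*}; rest=${name#*.}
    font=${rest%%.*}; rest=${rest#*.}
    exp=${rest%%.*};  ext=${rest#*.}
    [ -n "$lang" ] && [ -n "$ext" ] || return 1
    # Font tokens are joined by single hyphens, none leading or trailing.
    case "$font" in ""|-*|*-|*--*) return 1 ;; esac
    # Third field is "exp" followed by digits only.
    case "$exp" in exp?*) ;; *) return 1 ;; esac
    digits=${exp#exp}
    case "$digits" in *[!0-9]*) return 1 ;; esac
    return 0
}
```

For example, valid_name epo.freeserif-regular.exp0.tif succeeds, while a name with upper case or an underscore in the font part fails.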

Running mftraining produced a shapetable file, which is not mentioned in the 
documentation, as well as epo.unicharset, pffmtable and inttemp; cntraining 
produced normproto. 

I took two shortish lists of Esperanto words I had and converted them to 
epo.freq-dawg and epo.word-dawg. I renamed the files with the prefix epo. and 
ran *combine_tessdata epo.*

*sudo cp epo.traineddata /usr/share/tesseract-ocr/tessdata/*
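
The renaming step can be scripted. This is only a sketch: add_prefix is a made-up helper, and the list of files is just the unprefixed outputs named above (epo.unicharset already carries the prefix).

```shell
# Hypothetical helper: give the unprefixed training outputs a language prefix
# so that combine_tessdata can pick them up.
add_prefix() {
    lang=$1
    dir=${2:-.}
    for f in shapetable pffmtable inttemp normproto; do
        # Rename only files that exist and are not already prefixed.
        if [ -f "$dir/$f" ] && [ ! -e "$dir/$lang.$f" ]; then
            mv "$dir/$f" "$dir/$lang.$f"
        fi
    done
}

# Typical use, from the training directory:
#   add_prefix epo .
#   combine_tessdata epo.
```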

I tested on a tif file of a page from a magazine called Monato. It 
recognised most of the accented letters, but gave some errors. Not bad for 
only one font. I need more fonts.


*Test 2*

I set out to use eight Freesans and Freeserif fonts but ran into a problem 
with the tif files produced:

*$ tesseract epo.freeserif-regular.exp0.tif epo.freeserif-regular.exp0 
nobatch box.train*

Tesseract Open Source OCR Engine v3.02 with Leptonica

Error in pixReadFromTiffStream: spp not in set {1,3,4}

Error in pixReadStreamTiff: pix not read

Error in pixReadStream: tiff: no pix returned

Error in pixRead: pix not read

Unsupported image type.


I found a comment on the tesseract-ocr group that it is better to use png 
files. (I thought that we had to use tif files, but that restriction seems 
to have gone in V3). I started again and generated png files.

*./lazytrain ../../epo.calib.txt freeserif-regular 
epo.freeserif-regular.exp0.png epo.freeserif-regular.exp0.box*

*tesseract epo.freeserif-regular.exp0.png epo.freeserif-regular.exp0 
nobatch box.train*

Same for the other seven fonts.
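
Rather than typing the two commands eight times, a loop can print them for every font. This is a sketch: train_cmds is a made-up helper, and it assumes the argument pattern of the lazytrain and tesseract commands above holds unchanged for the other fonts.

```shell
# The eight font names used in this post.
FONTS="freesans freesans-bold freesans-italic freesans-bold-italic
freeserif-regular freeserif-bold freeserif-italic freeserif-bold-italic"

# Hypothetical helper: print the two commands for one font.
train_cmds() {
    base="epo.$1.exp0"
    echo "./lazytrain ../../epo.calib.txt $1 $base.png $base.box"
    echo "tesseract $base.png $base nobatch box.train"
}

# Dry run: this only prints the commands. Pipe the output to sh to run them.
for font in $FONTS; do
    train_cmds "$font"
done
```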

Clustering:

*unicharset_extractor epo.freesans-bold.exp0.box 
epo.freesans-bold-italic.exp0.box epo.freesans.exp0.box 
epo.freesans-italic.exp0.box epo.freeserif-bold.exp0.box 
epo.freeserif-bold-italic.exp0.box epo.freeserif-italic.exp0.box 
epo.freeserif-regular.exp0.box*
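
The long argument lists here and in the following commands can be built from the font names. A minimal sketch, with train_files being a made-up helper that assumes every file is named epo.<font>.exp0.<ext>:

```shell
# Hypothetical helper: expand font names into epo.<font>.exp0.<ext> names.
# First argument is the extension (box or tr), the rest are font names.
train_files() {
    ext=$1; shift
    out=""
    for font in "$@"; do
        # Add a separating space only after the first entry.
        out="$out${out:+ }epo.$font.exp0.$ext"
    done
    echo "$out"
}

# Example (not run here):
#   unicharset_extractor $(train_files box freesans freesans-bold)
```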

font_properties file:

freesans 0 0 0 0 0
freesans-bold 0 1 0 0 0
freesans-italic 1 0 0 0 0
freesans-bold-italic 1 1 0 0 0
freeserif-regular 0 0 0 1 0
freeserif-bold 0 1 0 1 0
freeserif-italic 1 0 0 1 0
freeserif-bold-italic 1 1 0 1 0

*mftraining -F font_properties -U unicharset -O epo.unicharset 
epo.freesans-bold.exp0.tr epo.freesans-bold-italic.exp0.tr epo.freesans.exp0.tr 
epo.freesans-italic.exp0.tr epo.freeserif-bold.exp0.tr 
epo.freeserif-bold-italic.exp0.tr epo.freeserif-italic.exp0.tr 
epo.freeserif-regular.exp0.tr*

This generated a new shapetable file.

*cntraining epo.freesans-bold.exp0.tr 
epo.freesans-bold-italic.exp0.tr epo.freesans.exp0.tr 
epo.freesans-italic.exp0.tr epo.freeserif-bold.exp0.tr 
epo.freeserif-bold-italic.exp0.tr epo.freeserif-italic.exp0.tr 
epo.freeserif-regular.exp0.tr*
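
The font_properties lines follow the field order name, italic, bold, fixed, serif, fraktur. For font names like the ones used here they can be derived from the name itself. This is a sketch only: font_props is a made-up helper, and guessing the flags from the words "italic", "bold" and "serif" in the name is an assumption that will not hold for arbitrary fonts.

```shell
# Hypothetical helper: derive one font_properties line from a font name.
# Field order: name italic bold fixed serif fraktur.
# Fixed and fraktur are always 0 for the fonts in this post.
font_props() {
    italic=0; bold=0; serif=0
    case "$1" in *italic*) italic=1 ;; esac
    case "$1" in *bold*)   bold=1 ;; esac
    case "$1" in *serif*)  serif=1 ;; esac
    echo "$1 $italic $bold 0 $serif 0"
}

# Example (not run here):
#   for f in freesans freeserif-bold; do font_props "$f"; done > font_properties
```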

I renamed the files with the prefix epo. and ran *combine_tessdata epo.*

I copied epo.traineddata to the tessdata directory. This build has no 
dictionary files.

*sudo cp epo.traineddata /usr/share/tesseract-ocr/tessdata/*

Test: *tesseract ../monato.tif monato -l epo*

Results: 1.5% character errors. Most accented letters recognised. Frequent 
errors: l → I, e → c, il → ü, li → h, o → O

What should I do next? Dictionaries? I have a list of nearly 500,000 
Esperanto words. Is that too big? Ambigs? 

Regards

Donaldo

