[tesseract-ocr] Training data gets worse as I add characters

Ryan Dev Fri, 21 Nov 2014 17:42:04 -0800

I am trying to cover as many of the Latin Unicode characters in the BMP
as I can.

What I find is that as I add more characters, the OCR results get worse.

For example, instead of getting the correct ö I get Ö and then as I added
more characters the latest result is Ṏ.

In other words, not only is it getting worse at detecting capitalization
correctly, but it is favoring more complex characters over the simpler
solutions! This is just one example; another is getting Ȧ instead of the
correct A.
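
One way to make the "more complex character" pattern concrete is to decompose
each result with Unicode normalization. The sketch below uses only Python's
standard unicodedata module; the helper name describe is made up for
illustration and is not part of Tesseract. It only inspects characters and does
nothing to change the OCR output.

```python
import unicodedata


def describe(ch):
    """Split a character into its base letter and the names of its combining marks.

    Hypothetical helper for illustration: uses canonical decomposition (NFD).
    """
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    # Everything after the base that has a non-zero combining class is a mark.
    marks = [unicodedata.name(c) for c in decomposed[1:] if unicodedata.combining(c)]
    return base, marks


if __name__ == "__main__":
    # Expected character first, then the misrecognitions reported above.
    for ch in ["\u00f6", "\u00d6", "\u1e4e", "A", "\u0226"]:
        base, marks = describe(ch)
        print(f"U+{ord(ch):04X} {ch}: base={base!r}, marks={marks}")
```

Seen this way, ö and Ö differ only in the case of the base letter, while Ṏ
carries an extra tilde and Ȧ carries a dot above that the expected A does not
have.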

When I run a smaller set of training data I get better results (for the
trained characters; the others are, of course, missed completely).

Should I be trying to build multiple, smaller traineddata files? This will
reduce performance, but I need accuracy most of all. Plus, I've had problems
where confidence is reported high on incorrect results and lower on correct
results.

I'm using the latest Tesseract checkout on Ubuntu, with the tesstrain.sh
script.

Linked below are the files I'm using: a sample image, the traineddata, and an
example image I run OCR on.

https://drive.google.com/folderview?id=0B5ebDnF6cn8UTVhBc25OOV9JYTg&usp=sharing

The Unicode ranges I am trying to train for at the moment are:

0000 - 007f Basic Latin
0080 - 00ff Latin-1 Supplement
0100 - 017f Latin Extended-A
0180 - 024f Latin Extended-B
1e00 - 1eff Latin Extended Additional
2500 - 2594 Box Drawing and Block Elements
fb00 - fb06 Ligatures
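
For reference, the ranges above can be expanded into a concrete character list
with a few lines of Python. This is only a sketch: the names RANGES, SKIPPED and
trainable_chars are hypothetical, and the decision to skip control, format,
unassigned and separator code points is an assumption, not something
tesstrain.sh requires.

```python
import unicodedata

# Inclusive code point ranges copied from the list above.
RANGES = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2500, 0x2594),  # Box Drawing and Block Elements
    (0xFB00, 0xFB06),  # Ligatures
]

# Assumption: code points in these general categories have no visible glyph
# worth training on (controls, format characters, unassigned, separators).
SKIPPED = {"Cc", "Cf", "Cs", "Co", "Cn", "Zs", "Zl", "Zp"}


def trainable_chars(ranges=RANGES):
    """Return the characters in the given ranges that are not in SKIPPED."""
    chars = []
    for start, end in ranges:
        for cp in range(start, end + 1):
            ch = chr(cp)
            if unicodedata.category(ch) not in SKIPPED:
                chars.append(ch)
    return chars


if __name__ == "__main__":
    chars = trainable_chars()
    print(len(chars), "candidate characters")
```

Printing the count gives a quick sense of how many classes the classifier is
being asked to separate, many of which differ only by a small diacritic.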

I am using the following fonts for training:
Arial Unicode MS
FreeSerif
Liberation Mono
Liberation Sans
Liberation Sans Narrow Condensed
Liberation Serif
Segoe UI

I can certainly add more if that helps, but so far adding fonts just means
it takes longer to realize how bad the trained data is.

If you are asking why I am doing this, it is because I am trying to create
a language-agnostic solution. You can see a test image in the link above,
and can see I am only looking at font glyphs, not full-page OCR.

Any suggestions/advice appreciated!
