I am trying to cover as many of the Latin Unicode characters in the BMP 
as I can.

What I find is that as I add more characters, the OCR results get worse.

For example, instead of the correct ö I got Ö, and after I added more 
characters the latest result is Ṏ.

In other words, not only is it getting worse at detecting capitalization 
correctly, it is also favoring more complex characters over simpler 
ones! This is just one example; another is getting Ȧ instead of the 
correct A.
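
To make the confusion concrete, here is a minimal sketch (not part of my training setup) that groups characters by their undecorated base letter using Python's unicodedata module. The helper names base_letter and confusion_groups are made up for illustration. Every character that lands in the same group differs only by case or by diacritics, so each range I add puts more near-identical candidates into competition with the plain letter.

```python
import unicodedata
from collections import defaultdict


def base_letter(ch):
    """Return the first code point of the NFD decomposition, lowercased."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def confusion_groups(chars):
    """Group characters that share the same base letter."""
    groups = defaultdict(list)
    for ch in chars:
        groups[base_letter(ch)].append(ch)
    return dict(groups)


# The characters from the examples above, plus their plain forms.
groups = confusion_groups("o\u00f6\u00d6\u1e4eaA\u0226")
for base, members in sorted(groups.items()):
    print(base, " ".join(members))
```

Running the same grouping over the full training set would show how many classes compete for each base letter.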

When I train on a smaller set of characters I get better results (for the 
trained characters, of course; the others are missed completely).

Should I be building multiple, smaller traineddata files instead? This 
will reduce performance, but I need accuracy most of all. I've also had 
problems where confidence is reported as high on incorrect results and 
lower on correct results.
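
If I did split the training, my understanding is that the tesseract command line accepts several traineddata names joined with "+" in the -l option. A rough sketch of how I would build that command is below. The names glyph.png, glyph_out, latbasic and latext are hypothetical, and the snippet only prints the command rather than running it.

```python
import shlex


def tesseract_command(image, output_base, models):
    """Build a tesseract command line that loads several traineddata files.

    `models` are traineddata names without the .traineddata suffix.
    """
    if not models:
        raise ValueError("at least one traineddata name is required")
    # Tesseract takes multiple languages as a single "+"-joined -l argument.
    return ["tesseract", image, output_base, "-l", "+".join(models)]


# Hypothetical file and model names, for illustration only.
cmd = tesseract_command("glyph.png", "glyph_out", ["latbasic", "latext"])
print(" ".join(shlex.quote(part) for part in cmd))
```

Whether the combined result is actually more accurate than one large traineddata is exactly what I am unsure about.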

I'm using the latest Tesseract checkout on Ubuntu, with the tesstrain.sh 
script.

Linked below are the files I'm using, a sample image, and the 
traineddata, plus an example image that I run OCR on.

https://drive.google.com/folderview?id=0B5ebDnF6cn8UTVhBc25OOV9JYTg&usp=sharing

The Unicode ranges I am trying to train for at the moment are:

0000 - 007f Basic Latin
0080 - 00ff Latin-1 Supplement
0100 - 017f Latin Extended-A
0180 - 024f Latin Extended-B
1e00 - 1eff Latin Extended Additional
2500 - 2594 Box Drawing and Block Elements
fb00 - fb06 Ligatures
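
For reference, the ranges above can be expanded into an explicit character list with a short script. This is only a sketch of the idea, not the script I actually used: the name trainable_chars and the choice to skip unassigned, control, format, and separator code points are my own assumptions about what is worth rendering into training text.

```python
import unicodedata

# Inclusive code point ranges, copied from the list above.
RANGES = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2500, 0x2594),  # Box Drawing and Block Elements
    (0xFB00, 0xFB06),  # Ligatures
]


def trainable_chars(ranges):
    """Expand ranges into characters, skipping ones with no visible glyph."""
    chars = []
    for start, end in ranges:
        for cp in range(start, end + 1):
            ch = chr(cp)
            cat = unicodedata.category(ch)
            # Assumption: drop unassigned (Cn), control (Cc), format (Cf),
            # and separator (Z*) code points, since they render as nothing.
            if cat in ("Cn", "Cc", "Cf") or cat.startswith("Z"):
                continue
            chars.append(ch)
    return chars


chars = trainable_chars(RANGES)
print(len(chars), "code points to train")
```

The printed count gives a rough idea of how many classes the classifier has to separate.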

I am using the following fonts for training:
arial unicode ms
freeserif
liberation mono
liberation sans
liberation sans narrow condensed
liberation serif
segoe ui

I can certainly add more if that helps, but so far adding fonts just means 
it takes longer to realize how bad the trained data is.

If you are asking why I am doing this, it is because I am trying to create 
a language-agnostic solution. You can see a test image in the link above; 
I am only recognizing individual font glyphs, not doing full-page OCR.

Any suggestions/advice appreciated!






