I am trying to cover as many of the Latin Unicode characters in the BMP as I can.
What I find is that as I add more characters, the OCR results get worse. For example, instead of the correct ö I got Ö, and after I added more characters the latest result is Ṏ. In other words, it is not only getting worse at detecting capitalization, it is also favoring more complex characters over simpler ones! That is just one example; another is Ȧ where the correct result is A.

When I train on a smaller set of characters I get better results (for the trained characters, of course; the others are missed completely). Should I be building several smaller traineddata files instead? That will cost performance, but I need accuracy most of all. I have also had cases where the reported confidence is high on an incorrect result and lower on a correct one.

I'm using the latest Tesseract checkout on Ubuntu, with the tesstrain.sh script. The linked folder contains the files I'm using, a sample training image, the traineddata, and an example image that I OCR:

https://drive.google.com/folderview?id=0B5ebDnF6cn8UTVhBc25OOV9JYTg&usp=sharing

The Unicode ranges I am trying to train for at the moment are:

    0000 - 007f  Basic Latin
    0080 - 00ff  Latin-1 Supplement
    0100 - 017f  Latin Extended-A
    0180 - 024f  Latin Extended-B
    1e00 - 1eff  Latin Extended Additional
    2500 - 2594  Box Drawing and Block Elements
    fb00 - fb06  Ligatures

I am using the following fonts for training:

    Arial Unicode MS
    FreeSerif
    Liberation Mono
    Liberation Sans
    Liberation Sans Narrow Condensed
    Liberation Serif
    Segoe UI

I can certainly add more fonts if that helps, but so far adding fonts just means it takes longer to find out how bad the trained data is.

If you are wondering why I am doing this, it is because I am trying to create a language-agnostic solution. You can see a test image in the link above; note that I am only looking at individual font glyphs, not doing full-page OCR.

Any suggestions or advice appreciated!
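For reference, here is a minimal sketch of how the character set for the ranges in this post could be enumerated, and how the characters group by base letter (which is where my confusions such as ö / Ö / Ṏ and A / Ȧ fall). It uses only Python's standard unicodedata module. The function names (trainable_chars, base_letter, confusable_groups) are hypothetical helpers for illustration and are not part of Tesseract or tesstrain.sh. Skipping control, format, unassigned, and separator characters is my assumption about what is worth rendering into training text.

```python
import unicodedata

# Inclusive code point ranges, copied from the list in this post.
RANGES = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2500, 0x2594),  # Box Drawing and Block Elements
    (0xFB00, 0xFB06),  # Ligatures
]


def trainable_chars(ranges=RANGES):
    """Return the printable characters in the given ranges.

    Assumption: characters whose general category starts with C
    (control, format, unassigned, ...) or Z (separators such as
    space and no-break space) are not useful as training glyphs.
    """
    chars = []
    for start, end in ranges:
        for cp in range(start, end + 1):
            ch = chr(cp)
            if unicodedata.category(ch)[0] in ("C", "Z"):
                continue
            chars.append(ch)
    return chars


def base_letter(ch):
    """First code point of the canonical decomposition (NFD) of ch."""
    return unicodedata.normalize("NFD", ch)[0]


def confusable_groups(chars):
    """Group characters that share a base letter, ignoring case.

    Only groups with more than one member are returned, since those
    are the ones where a classifier can pick the wrong variant.
    """
    groups = {}
    for ch in chars:
        groups.setdefault(base_letter(ch).lower(), []).append(ch)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    chars = trainable_chars()
    groups = confusable_groups(chars)
    print(len(chars), "characters in total")
    # Show how many variants compete for a few common base letters.
    for key in ("a", "o"):
        print(key, len(groups[key]), "".join(groups[key]))
```

Printing the group sizes gives a rough idea of how many near-identical glyphs the classifier has to separate for each base letter, which might help decide how to split the set across several smaller traineddata files.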